PLINK: Whole genome data analysis toolset plink...
Latest PLINK release is v1.03 (10-Jun-2008)

Whole genome association analysis toolset


Epistasis

For disease-trait population-based samples, it is possible to test for epistasis. The epistasis test can be either case-only or case-control. All pairwise combinations of SNPs can be tested: although this may or may not be desirable in statistical terms, it is computationally feasible for moderate datasets using PLINK. For example, the 4.5 billion two-locus tests generated from a 100K data set took just over 24 hours to run for approximately 500 individuals (with the --fast-epistasis command). Alternatively, sets can be specified (e.g. to test only the most significant 100 SNPs against all other SNPs, or against themselves). The output consists only of pairwise epistatic results above a certain significance value; in addition, for each SNP, a summary of all the pairwise epistatic tests is given (e.g. maximum test statistic, proportion of tests significant at a certain threshold). To test for gene-by-environment interaction, see either the section on stratified analyses for disease traits, or the section on QTL GxE for quantitative traits.
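
As a rough check on the scale of an ALL x ALL scan, the number of unique SNP pairs among n SNPs is n(n-1)/2. The sketch below is illustrative Python, not part of PLINK. The function name and the SNP counts are assumptions, chosen to match the "100K" figure quoted above, which is presumably approximate.

```python
def n_unique_pairs(n_snps):
    """Number of unordered SNP pairs among n_snps SNPs: n(n-1)/2."""
    return n_snps * (n_snps - 1) // 2

# Exactly 100,000 SNPs would give just under 5 billion pairs
print(n_unique_pairs(100_000))  # 4999950000

# About 95,000 SNPs gives roughly the 4.5 billion tests quoted above
print(n_unique_pairs(95_000))   # 4512452500
```

Because the count grows quadratically, doubling the number of SNPs roughly quadruples the number of tests, which is why restricting the scan to sets can matter.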

IMPORTANT! These tests for epistasis are currently only applicable for population-based samples, not family-based.

SNP x SNP epistasis

To test SNP x SNP epistasis for case/control population-based samples, use the command
plink --file mydata --epistasis

which will send output to the files
     plink.epi.cc
     plink.epi.cc.summary
where cc = case-control; for quantitative traits, cc will be replaced by qt.

The default test uses either linear or logistic regression, depending on whether the phenotype is a quantitative or binary trait. PLINK builds a model based on the allele dosage for each of the two SNPs, A and B, and fits the model
     Y ~ b0 + b1.A + b2.B + b3.AB + e 
The test for interaction is based on the coefficient b3.
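
To make the model concrete, here is a minimal sketch of the linear (quantitative trait) case in plain Python. It is not PLINK's implementation: the helper names, the toy data and the coefficient values are all assumptions made for illustration, and it only estimates b3, without the standard error or p-value a real test would need. For a binary trait the corresponding fit would be a logistic regression, which is not shown here.

```python
def design_row(a, b):
    """Design-matrix row [1, A, B, A*B] for allele dosages a and b (0, 1 or 2)."""
    return [1.0, float(a), float(b), float(a * b)]

def least_squares(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    M = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry onto the diagonal
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

# Hypothetical noise-free data: one individual per genotype combination
true_beta = [1.0, 0.5, -0.2, 0.3]  # assumed b0, b1, b2, b3
X = [design_row(a, b) for a in range(3) for b in range(3)]
y = [sum(t * x for t, x in zip(true_beta, row)) for row in X]

b0, b1, b2, b3 = least_squares(X, y)
print(round(b3, 6))  # estimated interaction coefficient
```

With noise-free data the fit recovers the assumed coefficients, so b3 comes back as the interaction effect that was put in.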

Hint For disease traits only, an approximate but faster method can be used to screen for epistasis: use the --fast-epistasis command instead of --epistasis. This test is based on a Z-score for the difference in SNP1-SNP2 association (odds ratio) between cases and controls (or in cases only, in a case-only analysis). If you use this to screen a large number of SNPs, you should probably also report the more standard logistic regression test value. In practice, both approaches usually give similar results, which justifies the use of --fast-epistasis as a screening tool for a computationally demanding problem. Of course, given a specific (and often extreme) --epi1 threshold, the exact list of above-threshold SNP pairs will not always be the same. If you choose this approach, it is probably wise to use it with a reasonably liberal --epi1 threshold to select a subset of SNP pairs, and then test those pairs with the more standard --epistasis command.
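
The idea of a Z-score for a difference in odds ratios, as described in the hint above, can be sketched as follows. This is an illustration under stated assumptions, not PLINK's actual computation: it assumes each group is summarised by a 2x2 table of counts (a, b, c, d) for SNP1 by SNP2, and it uses the common approximation that the variance of a log odds ratio is the sum of the reciprocal cell counts. The function names and the counts are made up.

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio of a 2x2 table and its approximate variance."""
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def z_difference(case_table, control_table):
    """Z-score for the difference in log odds ratio between two groups."""
    lor_case, var_case = log_odds_ratio(*case_table)
    lor_ctrl, var_ctrl = log_odds_ratio(*control_table)
    return (lor_case - lor_ctrl) / math.sqrt(var_case + var_ctrl)

# Hypothetical counts: association between the SNPs in cases, none in controls
cases = (40, 10, 10, 40)
controls = (25, 25, 25, 25)
print(z_difference(cases, controls))
```

A Z-score near zero means the SNP1-SNP2 association looks the same in both groups; a large absolute value suggests it differs, which is what the screen is looking for.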

Important The --epistasis command is set up for testing a potentially very large number of SNP by SNP comparisons, most of which would not be significant or of interest. Because the output may contain millions or billions of lines, the default is to output only tests with p-values less than 1e-4, as specified by the --epi1 option (see below). If your dataset is much smaller and you definitely want to see all the output, add --epi1 1. If you do not, odds are you'll see an output file that is blank except for the header (i.e. none of the tests were significant at 1e-4).

Specifying which SNPs to test

There are different modes for specifying which SNPs are tested:
ALL x ALL
plink --file mydata --epistasis

SET1 x SET1  { where epi.set contains only 1 set }
plink --file mydata --epistasis --set-test --set epi.set

SET1 x ALL  { where epi.set contains only 1 set } 
plink --file mydata --epistasis --set-test --set epi.set --set-by-all

SET1 x SET2  { where epi.set contains 2 sets }  
plink --file mydata --epistasis --set-test --set epi.set

For the 'symmetrical' cases (ALLxALL and SET1xSET1), only unique pairs are analysed.

For the other two cases (SET1xALL, SET1xSET2), all pairs are analysed (e.g. PLINK will perform SNPA x SNPB as well as SNPB x SNPA, if A and B are in both SET1 and SET2). It will not try to analyse SNPA x SNPA, however.
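
The difference between the symmetrical and non-symmetrical modes can be illustrated with a short Python sketch. This only mirrors the pairing rules described above and is not PLINK code; the function names and SNP identifiers are hypothetical.

```python
from itertools import combinations, product

def symmetric_pairs(snps):
    """ALL x ALL or SET1 x SET1: each unordered pair appears once."""
    return list(combinations(snps, 2))

def cross_pairs(first, second):
    """SET1 x ALL or SET1 x SET2: every ordered pair, never a SNP with itself."""
    return [(a, b) for a, b in product(first, second) if a != b]

set1 = ["rsA", "rsB"]
set2 = ["rsA", "rsB", "rsC"]

print(symmetric_pairs(set1))
# [('rsA', 'rsB')]
print(cross_pairs(set1, set2))
# [('rsA', 'rsB'), ('rsA', 'rsC'), ('rsB', 'rsA'), ('rsB', 'rsC')]
```

Note that rsA x rsB and rsB x rsA both appear in the second list because both SNPs are in both sets, while rsA x rsA and rsB x rsB are skipped.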

The output

The output can be controlled via
plink --file mydata --epistasis --epi1 0.0001

which means that only results with p <= 0.0001 are recorded. (This prevents too much output from being generated.) The output is in the form
     CHR1    Chromosome of first SNP
     SNP1    Identifier for first SNP
     CHR2    Chromosome of second SNP
     SNP2    Identifier for second SNP
     OR_INT  Odds ratio for interaction
     Z       Z score for test of odds ratio
     P       Asymptotic p-value
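
For downstream work, a results file with these columns can be read with a few lines of Python. The sketch below assumes the file is whitespace-delimited with a single header row naming the fields listed above; the sample rows are invented for illustration (as if produced with --epi1 1), and the function name is hypothetical.

```python
def read_epi_results(text, max_p=1e-4):
    """Parse whitespace-delimited results and keep rows with P <= max_p."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        rec = dict(zip(header, line.split()))
        rec["P"] = float(rec["P"])
        if rec["P"] <= max_p:
            rows.append(rec)
    # Most significant pairs first
    return sorted(rows, key=lambda r: r["P"])

# Invented example content, not real PLINK output
sample = """
CHR1 SNP1   CHR2 SNP2   OR_INT Z    P
1    rs0001 2    rs0002 2.10   4.20 2.7e-05
1    rs0001 3    rs0003 1.05   0.30 7.6e-01
"""

for rec in read_epi_results(sample):
    print(rec["SNP1"], rec["SNP2"], rec["P"])
```

In practice the text would come from reading plink.epi.cc rather than from an inline string.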

The second part of the output gives, for each SNP in SET1 (or in ALL if no sets were specified), the number of significant epistatic tests that SNP featured in (i.e. either with ALL other SNPs, with SET1, or with SET2). The threshold --epi2 determines what counts as significant here:
plink --file mydata --epistasis --epi1 0.0001 --epi2 0.05

The output in the plink.epi.cc.summary file contains the following fields:
     CHR        Chromosome
     SNP        SNP identifier
     N_SIG      # significant epistatic tests (p <= "--epi2" threshold)
     N_TOT      # of valid tests (i.e. non-zero allele counts, etc)
     PROP       Proportion significant of valid tests
     BEST_CHISQ Highest statistic for this SNP 
     BEST_CHR   Chromosome of best SNP
     BEST_SNP   SNP identifier of best SNP
This file should be interpreted as giving only a very rough idea about the extent of epistasis and which SNPs seem to be interacting. It is a naive statistic, as LD is not taken into account: PROP does not represent the proportion of independent epistatic results.
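
How such per-SNP summary fields relate to the individual pairwise tests can be sketched in Python. This is an illustration for the symmetrical case only, assuming every valid test is available in memory as a (snp1, snp2, statistic, p) tuple and that each test counts towards both SNPs in the pair. The function name and the values are hypothetical, and PLINK's own bookkeeping may differ.

```python
def summarise(results, epi2=0.05):
    """Per-SNP N_SIG, N_TOT, PROP, BEST_CHISQ and BEST_SNP from pairwise tests."""
    summary = {}
    for snp1, snp2, stat, p in results:
        # Credit the test to both SNPs in the pair
        for snp, partner in ((snp1, snp2), (snp2, snp1)):
            s = summary.setdefault(
                snp, {"N_SIG": 0, "N_TOT": 0, "BEST_CHISQ": None, "BEST_SNP": None})
            s["N_TOT"] += 1
            if p <= epi2:
                s["N_SIG"] += 1
            if s["BEST_CHISQ"] is None or stat > s["BEST_CHISQ"]:
                s["BEST_CHISQ"] = stat
                s["BEST_SNP"] = partner
    for s in summary.values():
        s["PROP"] = s["N_SIG"] / s["N_TOT"]
    return summary

# Invented test results: (SNP1, SNP2, statistic, p-value)
results = [("rsA", "rsB", 6.5, 0.011),
           ("rsA", "rsC", 0.4, 0.53),
           ("rsB", "rsC", 1.2, 0.27)]

for snp, s in summarise(results).items():
    print(snp, s["N_SIG"], s["N_TOT"], s["PROP"], s["BEST_SNP"])
```

Because N_TOT counts all valid tests, not just those passing --epi1, a summary like this cannot be rebuilt from the filtered pairwise output alone.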

Case-only epistasis

For case-only epistatic analysis,
plink --file mydata --fast-epistasis --case-only

sends output to (co = case-only)
     plink.epi.co
     plink.epi.co.summary
All other options are as described above.

Currently, only pairs of SNPs that are more than 1 Mb apart, or on different chromosomes, are included in case-only tests. This behavior can be changed with the --gap option, with the distance specified in kb: for example, to specify a gap of 5 Mb,
plink --file mydata --fast-epistasis --case-only --gap 5000

This option is important, as the case-only test for epistasis assumes that the two SNPs are in linkage equilibrium in the general population.
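
The distance rule can be written out as a small Python check. This is only a sketch of the rule as stated above; the function name, argument layout and positions are assumptions, with base-pair coordinates and the gap given in kb as for --gap.

```python
def eligible_for_case_only(chr1, bp1, chr2, bp2, gap_kb=1000):
    """True if a SNP pair passes the gap rule: different chromosomes,
    or more than gap_kb kilobases apart on the same chromosome."""
    if chr1 != chr2:
        return True
    return abs(bp1 - bp2) > gap_kb * 1000

# Hypothetical positions
print(eligible_for_case_only(1, 1_000_000, 1, 1_500_000))               # False: 0.5 Mb apart
print(eligible_for_case_only(1, 1_000_000, 1, 3_000_000))               # True: 2 Mb apart
print(eligible_for_case_only(1, 1_000_000, 1, 3_000_000, gap_kb=5000))  # False with a 5 Mb gap
print(eligible_for_case_only(1, 1_000_000, 2, 1_000_100))               # True: different chromosomes
```

A larger gap is more conservative: it drops more nearby pairs that might be in linkage disequilibrium, at the cost of testing fewer pairs.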

Gene-based tests of epistasis

WARNING This test is still under heavy development and not ready for use.
 
This document last modified Wednesday, 11-Jun-2008 18:14:43 EDT