PLINK: Whole genome data analysis toolset plink...
Latest PLINK release is v1.07 (10-Oct-2009)

Whole genome association analysis toolset

Introduction | Basics | Download | Reference | Formats | Data management | Summary stats | Filters | Stratification | IBS/IBD | Association | Family-based | Permutation | LD calcualtions | Haplotypes | Conditional tests | Proxy association | Imputation | Dosage data | Meta-analysis | Result annotation | Clumping | Gene Report | Epistasis | Rare CNVs | Common CNPs | R-plugins | SNP annotation | Simulation | Profiles | ID helper | Resources | Flow chart | Misc. | FAQ | gPLINK

1. Introduction

2. Basic information

3. Download and general notes

4. Command reference table

5. Basic usage/data formats 6. Data management

7. Summary stats 8. Inclusion thresholds 9. Population stratification 10. IBS/IBD estimation 11. Association 12. Family-based association 13. Permutation procedures 14. LD calculations 15. Multimarker tests 16. Conditional haplotype tests 17. Proxy association 18. Imputation (beta) 19. Dosage data 20. Meta-analysis 21. Annotation 22. LD-based results clumping 23. Gene-based report 24. Epistasis 25. Rare CNVs 26. Common CNPs 27. R-plugins 28. Annotation web-lookup 29. Simulation tools 30. Profile scoring 31. ID helper 32. Resources 33. Flow-chart 34. Miscellaneous 35. FAQ & Hints

36. gPLINK

Association analysis

The basic association test is for a disease trait and is based on comparing allele frequencies between cases and controls (asymptotic and empirical p-values are available). Also implemented are the Cochran-Armitage trend test, Fisher's exact test, different genetic models (dominant, recessive and general), tests for stratified samples (e.g. Cochran-Mantel-Haenszel, Breslow-Day tests), a test for a quantitative trait; a test for differences in missing genotype rate between cases and controls; multilocus tests, using either Hotelling's T(2) statistic or a sum-statistic approach (evaluated by permutation) as well as haplotype tests. The basic tests can be performed with permutation, described in the following section to provide empirical p-values, and allow for different designs (e.g. by use of structured, within-cluster permutation). Family-based tests are described in the next section

HINT The basic association commands (--assoc, --model, --fisher, --linear and --logistic) will test only a single phenotype. If your alternate phenotype file contains more than one phenotype, then adding the --all-pheno flag will make PLINK cycle over each phenotype, e.g. instead of a single plink.assoc output file, if there are 100 phenotypes, PLINK will now show
Naturally, it will take 100 times longer... If you are testing a very large number of phenotypes, it might be worth specifying --pfilter also, to reduce the amount of amount (e.g. only outputing tests significant at p=1e-4 if --pfilter 1e-4 is specified).

Basic case/control association test

To perform a standard case/control association analysis, use the option:
plink --file mydata --assoc

which generates a file
which contains the fields:
     CHR     Chromosome
     SNP     SNP ID
     BP      Physical position (base-pair)
     A1      Minor allele name (based on whole sample)
     F_A     Frequency of this allele in cases
     F_U     Frequency of this allele in controls
     A2      Major allele name
     CHISQ   Basic allelic test chi-square (1df)
     P       Asymptotic p-value for this test
     OR      Estimated odds ratio (for A1, i.e. A2 is reference)

Hint In addition, if the optional command --ci X (where X is the desired coverage for a confidence interval, e.g. 0.95 or 0.99) is included, then two extra fields are appended to this output:
     L95        Lower bound of 95% confidence interval for odds ratio
     U95        Upper bound of 95% confidence interval for odds ratio 
(where 95 would change if a different value was used with the --ci option, naturally).

Adding the option
with --assoc will make PLINK report allele counts, rather than frequencies, in cases and controls.

See the next section on permutation to learn how to generate empirical p-values and use other aspects of permutation-based testing.

See the section on multimarker tests to learn how to perform haplotype-based tests of association.

This analysis should appropriately handle X/Y chromosome SNPs automatically.

Fisher's Exact test (allelic association)

To perform a standard case/control association analysis using Fisher's exact test to generate significance, use the option:
plink --file mydata --fisher

which generates a file
which contains the fields:
     CHR     Chromosome
     SNP     SNP ID
     BP      Physical position (base-pair)
     A1      Minor allele name (based on whole sample)
     F_A     Frequency of this allele in cases
     F_U     Frequency of this allele in controls
     A2      Major allele name
     P       Exact p-value for this test
     OR      Estimated odds ratio (for A1)
As described below, if --fisher is specified with --model as well, PLINK will perform genotypic tests using Fisher's exact test.

Note You can also use permutation to generate exact, empirical significance values that would also be valid in small samples, etc.

Alternate / full model association tests

It is possible to perform tests of association between a disease and a variant other than the basic allelic test (which compares frequencies of alleles in cases versus controls), by using the --model option. The tests offered here are (in addition to the basic allelic test):
  • Cochran-Armitage trend test
  • Genotypic (2 df) test
  • Dominant gene action (1df) test
  • Recessive gene action (1df) test
One advantage of the Cochran-Armitage test is that it does not assume Hardy-Weinberg equilibrium, as the individual, not the allele, is the unit of analysis (although the permutation-based empirical p-values from the basic allelic test also have this property). It is important to remember that SNPs showing severe deviations from Hardy-Weinberg are often likely to be bad SNPs, or reflect stratification in the sample, however, and so are probably best excluded in many cases.

The genotypic test provides a general test of association in the 2-by-3 table of disease-by-genotype. The dominant and recessive models are tests for the minor allele (which is the minor allele can be found in the output of either the --assoc or the --freq commands. That is, if D is the minor allele (and d is the major allele):
     Allelic:         D        versus      d
     Dominant:     (DD, Dd)    versus      dd
     Recessive:       DD       versus   (Dd, dd)
     Genotypic:       DD       versus      Dd         versus    dd
As mentioned above, these tests are generated with option:
plink --file mydata --model

which generates a file
which contains the following fields:
     CHR           Chromosome number
     SNP           SNP identifier
     TEST          Type of test
     AFF           Genotypes/alleles in cases
     UNAFF         Genotypes/alleles in controls
     CHISQ         Chi-squated statistic
     DF            Degrees of freedom for test
     P             Asymptotic p-value
Each SNP will feature on five rows of the output, correspondnig to the five tests applied. The column TEST refers to either ALLELIC, TREND, GENO, DOM or REC, refering to the different types of test mentioned above. The genotypic or allelic counts are given for cases and controls separately. For recessive and dominant tests, the counts represent the genotypes, with two of the classes pooled.

These tests only consider diploid genotypes: that is, for the X chromosome males will be excluded even from the ALLELIC test. This way the same data are used for the five tests presented here. Note that, in contrast, the basic association commands (--assoc and --linear, etc) include single male X chromosomes, and so the results may differ.

The genotypic and dominant/recessive tests will only be conducted if there is a minimum number of observations per cell in the 2-by-3 table: by default, if at least one of the cells has a frequency less than 5, then we skip the alternate tests (NA is written in the results file). The Cochran-Armitage and allelic tests are performed in all cases. This threshold can be altered with the --cell option:
plink --file mydata --model --cell 20

If permutation (with the --mperm or --perm options) is specified, the -model option will by default perform a permutation test based on the most significant result of ALLELIC, DOM and REC models. That is, for each SNP, the best original result will be compared against the best of these three tests for that SNP for every replicate. In max(T) permutation mode, this will also be compared against the best result from all SNPs for the EMP2 field. This procedure controls for the fact that we have selected the best out of three correlated tests for each SNP. The output will be generated in the file
depending on whether adaptive or max(T) permutation was used.

The behavior of the --model command can be changed by adding the --model-gen, --model-trend, --model-dom or --model-rec flags to make the permutation use the genotypic, the Cochram-Armitage trend test, the dominant test or the recessive test as the basis for permutation instead. In this case, one of the the following files will be generated:
     plink.model.gen.perm         plink.model.gen.mperm
     plink.model.trend.perm       plink.model.trend.mperm
     plink.model.dom.perm         plink.model.dom.mperm
     plink.model.rec.perm         plink.model.rec.mperm
It is also possible to add the --fisher flag to obtain exact p-values:
./plink --bfile mydata --model --fisher

in which case the CHISQ field does not appear. Note that the genotypic, allelic, dominant and recessive models use the Fisher's exact; the trend-test does not and will give the same p-value as without the --fisher flag. Also, by default, when --fisher is added, the --cell field is set to 0, i.e. to include all SNPs.

Stratified analyses

When a cluster variable has been specified, by pointing to a file that contains this information, with the --within command, it is possible to perform a number of tests of case/control association that take this clustering into account, or explicitly test for homogeneity of effect between clusters.

Note In many cases, permutation procedures can also be used to account for clusters in the data. See the next section for more details. The tests presented below are only applicable for case/control data, so permutation might be useful for quantitative trait outcomes, etc.

There are two basic classes of test:
  • Testing for overall disease/gene association, controlling for clusters
  • Testing for heterogeneity of the disease/gene assocation between different clusters
The type of cluster structure will vary in terms of how many clusters there are in the sample, and how many people belong to each cluster. At one extreme, we might have two only 2 clusters in the sample, each with a large number of cases and controls. At the other extreme, we might have a very large number of clusters, such that each cluster only has 2 individuals. These factors will influence the choice of stratified analysis.

The tests offered are:
  • Cochran-Mantel-Haenszel test for 2x2xK stratified tables
  • Cochran-Mantel-Haenszel test for IxJxK stratified tables
  • Breslow-Day test of homogeneity of odds ratio
  • Partitioning the total association chi-square to perform between and within cluster association, and a test of homogeneity of effect
The Cochran-Mantel-Haenszel (CMH) tests are valid with both a large number of small clusters and a small number of large clusters. These tests provide a test based on an "average" odds ratio that controls for the potential confounding due to the cluster variable.

The Breslow-Day test asks whether different clusters have different disease/gene odds ratios: this test assumes a moderate sample size within each cluster. The partitioning total association test, which is conceptually similar to the Breslow-Day test, also makes the same assumption.

As mentioned above, the CMH test comes in two flavours: 2x2xK and IxJxK. Currently, the 2x2xK test represents a { disease x SNP | cluster } test. The generalized form, the IxJxK, represents a test of { cluster x SNP | disease }, i.e. does the SNP vary between clusters, controlling for any possible true SNP/disease association. This latter test might be useful in interpreting significant associations in stratified samples. Typically, the first form of the test will be of more interest, however. These two tests are run by using the options:
plink --file mydata --mh --within mycluster.dat

for the basic CMH test, or
plink --file mydata --mh2 --within mycluster.dat

for the IxJxK test.

The --mh option generates the file
which contains the fields
     CHR       Chromosome number
     SNP       SNP identifier
     A1        Minor allele code
     A2        Major allele code
     BP        Physical position (base-pair)
     CHISQ     Cochran-Mantel-Haenszel statistic (1df)
     P         Asymptotic p-value for CMH test
     OR        CMH odds ratio
     L95       Lower bound on confidence interval for CMH odds ratio
     U95       Upper bound on confidence interval for CMH odds ratio
The range of the confidence interval with the --mh option can be changed with the --ci option:
plink --file mydata --mh --within mycluster.dat --ci 0.99

The --mh2 option generates the file
which contains the fields:
     CHR         Chromosome
     SNP         SNP identifier
     CHISQ_CMH2  Cochran-Mantel-Haenszel test for IxJxK tables
     P_CMH2      Asymptotic p-value for this test
It is not possible to obtain confidence intervals or odds ratios for --mh2 tests.

Hint A trick to analyse phenotypes with more two categories (but only with nominal, not ordinal outcomes) is to use the --mh2 option with the phenotype in the cluster file and the phenotype in the PED file set all to a single value.

Testing for heterogeneous association

As mentioned in the previous section, two methods are provided to test for between-cluster differences in association when using a case/control design. The Breslow-Day test is specified with the option:
plink --file mydata --bd --within myclst.txt

which runs and generates the same files as the --mh option, described above, but with two extra fields appended:
     CHISQ_BD   Breslow-Day test
     P_BD       Asymptotic p-value
where a significant value indicates between-cluster heterogeneoty in the odds ratios for the disease/SNP association.

A similar test of the homogeneity of odds ratio tests based on partitioning the chi-square statistic is given by:
plink --file mydata --homog --within myclst.txt

which generates the file
which contains the fields
     CHR      Chromosome number
     SNP      SNP identifier
     A1       Minor allele code
     A2       Major allele code
     F_A      Case allele frequency
     F_U      Control allele frequency
     N_A      Case allele count
     N_U      Control allele count
     TEST     Type of test
     CHISQ    Chi-squared association statistic
     DF       Degrees of freedom
     P        Asymptotic p-value
     OR       Odds ratio
The TEST type is either
     TOTAL    Total SNP & strata association
     ASSOC    SNP association controlling for strata
     HOMOG    Between-strata heterogeneity test
     X_1      Association in first stratum
     X_2      Association in second stratum

Hotelling's T(2) multilocus association test

IMPORTANT This command has been temporarily disabled

For disease-traits, PLINK provides support for a multilocus, genotype-based test using Hotelling's T2 (T-squared) statistic. The --set option should be used to specify which SNPs are to be grouped, as follows:
plink --file data --set mydata.set --T2

where mydata.set defines which SNPs are in which set (see this section for more information on defining sets).

This command will generate a file
which contains the fields
     SET      Set name
     SIZE     Number of SNPs in this set
     F        F-statistic from Hotelling's test
     DF1      Degrees of freedom 1
     DF2      Degrees of freedom 2
     P_HOTEL  Asymptotic p-value

HINT Use the --genedrop permutation to perform a family-based application of the Hotelling's T2 test. This command can be used with all permutation methods (label-swapping or gene-dropping, adaptive or max(T)). In fact, the permutation test is based on 1-p in order to make the between set comparisons for the max(T) statistic more meaningful (as different sized sets would have F-statistics with different degrees of freedom otherwise). Using permutation will generate one of the following files:
which contain the fields
     SET      Set name
     SIZE     Number of SNPs in this set
     EMP1     Empirical p-value
     NR       Number of permutation replicates
or, if --mperm was used,
which contain the fields
     SET      Set name
     SIZE     Number of SNPs in this set
     EMP1     Empirical p-value
     EMP2     max(T) empirical p-value
Note that this test uses a simple approach to missing data: rather than case-wise deletion (removing an individual if they have at least one missing observation) we impute the mean allelic value. Although this retains power under most scenarios, it can also cause some bias when there are lots of missing data points. Using permutation is a good way around this issue.

Quantitative trait association

Quantitative traits can be tested for association also, using either asymptotic (likelihood ratio test and Wald test) or empirical significance values. If the phenotype (column 6 of the PED file or the phenotype as specified with the --pheno option) is quantitative (i.e. contains values other than 1, 2, 0 or missing) then PLINK will automatically treat the analysis as a quantitative trait analysis. That is, the same command as for disease-trait association:
plink --file mydata --assoc

will generate the file
with fields as follows:
     CHR      Chromosome number
     SNP      SNP identifier
     BP       Physical position (base-pair)
     NMISS    Number of non-missing genotypes
     BETA     Regression coefficient
     SE       Standard error
     R2       Regression r-squared
     T        Wald test (based on t-distribtion)
     P        Wald test asymptotic p-value
If permutations were also requested, then an extra file, either
will be generated, depending on whether adaptive or max(T) permutation was used (see the next section for more details). The empirical p-values are based on the Wald statistic.

Genotype means for quantitative traits

Adding the flag --qt-means along with the --assoc command, when run with a quantitative trait, will produce an additional file with a list of means and standard deviations stratified by genotype, called
and format
     CHR     Chromosome code
     SNP     SNP identifier
     VALUE   Description of next three fields
     G11     Value for first genotype
     G12     Value for second genotype
     G22     Value for third genotype
where VALUE is one of GENO, COUNTS, FREQ, MEAN or SD (standard deviation). For example:
      CHR           SNP  VALUE      G11      G12      G22
        5   hCV26311749   GENO      2/2      2/1      1/1
        5   hCV26311749 COUNTS        1       60      597
        5   hCV26311749   FREQ  0.00152  0.09119   0.9073
        5   hCV26311749   MEAN   0.9367   0.4955   0.5074
        5   hCV26311749     SD        0    0.273   0.2902
        5     hCV918000   GENO      2/2      2/1      1/1
        5     hCV918000 COUNTS       47      237      359
        5     hCV918000   FREQ  0.07309   0.3686   0.5583
        5     hCV918000   MEAN    0.505   0.5091   0.5074
        5     hCV918000     SD   0.2867   0.3064   0.2797
i.e. each SNP takes up 5 rows.

Quantitative trait interaction (GxE)

PLINK provides the ability to test for a difference in association with a quantitative trait between two environments (or, more generally, two groups). This test is simply based on comparing the difference between two regression coefficients. To perform this test:
plink --file mydata --gxe --covar mycov.dat

where mycovar.txt is a file containing the following fields:
     Family ID
     Individual ID
     Covariate value
See the notes on covariate files for more details.

This option will generate the file
which contains the fields:
     CHR       Chromosome number
     SNP       SNP identifier
     NMISS1    Number of non-missing genotypes in first group (1)
     BETA1     Regression coefficient in first group
     SE1       Standard error of coefficient in first group
     NMISS2    As above, second group
     BETA2     As above, second group
     SE2       As above, second group
     Z_GXE     Z score, test for interaction
     P_GXE     Asymptotic p-value for this test
IMPORTANT! The covariate must be coded as an affection status variable, i.e. 1 or 2 representing the first or second group. Values of 0 or -9 can be used to indicate missing covariate values, in which case that individual will be excluded from analysis.

Linear and logistic models

These two features allow for multiple covariates when testing for both quantitative trait and disease trait SNP association, and for interactions with those covariates. The covariates can either be continuous or binary (i.e. for categorical covariates, you must first make a set of binary dummy variables).

WARNING! These commands are in some ways more flexible than the standard --assoc command, but this comes with a price: namely, these run more slowly...

In this section we consider:
  • Basic uasge
  • Covariate and interactions
  • Flexibly specifying the precise model
  • Flexibly specifying joint tests
Basic usage
For quantitative traits, use
plink --bfile mydata --linear

For disease traits, specify logistic regression with
plink --bfile mydaya --logistic

instead. All other commands in this section apply equally to both these models.

These commands will either generate the output file
depending on the phenotype/command used. The basic format is:
     CHR       Chromosome
     SNP       SNP identifier
     BP        Physical position (base-pair)
     A1        Tested allele (minor allele by default) 
     TEST      Code for the test (see below)
     NMISS     Number of non-missing individuals included in analysis
     BETA/OR   Regression coefficient (--linear) or odds ratio (--logistic)
     STAT      Coefficient t-statistic 
     P         Asymptotic p-value for t-statistic
For the additive effects of SNPs, the direction of the regression coefficient represents the effect of each extra minor allele (i.e. a positive regression coefficient means that the minor allele increases risk/phenotype mean). If the --beta command is added along with --logistic, then the regression coefficients rather than the odds ratios will be returned.

NOTE Elsewhere in this documentation, the term reference allele is sometimes used to refer to A1, i.e. the --reference-allele command can be used to specify which allele is A1. Note that in association testing, the odds ratios, etc are typically calculated with A2 as the actual reference allele (i.e. a positive OR means A1 increases risk relative to A2).

HINT Adding the --ci 0.95, for example, option will given 95% confidence intervals for the estimated parameters, in additional L95 and U95 fields in the output files.

By itself, the --linear command will give identical results to the Wald test from the --assoc command when applied to quantitative traits.The --logistic command may give slightly different results to the --assoc command for disease traits, but this is because a different test/model is being applied (i.e. logistic regression rather than allele counting). The difference may be particularly large for very rare alleles (i.e. if the SNP is monomorphic in cases or controls, then the logistic regression model is not well-defined and asymptotic results might not hold for the basic test either).

When using --linear, adding the option
will first standard the phenotype (mean 0, unit variance), so the resulting coefficients will be standardized.

The TEST column is by default ADD meaning the additive effects of allele dosage. Adding the option
will generate file which will have two extra tests per SNP, corresponding to two extra rows: DOMDEV and GENO_2DF which represent a separate test of the dominance component or a 2 df joint test of both additive and dominance (i.e. corresponding the the general, genotypic model in the --model command). Unlike the dominance model is the --model, DOMDEV refers to a variable coded 0,1,0 for the three genotypes AA,Aa,aa, i.e. representing the dominance deviation from additivity, rather specifying that a particular allele is dominant or recessive. That is, the DOMDEV term is fitted jointly with the ADD term in a single model.

NOTE! The coding PLINK uses with the 2 df --genotypic model involves two variables representing an additive effect and a dominance deviation;
          A   D
     AA   0   0
     AB   1   1
     BB   2   0
Although the 2df test will be identical, you would not expect to see similar p-values, etc for the two individual terms if instead you used a different version of "genotypic" coding, e.g. in another analysis package, such as using dummy variables to represent genotypes:
          G1   G2
     AA   0    0
     AB   1    0
     BB   0    1
That is, although fundamentally the same, in terms of the 2df test, the interpretation of the two individual terms is different in these two cases. To achieve this coding in PLINK (v1.02 onwards), add the --hethom flag as well as --genotypic.

In a related note, you would not always expect the ADD p-value to be the same when entering in the dominance term as it is without it; if in doubt, you are advised to stick to just interpreting the 2 df test if using the --genotypic option.


To specify a model assuming full dominance (or recessive) for the minor allele (i.e. rather than the 2 df model mentioned above), you can specify with either

Covariates and interactions
If a covariate file is also specified, then all covariates in that file will be included in the regression model, labelled COV1, COV2, etc. This is different to other commands which take only a single covariate (possibly working in conjunction with the --mcovar option).

NOTE The --covar-name or --covar-number commands can be used to select a subset of all covariates in the file, described here.

For example, if the covariate file is made as described here and contains 2 covariates then the command
plink --bfile mydata --linear --genotypic --covar mycov.txt

will add two extra tests per SNP, COV1 and COV2. The p-value for the SNP term or terms in the model will be adjusted for the covariates; that is, a single model is fit to the data (that also includes a dominance term, as the --genotypic flag was also set):
     Y = b0 + b1.ADD + b2.DOMDEV + b3.COV1 + b4.COV2 + e
(Note, using this notation, the genotypic test is of b1=b2=0.)

The output per each SNP might look something like:
    CHR        SNP      BP  A1       TEST   NMISS       OR      STAT         P
      5   rs000001   10001   A        ADD     664   0.7806    -1.942   0.05216
      5   rs000001   10001   A     DOMDEV     664   0.9395   -0.3562    0.7217
      5   rs000001   10001   A       COV1     664   0.9723   -0.7894    0.4299
      5   rs000001   10001   A       COV2     664    1.159    0.5132    0.6078
      5   rs000001   10001   A   GENO_2DF     664       NA     5.059    0.0797   
That is, this represent coefficients from four terms in a multiple regression of disease on ADD, DOMDEV, COV1 and COV2 jointly. The final test is a 2df test that tests the coefficients for ADD and DOMDEV together. Importantly, the p-values for each line reflect the effect of the entity under the TEST column, not of the SNP whilst controlling for that particular covariate. (That is, p=0.0797 is the 2df test of the SNP whilst controlling for COV1 and COV2.)

HINT To suppress the multiple lines of output for each covariate (which often are not of interest in themselves) add the flag --hide-covar, i.e. the above would just read as follows for this SNP:
    CHR        SNP      BP  A1       TEST   NMISS       OR      STAT         P
      5   rs000001   10001   A        ADD     664   0.7806    -1.942   0.05216
      5   rs000001   10001   A   GENO_2DF     664       NA     5.059    0.0797   

HINT To condition analysis on a specific SNP when using --linear or --logistic, use the --condition option, e.g.
plink --bfile mydata --linear --condition rs123456

will test all SNPs but adding the allelic dosage for rs123456 as a covariate. This command can be used in conjunction with --covar and the other options listed here. To condition on multiple SNPs, use, for example,
plink --bfile mydata --linear --condition-list snps.txt

where snps.txt is a plain text file contain a list of SNPs which are to be included as covariates. The output will now include terms that correspond to the SNPs listed in the file snps.txt.

The conditioning SNPs are entered into the model simply as covariates, using a simple 0, 1, 2 allele dosage coding. That is, for two conditioning SNPs, rs1001 and rs1002 say, and also a standard covariate, the model would be
     Y = b0 + b1.ADD + b2.rs1001 + b3.rs1002 + b4.COV1 + e
If the b1 coefficient for the test SNP is still significant after entering these covariates, this would suggest that it does indeeed have an effect independent of rs1001, rs1002 and the other covariate. (The other coefficients may still be highly significant, but these reflect the effects of the conditioning SNPs and covariates, not the test SNP.)

If the --sex flag is added, then sex will be entered as a covariate in the model (coded 1 for male, 0 for female), e.g
plink --bfile mydata --logistic --sex

If the option --interaction is added, then terms will be entered which correspond to SNP x covariate interactions (with DOMDEV as well as ADD if --genotypic is specified). In the case of two covariates, without --genotypic, for example, the command
plink --bfile mydata --linear --covar tmp.cov --interaction

results in the model
     Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV1 + b5.ADDxCOV2 + e 

NOTE Please remember that when interaction terms are included in the model, the significance of the main effects can not necessarily be interpreted straightforwardly (i.e. they will depend on the arbitrary coding of the variables). In otherwords, when including the --interaction flag, you should probably only interpret the interaction p-value. Please refer to any standard text of regression models if you are unclear on this.

Finally, a --test-all option drops all the terms in the model in a multiple degree of freedom test.

Flexibly specifying the model
Use command such as --covar and --interaction will automatically enter all covariates and possible SNP x covariate interactions. If one does not want to test all of these, then use the --parameters flag to extract only the ones of interest.

For example, to take the example above:
     Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV1 + b5.ADDxCOV2 + e 
If one only wanted ADD, the two covariates and the ADDxCOV2 but not the ADDxCOV1 interaction, then, from the above example, you could use
plink --bfile mydata --linear --covar tmp.cov --interaction --parameters 1,2,3,5

That is, --parameters takes a comman-separated list of integers, starting from 1, that represent the terms in the model (in the order in which they would appear if the command were run without the --parameters flag). In this case:
     ADD          [1]
     COV1         [2]
     COV2         [3]
     ADD x COV1   [4]  <-- excluded
     ADD x COV2   [5]
Flexibly specifying joint tests
To perform a user-defined joint test of more than one parameter, use the --tests option. This takes a comma-delimited set of parameter numbers, for example: if the model is
     ADD        [1]
     COV1       [2]
     COV2       [3]
     ADDxCOV1   [4]
     ADDxCOV2   [5]
plink --bfile mydate --linear --covar file.cov --interaction --tests 1,4,5

represeents a 3 degree of freedom test of ADD and the two interactions.

Note, if this is used in conjunction with the --parameters option, then the coding here refers to the reduced model -- for example, the command
plink --bfile mydate --linear --covar file.cov --interaction --parameters 1,2,3,5 --tests 1,4

performs a joint test of ADD and ADDxCOV2 (2df test) whilst controlling for main effects of COV1 and COV2, i.e. we do not use --tests 1,5, as there are now only 4 terms in the model:
                      --parameters 1,2,3,5    --tests 1,4
     ADD        [1]             [1]              TEST
     COV1       [2]             [2]              
     COV2       [3]             [3]              
     ADDxCOV1   [4]             n/a              
     ADDxCOV2   [5]             [4]              TEST
In other words, we fit the model
     Y = b0 + b1.ADD + b2.COV1 + b3.COV2 + b4.ADDxCOV2 + e 
and jointly test the hypothesis
     H0: b1 = b4 = 0 

As mentioned above, use --test-all to drop all terms in the model in a single joint test.
A common problem with multiple regression is that of multi-collinearity: when the predictor variables are too strongly correlated to each other, the parameter estimates will become unstable. PLINK tries to detect this, and will display NA for the test statistic and p-value for all terms in the model if there is evidence of multi-collinearity. One common instance where this would occur would be if one includes the --genotypic option but a SNP only has two of the three possible genotype classes: in this case, ADD and DOM will be perfectly correlated and PLINK will display NA for both tests; this is basically telling you that you should re-run without the --genotypic option for that particular SNP. Similar principles apply to including covariates and interactions terms: the more terms you include, the more likely you are to have problems.

The --vif option can be used to specify the variance inflation factor (VIF) used in the initial test for multicollinearity. The default value is 10 -- smaller values represent more stringent tests.

HINT If you have a quantitative trait, only want an additive model and have only a single binary covariate, use the --gxe option (described above) instead of --linear: it will run much faster (being based on a more simple test of the difference of two regression slopes; it will not necessarily give numerically identical results to the multiple regression approach, but asymptotically both tests should be similar).

Set-based tests

These set-based tests are particularly suited to large-scale candidate gene studies as opposed to whole genome association studies, as they use permutaiton.

NOTE The basis of the set-based test has been changed in version 1.04 onwards.

This analysis works as follows:
  1. For each set, for each SNP determine which other SNPs are in LD, above a certain threshold R
  2. Perform standard single SNP analysis (which might be basic case/control association, family-based TDT or quantitative trait analysis).
  3. For each set, select up to N "independent" SNPs (as defined in step 1) with p-values below P. The best SNP is selected first; subsequent SNPs are selected in order of descreasing statistical significance, after removing SNPs in LD with previously selected SNPs.
  4. From these subsets of SNPs, the statistic for each set is calculated as the mean of these single SNP statistics
  5. Permute the dataset a large number of times, keeping LD between SNPs constant (i.e. permute phenotype labels)
  6. For each permuted dataset, repeat steps 2 to 4 above.
  7. Empirical p-value for set (EMP1) is the number of times the permuted set-statistic exceeds the original one for that set.
Note that the empirical p-values are corrected for the multiple SNPs within a set (taking account of the LD between these SNPs). They are not corrected for multiple testing if there is more than one set, however (i.e. there is no equivalent of EMP2 (see the page on permutation).

The critical parameters described above, R, N and P can all be altered by the user, as described below.

To perform a set-based test the critical keywords are
     --set my.set 
     --mperm 10000
which state that we are performing a set-based test, which set-file to use and how many permutations to perform (this last command is necessary). As mentioned above, the --assoc command could be replaced by --tdt, or --logistic, etc.

The set file my.set is in form


For example,
plink --file mydata --set-test --set my.set --mperm 10000 --assoc

would display in the LOG file the following critical parameters with their default values
     Performed LD-based set test, with parameters:
          r-squared  (--set-r2)   = 0.5
          p-value    (--set-p)    = 0.05
          max # SNPs (--set-max)  = 5
The output is written to a file with a .set.mperm extension, for example
with the fields
     SET     Set name
     NSNP    Number of SNPs in set 
     NSIG    Total number of SNPs below p-value threshold
     ISIG    Number of significant SNPs also passing LD-criterion
     STAT    Average test statistic based on ISIG SNPs
     EMP1    Empirical set-based p-value
     SNPS    List of SNPs in the set
For example, here is output from a case/control dataset with SNPs for five related genes (lines truncated)
         SET   NSNP   NSIG   ISIG         STAT         EMP1 SNPS
      GABRB2     45      0      0            0            1 NA
      GABRA6      6      4      3        5.199      0.09489 rs3811991|rs2197414|...
      GABRA1     22     11      5        5.951      0.09459 rs4254937|rs4260711|...
      GABRG2     24      0      0            0            1 NA
       GABRP     17      2      1         7.64       0.0269 rs7736504
Here the first gene, GABRB2 has 45 SNPs, but none of these are significant at p=0.05, and so the empirival p-value is necessarily 1.00. The next gene has 6 SNPs, 4 of which are significant, but only 3 of which are independently significant based on an r-squared threshold of 0.5. The STAT of 5.199 is the average chi-squared statistic across these three SNPs. It should not be interpreted in itself -- rather, you should consider the EMP1 significance value based on it. In this case, P=0.095. The final gene, GABRP is nominally significant here, P=0.027, but this does not correct for the 5 genes tested of course.

Naturally, different thresholds will produce different results. Depending on the unknown genetic architecture, these may vary substantially and meaningfully so. In general, if the set represents a very large pathway (dozens of genes) you might want to increase --set-max. There are probably no hard and fast rules with regard to how to set --set-p and --set-r2, except to say that running under a large number of settings and selecting the most significant is not a good idea.

Running with a "stricter" set of values
     --set-r2 0.1
     --set-p 0.01
     --set-max 2
we see a broadly similar pattern of results; naturally, the thresholding on p-value means that GABRA6 goes from showing some signal to asbolutely no signal.
         SET   NSNP   NSIG   ISIG         STAT         EMP1 SNPS
      GABRB2     45      0      0            0            1 NA
      GABRA6      6      0      0            0            1 NA
      GABRA1     22      2      2        7.464      0.05949 rs4254937|rs4260711
      GABRG2     24      0      0            0            1 NA
       GABRP     17      1      1         7.64      0.06309 rs7736504
Alternatively, a more inclusive setting might be something like
     --set-r2 0.8
     --set-p 1
     --set-max 10
which, in this particular case, happens to yield slightly stronger signals for GABRA6 and GABRA1 but weaker for GABRp (lines truncated)
         SET   NSNP   NSIG   ISIG         STAT         EMP1 SNPS
      GABRB2     45     12     10        1.749       0.7162 hCV26311691|...
      GABRA6      6      6      6        3.998       0.0184 rs3811991|...
      GABRA1     22     13     10        5.277       0.0182 rs4254937|...
      GABRG2     24     11     10       0.6976       0.9099 hCV3167705|...
       GABRP     17     10     10        2.753       0.1225 rs7736504|...

HINT Two extremes are to perform a test based on a) the best single SNP result per set:
     --set-max 1
     --set-p 1     
or to use all SNPs in a set:
     --set-max 99999
     --set-p 1
     --set-r2 1

Adjustment for multiple testing: Bonferroni, Sidak, FDR, etc

To generate a file of adjusted significance values that correct for all tests performed and other metrics, use the option:
plink --file mydata --assoc --adjust

which generates the file
which contains the fields
     CHR         Chromosome number
     SNP         SNP identifer
     UNADJ       Unadjusted p-value
     GC          Genomic-control corrected p-values
     BONF        Bonferroni single-step adjusted p-values
     HOLM        Holm (1979) step-down adjusted p-values
     SIDAK_SS    Sidak single-step adjusted p-values
     SIDAK_SD    Sidak step-down adjusted p-values
     FDR_BH      Benjamini & Hochberg (1995) step-up FDR control
     FDR_BY      Benjamini & Yekutieli (2001) step-up FDR control 
This file is sorted by significance value rather than genomic location, the most significant results being at the top.

WARNING Currently, these procedures are only implemented for asymptotic significance values for the standard TDT and association (disease trait and quantitative trait, --assoc, --linear, --logistic) tests and the 2x2xK Cochran-Mantel-Haenszel test. Future versions will allow these results for empirical significance values and for other tests (e.g. epistasis, etc).


This document last modified Saturday, 10-Oct-2009 11:02:41 EDT