PLINK: Whole genome data analysis toolset plink...
Latest PLINK release is v1.07 (10-Oct-2009)

Whole genome association analysis toolset

Introduction | Basics | Download | Reference | Formats | Data management | Summary stats | Filters | Stratification | IBS/IBD | Association | Family-based | Permutation | LD calcualtions | Haplotypes | Conditional tests | Proxy association | Imputation | Dosage data | Meta-analysis | Result annotation | Clumping | Gene Report | Epistasis | Rare CNVs | Common CNPs | R-plugins | SNP annotation | Simulation | Profiles | ID helper | Resources | Flow chart | Misc. | FAQ | gPLINK

1. Introduction

2. Basic information

3. Download and general notes

4. Command reference table

5. Basic usage/data formats 6. Data management

7. Summary stats 8. Inclusion thresholds 9. Population stratification 10. IBS/IBD estimation 11. Association 12. Family-based association 13. Permutation procedures 14. LD calculations 15. Multimarker tests 16. Conditional haplotype tests 17. Proxy association 18. Imputation (beta) 19. Dosage data 20. Meta-analysis 21. Annotation 22. LD-based results clumping 23. Gene-based report 24. Epistasis 25. Rare CNVs 26. Common CNPs 27. R-plugins 28. Annotation web-lookup 29. Simulation tools 30. Profile scoring 31. ID helper 32. Resources 33. Flow-chart 34. Miscellaneous 35. FAQ & Hints

36. gPLINK

Summary statistics

PLINK will generate a number of standard summary statistics that are useful for quality control (e.g. missing genotype rate, minor allele frequency, Hardy-Weinberg equilibrium failures and non-Mendelian transmission rates). These can also be used as thresholds for subsequent analyses (described in the next section).

All the per-SNP summary statistics described below are conducted after removing individuals with high missing genotype rates, as defined by the --mind option. The default value of which is 0 however, i.e. do not exclude any individuals.

NOTE Regarding the calculation of genotype rates for sex chromosomes: for the Y, females are ignored completely. For the males, heterozygous X and heterozygous Y genotypes are treated as missing. Having the correct designation of gender is therefore important to obtain accurate genotype rate estimates, or avoid incorrectly removing samples, etc.

Missing genotypes

To generate a list genotyping/missingness rate statistics:
plink --file data --missing

This option creates two files:
which detail missingness by individual and by SNP (locus), respectively. For individuals, the format is:
     FID                Family ID
     IID                Individual ID
     MISS_PHENO         Missing phenotype? (Y/N)
     N_MISS             Number of missing SNPs
     N_GENO             Number of non-obligatory missing genotypes
     F_MISS             Proportion of missing SNPs
For each SNP, the format is:
     SNP                SNP identifier
     CHR                Chromosome number
     N_MISS             Number of individuals missing this SNP
     N_GENO             Number of non-obligatory missing genotypes
     F_MISS             Proportion of sample missing for this SNP

HINT To test for case/control differences in missingness, see the --test-missing option.

HINT To produce summary of missingness that is stratified by a categorical cluster variable, use the --within filename option as well as --missing. In this way, the missing rates will be given separately for each level of the categorical variable. For example, the categorical variable could be which plate that sample was on in the genotyping. Details on the format of a cluster file can be found here.

Obligatory missing genotypes

Often genotypes might be missing obligatorarily rather than because of genotyping failure. For example, some proportion of the sample might only have been genotyped on a subset of the SNPs. In these cases, one might not want to filter out SNPs and individuals based on this type of missing data. Alternatively, genotypes for specific plates (sets of SNPs/individuals) might have been blanked out with the --zero-cluster option, but you still might want to be able to sensibly set missing data thresholds.

HINT See the section on data management to see how to make missing certain sets of genotypes.

Two functions allow these 'obligatory missing' values to be identified and subsequently handled specially during the filtering steps:
plink --bfile mydata --oblig-missing --oblig-clusters myfile.clst --assoc

This command applies the default genotyping thresholds (90% per individual and per SNP) but accounting for the fact that certain SNPs are obligatory missing (with the 90% only refers to those SNPs actually attempted, for example). The file specified by --oblig-clusters has the same format as a cluster file (except only a single cluster field is allowed here, i.e. only 3 columns). For example,
1  1 0 0 1 1   A A  C C  A A 
2  1 0 0 1 1   C C  A A  C C
3  1 0 0 1 1   A C  A A  A C 
4  1 0 0 1 1   A A  C C  A A 
5  1 0 0 1 1   C C  A A  C C
6  1 0 0 1 1   A C  A A  A C 
1b 1 0 0 1 1   A A  0 0  0 0
2b 1 0 0 1 1   C C  0 0  0 0
3b 1 0 0 1 1   A C  0 0  0 0
4b 1 0 0 1 1   A A  0 0  0 0
5b 1 0 0 1 1   C C  0 0  0 0
6b 1 0 0 1 1   A C  0 0  0 0
and MAP file
     1 snp1 0 1000
     1 snp2 0 2000
     1 snp3 0 3000
If the obligatory missing file, test.oblig is
     snp2   C1
     snp3   C1
it implies that SNPs snp2 and snp3 are obligatory missing for all individuals belonging to cluster C1. The corresponding cluster file is test.clst
     1b 1 C1
     2b 1 C1
     3b 1 C1
     4b 1 C1
     5b 1 C1
     6b 1 C1
indicating that the last six individuals belong to cluster C1. (Not all individuals need be specified in this file.)

NOTE You can have more than one cluster category specified in these files (i.e. implying different patterns of obligatory missing data for different sets of individuals).

Running a --missing command on the basic fileset, ignoring the obligatory missing nature of some of the data, results in the following:
plink --file test --missing

which shows in the LOG file that 6 individuals were removed because of missing data
     6 of 12 individuals removed for low genotyping ( MIND > 0.1 )
and the corresponding output files (plink.imiss and plink.lmiss) indicate no missing data (purely because the six individuals with 2 of 3 genotypes missing were already filtered out and everybody else left happens to have complete genotyping).
        1    1          N        0        0
        2    1          N        0        0
        3    1          N        0        0
        4    1          N        0        0
        5    1          N        0        0
        6    1          N        0        0
      CHR  SNP   N_MISS   F_MISS
        1 snp1        0        0
        1 snp2        0        0
        1 snp3        0        0
In contrast, if the obligatory missing data are specified as follows:
plink --file test --missing --oblig-missing test.oblig --oblig-clusters test.clst

we now see
     0 of 12 individuals removed for low genotyping ( MIND > 0.1 )
and the corresponding output files now include an extra field, N_GENO, which indicates the number of non-obligatory missing genotypes, which is the denominator for the genotyping rate calculations
        1    1          N        0        3        0
        2    1          N        0        3        0
        3    1          N        0        3        0
        4    1          N        0        3        0
        5    1          N        0        3        0
        6    1          N        0        3        0
       1b    1          N        0        1        0
       2b    1          N        0        1        0
       3b    1          N        0        1        0
       4b    1          N        0        1        0
       5b    1          N        0        1        0
       6b    1          N        0        1        0
        1 snp1        0       12        0
        1 snp2        0        6        0
        1 snp3        0        6        0
Seen another way, if one specified --mind 1 to include all individuals (i.e. not apply the default 90% genotyping rate threshold for each individual before this step), then the results would not change with the obligatory missing specification in place, as expected; in contrast, without the specification of obligatory missing data, we would see
        1    1          N        0        0
        2    1          N        0        0
        3    1          N        0        0
        4    1          N        0        0
        5    1          N        0        0
        6    1          N        0        0
       1b    1          N        2 0.666667
       2b    1          N        2 0.666667
       3b    1          N        2 0.666667
       4b    1          N        2 0.666667
       5b    1          N        2 0.666667
       6b    1          N        2 0.666667
      CHR  SNP   N_MISS   F_MISS
        1 snp1        0        0
        1 snp2        6      0.5
        1 snp3        6      0.5
In this not particularly exciting example, there are no missing genotypes that are non-obligatory missing (i.e. that not specified by the two files) -- if there were, it would counted appropriately in the above files, and used to filter appropriately also.

NOTE All subsequent analyses do not distingush whether genotypes were missing due to failure or were obligatory missing -- that is, this option only effects the behavior of the --mind and --geno filters.

NOTE If a genotype is set to be obligatory missing but actually in the genotype file it is not missing, then it will be set to missing and treated as if missing.

Cluster individuals based on missing genotypes

Systematic batch effects that induce missingness in parts of the sample will induce correlation between the patterns of missing data that different individuals display. One approach to detecting correlation in these patterns, that might possibly idenity such biases, is to cluster individuals based on their identity-by-missingness (IBM). This approach use exactly the same procedure as the IBS clustering for population stratification, except the distance between two individuals is based not on which (non-missing) allele they have at each site, but rather the proportion of sites for which two individuals are both missing the same genotype.

To use this option:
plink --file data --cluster-missing

which creates the files:
which have similar formats to the corresponding IBS clustering files. Specifically, the plink.mdist.missing file can be subjected to a visualisation technique such as multidimensinoal scaling to reveal any strong systematic patterns of missingness.

Note The values in the .mdist file are distances rather than similarities, unlike for standard IBS clustering. That is, a value of 0 means that two individuals have the same profile of missing genotypes. The exact value represents the proportion of all SNPs that are discordantly missing (i.e. where one member of the pair is missing that SNP but the other individual is not).

The other constraints (significance test, phenotype, cluster size and external matching criteria) are not used during IBM clustering. Also, by default, all individuals and all SNPs are included in an IBM clustering analysis, unlike IBS clustering, i.e. even individuals or SNPs with very low genotyping, or monomorphic alleles. By explicitly specifying --mind or --geno or --maf certain individuals or SNPs can be excluded (although the default is probably what is usually required for quality control procedures).

Test of missingness by case/control status

To obtain a missing chi-sq test (i.e. does, for each SNP, missingness differ between cases and controls?), use the option:
plink --file mydata --test-missing

which generates a file
which contains the fields
     CHR         Chromosome number
     SNP         SNP identifier
     F_MISS_A    Missing rate in cases
     F_MISS_U    Missing rate in controls
     P           Asymptotic p-value (Fisher's exact test)
The actual counts of missing genotypes are available in the plink.lmiss file, which is generated by the --missing option.

Note This test is only applicable to case/control data.

Haplotype-based test for non-random missing genotype data

The previous test asks whether genotypes are missing at random or not with respect to phenotype. This test asks whether or not genotypes are missing at random with respect to the true (unobserved) genotype, based on the observed genotypes of nearby SNPs.

Note This test assumes dense SNP genotyping such that flanking SNPs are typically in LD with each other. Also bear in mind that a negative result on this test may simply reflect the fact that there is little LD in the region.

This test works by taking a SNP at a time (the 'reference' SNP) and asking whether haplotype formed by the two flanking SNPs can predict whether or not the individual is missing at the reference SNP. The test is a simple haplotypic case/control test, where the phenotype is missing status at the reference SNP. If missingness at the reference is not random with respect to the true (unobserved) genotype, we may often expect to see an association between missingness and flanking haplotypes.

Note Again, just because we might not see such an association does not necessarily mean that genotypes are missing at random -- this test has higher specificity than sensitivity. That is, this test will miss a lot; but, when used as a QC screening tool, one should pay attention to SNPs that show highly significant patterns of non-random missingness.

This option is run with the command:
plink --file data --test-mishap

which generates an output file called
which has the fields
     LOCUS        Reference SNP
     HAPLOTYPE    Flanking haplotype, or heterozygosity
     F_0          Frequency of HAPLOTYPE if missing reference SNP
     F_1          Frequency of HAPLOTYPE if not missing reference SNP
     M_H1         N missing/not missing for HAPLOTYPE
     M_H2         N missing/not missing for not-HAPLOTYPE
     CHISQ        Chisquare test for non-random missingness
     P            Asymptotic p-value
     SNPS         Identifier for flanking SNPs
The HAPLOTYPE typically represents each two-SNP flanking haplotype (i.e. not including the reference SNP itself); each reference SNP will also have a row labelled HETERO in this column, which means we are testing whether or not being heterozygous for the flanking haplotypes (which would, under many sets of haplotype frequencies, increase the chance of being heterozygous for the reference SNP). SNPs with no or very little missing genotype data are skipped. Only haplotypes above the --maf threshold are used in analysis.

Here is an example from real data (rows split into two sets for clarity):
     LOCUS         HAPLOTYPE        F_0       F_1         M_H1         M_H2    
     rs17012390           CT     0.5238   0.01949       55/104      50/5233    
     rs17012390           TC     0.4762    0.9805      50/5233       55/104    
     rs17012390       HETERO          1   0.04252       56/114       0/2567    

     LOCUS         HAPLOTYPE      CHISQ         P  SNPS
     rs17012390           CT      923.4         0  rs17012387|rs17012393
     rs17012390           TC      923.4         0  rs17012387|rs17012393
     rs17012390       HETERO      863.3         0  rs17012387|rs17012393
This clearly shows a huge chi-square (the sample is large, N of over 2500 individuals). We see that of 56 missing genotypes for this reference SNP, all occur when the flanking haplotypic background is heterozygous (i.e. M_H1 shows 56/114, indicating that there are 114 other instances of a heterozygous haplotypic background when the reference SNP is not missing) whereas we see not a single missing call when the flanking SNP background is homozygous, of which we see 2567 observations. This is clearly indicative of non-random association between the unobserved genotype and missing status.

Looking at the same data a different way, F_1 indicates that the majority of the sample (people not missing at the reference SNP) have haplotype frequencies of CT and TC haplotypes at approximately 0.02 and 0.98 respectively). In contrast, because all people missing this SNP are on heterozygous backgrounds, these frequencies become approximately 50:50 in this group (shown in F_0).

In the particular dataset this example comes from, this SNP would have passed a standard quality control test. The --hardy command shows that this SNP does not failure the HWE test; also, it does not show excessive amounts of missing data (the --missing command indicates a missing rate of 0.021). The genotype counts (obtained by the --hardy option) are, for the whole sample, 0/104/2584.

In contrast, here are the same results for a different SNP that does not show any evidence of non-random missingness.
          LOCUS    HAPLOTYPE        F_0       F_1         M_H1         M_H2    
      rs3912752           CC    0.07692   0.06507        2/354      24/5086
      rs3912752           TT     0.1154     0.205       3/1115      23/4325
      rs3912752           CT     0.8077      0.73      21/3971       5/1469
      rs3912752       HETERO     0.2308    0.4279       3/1164      10/1556

          LOCUS    HAPLOTYPE      CHISQ          P   SNPS
      rs3912752           CC    0.05967      0.807   rs3912751|rs351596
      rs3912752           TT      1.276     0.2586   rs3912751|rs351596
      rs3912752           CT     0.7938     0.3729   rs3912751|rs351596
      rs3912752       HETERO      2.056     0.1516   rs3912751|rs351596
Here we do not see any deviation between the flanking haplotype frequencies between people missing versus genotyped for the reference SNP. Of course, there is less missingness for this SNP (26 missing genotypes) so we might expect power is lower, even if there were non-random missingness. This only highlights the point made above that, in general, significant results are more interpretable than non-signficant results for this test. But more importantly, if there are only a handful of missing genotypes, we do not particular care whether or not they are missing at random, as they would not bias the association with disease in any case. Of course, whether there is non-random genotyping error is another question...

By default, we currently just select exactly two flanking SNPs. This can be changed with the option --mishap-window. For example,
plink --bfile mydata --test-mishap --mishap-window 4

Future releases will feature a more intelligent selection of flanking markers.

Note This routine currently skips the SNPs on the X and Y chromosomes.

Hardy-Weinberg Equilibrium

To generate a list of genotype counts and Hardy-Weinberg test statistics for each SNP, use the option:
plink --file data --hardy

which creates a file:
This file has the following format
     SNP             SNP identifier
     TEST            Code indicating sample
     A1              Minor allele code
     A2              Major allele code
     GENO            Genotype counts: 11/12/22 
     O(HET)          Observed heterozygosity
     E(HET)          Expected heterozygosity
     P               H-W p-value
For case/control samples, each SNP will have three entries (rows) in this file, with TEST being either ALL, AFF (cases only) or UNAFF (controls only). For quantitative traits, only a single row will appear for each SNP, labelled ALL(QT).

Only founders are considered for the Hardy-Weinberg calculations -- ie. for family data, any offspring are ignored.

WARNING By default, this procedure only considers founders, so no HW results would be given for sibling-only datasets (i.e. if no parents exist). To perform a rough, somewhat biased test, use the --nonfounders option which means that all individuals will be included. Alternatively, manually extract one person per family for this calculation and recode these individuals as founders (see the --keep option to facilitate this).

The default test is an exact one, described and implemented by Wigginton et al (see reference below), which is more accurate for rare genotypes. You can still perform the standard asymptotic test with the --hardy2 option.

A Note on Exact Tests of Hardy-Weinberg Equilibrium.
Wigginton JE, Cutler DJ and Abecasis GR
Am J Hum Genet (2005) 76: 887-93

Allele frequency

To generate a list of minor allele frequencies (MAF) for each SNP, based on all founders in the sample:
plink --file data --freq

will create a file:
with five columns:
     CHR       Chromosome
     SNP       SNP identifier
     A1        Allele 1 code (minor allele)
     A2        Allele 2 code (major allele)
     MAF       Minor allele frequency
     NCHROBS   Non-missing allele count

HINT To produce summary of allele frequencies that is stratified by a categorical cluster variable, use the --within filename option as well as --missing. In this way, the frequencies will be given separately for each level of the categorical variable. Details on the format of a cluster file can be found here.

NOTE If a SNP fails the genotyping rate threshold (as set by the --geno value, which is by default 0.0, i.e. no SNPs will fail) the frequency will appear as NA in the plink.frq output file. To obtain frequencies on all SNPs irrespective of genotyping rate, set --mind 1.

Linkage disequilibrium based SNP pruning

Sometimes it is useful to generate a pruned subset of SNPs that are in approximate linkage equilibrium with each other. This can be achieved via two commands: --indep which prunes based on the variance inflation factor (VIF), which recursively removes SNPs within a sliding window; second, --indep-pairwise which is similar, except it is based only on pairwise genotypic correlation.

Hint The output of either of these commands is two lists of SNPs: those that are pruned out and those that are not. A separate command using the --extract or --exclude option is necessary to actually perform the pruning.

The VIF pruning routine is performed:
plink --file data --indep 50 5 2

will create files
Each is a simlpe list of SNP IDs; both these files can subsequently be specified as the argument for a --extract or --exclude command.

The parameters for --indep are: window size in SNPs (e.g. 50), the number of SNPs to shift the window at each step (e.g. 5), the VIF threshold. The VIF is 1/(1-R^2) where R^2 is the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously. That is, this considers the correlations between SNPs but also between linear combinations of SNPs. A VIF of 10 is often taken to represent near collinearity problems in standard multiple regression analyses (i.e. implies R^2 of 0.9). A VIF of 1 would imply that the SNP is completely independent of all other SNPs. Practically, values between 1.5 and 2 should probably be used; particularly in small samples, if this threshold is too low and/or the window size is too large, too many SNPs may be removed.

The second procedure is performed:
plink --file data --indep-pairwise 50 5 0.5

This generates the same output files as the first version; the only difference is that a simple pairwise threshold is used. The first two parameters (50 and 5) are the same as above (window size and step); the third parameter represents the r^2 threshold. Note: this represents the pairwise SNP-SNP metric now, not the multiple correlation coefficient; also note, this is based on the genotypic correlation, i.e. it does not involve phasing.

To give a concrete example: the command above that specifies 50 5 0.5 would a) consider a window of 50 SNPs, b) calculate LD between each pair of SNPs in the window, b) remove one of a pair of SNPs if the LD is greater than 0.5, c) shift the window 5 SNPs forward and repeat the procedure.

To make a new, pruned file, then use something like (in this example, we also convert the standard PED fileset to a binary one):
plink --file data --extract --make-bed --out pruneddata


Mendel errors

To generate a list of Mendel errors for SNPs and families, use the option:
plink --file data --mendel

which will create files:
The *.mendel file contains all Mendel errors (i.e. one line per error); the *.imendel file contains a summary of per-individual error rates; the *.fmendel file contains a summary of per-family error rates; the *.lmendel file contains a summary of per-SNP error rates.

The *.mendel file has the following columns:
     FID            Family ID
     KID            Child individual ID
     CHR            Chromosome
     SNP            SNP ID
     CODE           A numerical code indicating the type of error (see below)
     ERROR          Description of the actual error
The error codes are as follows:
     Code   Pat  ,   Mat   ->    Offspring

     1      AA   ,   AA    ->    AB       
     2      BB   ,   BB    ->    AB

     3      BB   ,   **    ->    AA
     4      **   ,   BB    ->    AA
     5      BB   ,   BB    ->    AA

     6      AA   ,   **    ->    BB
     7      **   ,   AA    ->    BB
     8      AA   ,   AA    ->    BB

     9      **   ,   AA    ->    BB    (X chromosome male offspring)
     10     **   ,   BB    ->    AA    (X chromosome male offspring)

The *.lmendel file has the following columns:
     CHR            Chromosome
     SNP            SNP ID
     N              Number of Mendel errors for this SNP

The *.imendel file has the following columns:
     FID            Family ID
     IID            Individual ID
     N              Number of errors this individual was implicated in
The following heurtistic is used to provide a rough estimate of Mendel error rare 'per individual': error types 1 and 2 count for all 3 individuals (child, father, mother); error types 5 and 8 count only for the child (i.e. otherwise requires two errors, one in each parent); error types 3 and 6 count for the child and the father; all other types (4, 7, 9 and 10) count for the offspring and the mother. This metric might indicate that, for example, in a nuclear family with two parents and two offspring, many more Mendel errors can be associated with the first sibling; the remaining trio might not show any increased rate.

Currently, PLINK only scans full trios for Mendel errors. Families with fewer than 2 parents in the dataset will not be tested.

Finally, the *.fmendel file has the following columns:
     FID            Family ID
     PAT            Paternal individual ID
     MAT            Maternal individual ID
     CHLD           Number of offspring in this (nuclear) family
     N              Number of Mendel errors for this (nuclear) family

Sex check

This option uses X chromosome data to determine sex (i.e. based on heterozygosity rates) and flags individuals for whom the reported sex in the PED file does not match the estimated sex (given genomic data). To run this analysis, use the flag:
plink --bfile data --check-sex

which generates a file
which contains the fields
     FID     Family ID
     IID     Individual ID
     PEDSEX  Sex as determined in pedigree file (1=male, 2=female)
     SNPSEX  Sex as determined by X chromosome
     STATUS  Displays "PROBLEM" or "OK" for each individual
     F       The actual X chromosome inbreeding (homozygosity) estimate
A PROBLEM arises if the two sexes do not match, or if the SNP data or pedigree data are ambiguous with regard to sex. A male call is made if F is more than 0.8; a femle call is made if F is less than 0.2.

The command
plink --bfile data --impute-sex --make-bed --out newfile

will impute the sex codes based on the SNP data, and create a new file with the revised assignments, in this case a new binary fileset.

Pedigree errors

PLINK can accept multigenerational family data for family-based tests and Mendel error checks. It will break multigenerational families down into nuclear family units where appropriate. Extended family information is not used in an optimal manner, however (e.g. to help find Mendel errors using grandparental genotypes if parental genotypes are missing).

Unless PLINK is explicitly told to perform a family-based analysis, it will ignore any pedigree structure in the sample and analyse the data as if all individuals are unrelated (i.e. the --assoc option, for example, will ignore family structure). It is therefore the responsibility of the user to ensure that the data are appropriate for the type of test (e.g. if performing a standard association test with --assoc, this implies that all individuals should be unrelated for asymptotic significance values to be correct). The exception to this general rule is that certain summary statistics are based only on founders.

PLINK will spot most pedigree errors (e.g. if an individual has two fathers, for example). For a more comprehensive evaluation of pedigree errors (invalid or incompletely specified pedigree structures) please use a different software package such as PEDSTATS or famtypes.

This document last modified Thursday, 01-Oct-2009 08:00:57 EDT