PLINK: Whole genome data analysis toolset plink...
Latest PLINK release is v1.03 (10-Jun-2008)

Whole genome association analysis toolset


Inclusion thresholds

This section describes options that can be used to filter out individuals or SNPs on the basis of the summary statistic measures described in the previous summary statistics page.
Summary statistics versus inclusion criteria
The following table summarizes the relationship between the commands to generate summary statistics (as described on the previous page) and the commands to exclude individuals and/or markers, which are described on this page.

     Feature                        As summary statistic     As inclusion criteria
     Missingness per individual     --missing                --mind N
     Missingness per marker         --missing                --geno N
     Allele frequency               --freq                   --maf N
     Hardy-Weinberg equilibrium     --hardy                  --hwe N
     Mendel error rates             --mendel                 --me N M
Default threshold values
How the default thresholds are set depends on the type of procedure the user has requested. For most analyses, e.g. stratification or association analysis, the default thresholds are:
     Less than 0.10            Missing rate per individual          --mind
     Less than 0.10            Missing rate per SNP                 --geno
     Greater than 0.01         Minor allele frequency               --maf
However, for options that generate new versions of the original data, the inclusion thresholds are automatically set to include all individuals and all SNPs, unless otherwise instructed. This is typically what is required: when removing a subset of individuals from the sample with the --remove option, for example, one does not usually wish to also exclude SNPs with minor allele frequencies less than 0.01. If the --mind, --geno or --maf options are set explicitly on the command line when, for example, the --remove command is given, then they will be acted upon. In this way, one can create new datasets based on these criteria by combining the --recode or --make-bed options with thresholds specified explicitly on the command line.

The --hwe and --me options are not run automatically, the way the above three options are. These options must be explicitly requested.
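
The rule above can be summarised in a few lines of Python. This is only an illustrative sketch, not PLINK code: the function name effective_thresholds and the "include everything" values (1.0 for the two missingness thresholds, 0.0 for MAF) are assumptions made for the example.

```python
# Thresholds used for most analyses (values taken from the table above).
ANALYSIS_DEFAULTS = {"mind": 0.1, "geno": 0.1, "maf": 0.01}

# Values that exclude nothing; assumed here to stand for "include all
# individuals and all SNPs".
INCLUDE_ALL = {"mind": 1.0, "geno": 1.0, "maf": 0.0}


def effective_thresholds(generates_new_data, explicit=None):
    """Return the thresholds in force for one run (hypothetical helper).

    generates_new_data: True for commands that write a new dataset
                        (e.g. --recode, --make-bed).
    explicit:           thresholds given explicitly on the command line.
    """
    base = INCLUDE_ALL if generates_new_data else ANALYSIS_DEFAULTS
    # Anything set explicitly always overrides the base values.
    return {**base, **(explicit or {})}


print(effective_thresholds(False))                 # an analysis run
print(effective_thresholds(True))                  # a data-generating run
print(effective_thresholds(True, {"mind": 0.1}))   # explicit --mind 0.1
```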

Missing rate per person

The initial step in all data analysis is to exclude individuals with too much missing genotype data. This option is set as follows:
plink --file mydata --mind 0.1

which means exclude individuals with more than 10% missing genotypes (this is the default value). A line in the terminal output will appear, indicating how many individuals were removed due to low genotyping. If any individuals were removed, a file called
     plink.irem     
will be created, listing the Family and Individual IDs of these removed individuals. Any subsequent analysis also specified on the same command line will be performed without these individuals.

One might instead wish to create a new PED file with these individuals permanently removed. To do so, simply add an option to generate a new fileset: for example,
plink --file data --mind 0.1 --recode --out cleaned

will generate files
     cleaned-recode.ped
     cleaned-recode.map
with the high-missing-rate individuals removed; alternatively, to create a binary fileset with these individuals removed:
plink --file data --mind 0.1 --make-bed --out cleaned

which results in the files
     cleaned.bed
     cleaned.bim
     cleaned.fam

HINT You can specify that certain genotypes were never attempted, i.e. that they are obligatory missing, and these will be handled appropriately by these genotyping rate filters. See the summary statistics page for more details.
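
The per-individual filter described in this section can be sketched as follows. This is a toy illustration of the logic only, not PLINK's implementation: the data layout (a dictionary keyed by Family and Individual ID, with None for a missing call) and the function names are assumptions, and obligatory missing genotypes are not handled.

```python
# Toy data: (Family ID, Individual ID) -> genotype calls at 10 SNPs.
# None marks a missing call (an encoding chosen for this sketch).
genotypes = {
    ("FAM1", "IND1"): ["AA", "AG", "GG", "AG", "AA", "CC", "CT", "TT", "CC", None],
    ("FAM2", "IND1"): ["AA", None, "GG", None, "AA", "CC", "CT", "TT", "CC", "CT"],
    ("FAM3", "IND1"): ["AA", "AG", "GG", "AG", "AA", "CC", "CT", "TT", "CC", "CT"],
}


def missing_rate(calls):
    """Fraction of genotype calls that are missing."""
    return sum(call is None for call in calls) / len(calls)


def filter_individuals(genotypes, mind=0.1):
    """Split individuals into kept and removed using a --mind style threshold.

    An individual is removed when MORE than `mind` of its calls are missing.
    """
    kept, removed = {}, []
    for ids, calls in genotypes.items():
        if missing_rate(calls) > mind:
            removed.append(ids)   # these IDs would be listed in plink.irem
        else:
            kept[ids] = calls
    return kept, removed


kept, removed = filter_individuals(genotypes, mind=0.1)
print("removed:", removed)
```

In this toy data the first individual has exactly 10% missing calls and is kept, because the rule is "more than 10%"; the second has 20% missing and is removed.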

Allele frequency

Once individuals with too much missing genotype data have been excluded, subsequent analyses can be set to automatically exclude SNPs on the basis of MAF (minor allele frequency):
plink --file mydata --maf 0.05

means only include SNPs with MAF >= 0.05. The default value is 0.01. This quantity is based only on founders (i.e. individuals for whom the paternal and maternal individual codes are both 0).

This option appropriately counts alleles for X and Y chromosome SNPs.
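
A founders-only minor allele frequency can be sketched as below for a single autosomal SNP. This is a simplified illustration under stated assumptions, not PLINK's code: the tuple layout and function names are invented for the example, and the special allele counting for X and Y chromosome SNPs is deliberately left out.

```python
from collections import Counter

# One autosomal SNP: (paternal ID, maternal ID, genotype).
# A parental code of "0" means that parent is not in the dataset.
sample = [
    ("0", "0", "AG"),    # founder
    ("0", "0", "AA"),    # founder
    ("0", "0", "AA"),    # founder
    ("0", "0", None),    # founder with a missing genotype (ignored)
    ("P1", "M1", "GG"),  # non-founder: not counted
]


def founder_maf(sample):
    """Minor allele frequency among founders with non-missing genotypes."""
    counts = Counter()
    for paternal, maternal, genotype in sample:
        if paternal == "0" and maternal == "0" and genotype is not None:
            counts.update(genotype)   # adds one count per allele
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0                    # no data, or monomorphic in founders
    return min(counts.values()) / total


def passes_maf(sample, maf=0.01):
    """True if the SNP would be kept under a --maf style threshold."""
    return founder_maf(sample) >= maf


print(founder_maf(sample), passes_maf(sample, 0.05))
```

Here the founders carry five A alleles and one G allele, so the MAF is 1/6; the non-founder GG genotype does not contribute.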

Missing rate per SNP

Subsequent analyses can be set to automatically exclude SNPs on the basis of missing genotype rate, with the --geno option: the default is to exclude SNPs with more than 10% missing. To include all SNPs:
plink --file mydata --geno 1

(i.e. exclude SNPs with more than 100% missing, which means that all SNPs are included). As with the --maf option, these counts are calculated after removing individuals with high missing genotype rates.
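
Because the per-SNP counts are taken after high-missingness individuals have been removed, the order of the two filters matters. The sketch below uses a hypothetical genotype matrix (rows are individuals, columns are SNPs, None is a missing call) and thresholds chosen only to make the effect visible; it illustrates the order of operations and is not PLINK's implementation.

```python
def snp_missing_rates(rows):
    """Per-SNP fraction of missing calls across the given individuals."""
    n_snps = len(rows[0])
    return [sum(row[j] is None for row in rows) / len(rows) for j in range(n_snps)]


def individual_missing_rate(row):
    """Fraction of missing calls for one individual."""
    return sum(call is None for call in row) / len(row)


rows = [
    [None, "AG", "CC"],
    ["AA", "AG", "CT"],
    ["AA", "GG", "CC"],
    [None, None, None],   # a failed sample: every call is missing
]
mind, geno = 0.5, 0.4     # illustrative thresholds, not the defaults

# Step 1: remove individuals with too much missing data.
kept_rows = [row for row in rows if individual_missing_rate(row) <= mind]

# Step 2: compute per-SNP missingness on the remaining individuals only.
kept_snps = [j for j, rate in enumerate(snp_missing_rates(kept_rows)) if rate <= geno]

# For comparison: the SNP filter applied to everyone, failed sample included.
naive_snps = [j for j, rate in enumerate(snp_missing_rates(rows)) if rate <= geno]

print(kept_snps, naive_snps)
```

With the failed sample removed first, all three SNPs pass; computed over all four individuals, the first SNP would have been excluded.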

Hardy-Weinberg Equilibrium

To exclude markers that fail the Hardy-Weinberg test at a specified significance threshold, use the option:
plink --file mydata --hwe 0.001

By default this filter uses an exact test (see this section). The standard asymptotic test (1 df genotypic chi-squared) can be requested with the --hwe2 option instead of --hwe.

The following output will appear in the console window and in plink.log, detailing how many SNPs failed the Hardy-Weinberg test, for the sample as a whole, and (when PLINK has detected a disease phenotype) for cases and controls separately:
Writing Hardy-Weinberg tests (founders-only) to [ plink.hwe ]
30 markers failed HWE test ( p <= 0.05 ) and have been excluded
        34 markers failed HWE test in cases
        30 markers failed HWE test in controls
This test will only be based on founders (if family-based data are being analysed) unless the --nonfounders option is also specified. In case/control samples, this test will be based on controls only, unless the --hwe-all option is specified, in which case the phenotype will be ignored. This can be important if parents are coded as missing in an affected offspring trio sample.

Please refer to the --hardy option for more details on producing summary statistics of all HWE rates.
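
For intuition, the asymptotic 1 df chi-squared test mentioned above can be sketched in a few lines. This is a textbook version written for illustration, not PLINK's code, and it is not the exact test that --hwe uses by default; the function names are assumptions. The p-value uses the identity that the upper tail of a 1 df chi-squared distribution at x equals erfc(sqrt(x/2)).

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Asymptotic 1 df chi-squared HWE p-value from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0                    # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail probability of a chi-squared variable with 1 df.
    return math.erfc(math.sqrt(chisq / 2))


def fails_hwe(n_aa, n_ab, n_bb, threshold=0.001):
    """True if the marker would be excluded at the given threshold."""
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) <= threshold


print(hwe_chisq_pvalue(25, 50, 25))   # counts in exact HWE proportions
print(fails_hwe(50, 0, 50))           # no heterozygotes at all
```

In a case/control sample, the counts passed in would by default be those of controls only, as described above.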

Mendel error rate

For family-based data only, to exclude individuals and/or markers on the basis of Mendel error rate, use the option:
plink --file mydata --me 0.05 0.1

where the two parameters are:
  1. the first parameter determines that families with more than 5% Mendel errors (considering all SNPs) will be discarded;
  2. the second parameter indicates that SNPs with more than 10% Mendel error rate will be excluded (i.e. based on the number of trios).
Please refer to the summary statistics page for more details on generating summary statistics for Mendel error rates.
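
The two thresholds can be illustrated with a toy calculation on complete trios. This is a simplified sketch, not PLINK's implementation: the data layout, the function names, and the choice of denominators (all SNPs per family, all trios per SNP) are assumptions, and missing genotypes are not handled. Both rates are computed on the full dataset in a single pass.

```python
def is_mendel_error(father, mother, child):
    """True if the child genotype cannot be formed from one paternal
    and one maternal allele (autosomal SNP, complete genotypes)."""
    return not any(
        sorted(f + m) == sorted(child) for f in father for m in mother
    )


# family -> (father, mother, child) genotypes at 4 SNPs (toy data)
trios = {
    "F1": (["AA", "AG", "CC", "TT"], ["AA", "GG", "CT", "TT"], ["AA", "AG", "CT", "TT"]),
    "F2": (["AA", "AA", "CC", "TT"], ["AA", "AA", "CC", "TT"], ["AG", "AA", "CC", "TT"]),
}
n_snps = 4

# errors[family][j] is True when the trio shows a Mendel error at SNP j.
errors = {
    fam: [is_mendel_error(f[j], m[j], c[j]) for j in range(n_snps)]
    for fam, (f, m, c) in trios.items()
}

family_rate = {fam: sum(flags) / n_snps for fam, flags in errors.items()}
snp_rate = [sum(errors[fam][j] for fam in trios) / len(trios) for j in range(n_snps)]

# Equivalent of --me 0.05 0.1
failing_families = [fam for fam, rate in family_rate.items() if rate > 0.05]
failing_snps = [j for j, rate in enumerate(snp_rate) if rate > 0.1]

print(failing_families, failing_snps)
```

In this toy data, family F2 has an AG child from two AA parents at the first SNP, so both the family and that SNP exceed their thresholds.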

Note Currently, PLINK calculates the per SNP Mendel error rates at the same time as the per family error rates. In future releases, this may change such that the per family error rate is calculated after SNPs failing this test have been removed.

Also, using this command currently removes entire nuclear families on the basis of high Mendel error rates: it will often be more appropriate to remove particular individuals (e.g. if a second sibling shows no Mendel errors). For this more fine-grained procedure, use the --mendel option to generate a complete enumeration of error rates by family and individual and exclude individuals as desired.

Finally, it is possible to zero out specific Mendelian inconsistencies with the option --set-me-missing. This should be used in conjunction with a data generation command and the --me option. Specifically, the --me parameters should both be set to 1, in order not to exclude any particular SNP or individual/family, but instead to zero out only the specific genotypes with Mendel errors and save the dataset as a new file (both parental and offspring genotypes will be set to missing):
plink --bfile mydata --me 1 1 --set-me-missing --make-bed --out newdata

 

This document last modified Tuesday, 15-Jul-2008 22:30:22 EDT