PLINK: Whole genome data analysis toolset
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset


Resources available for download

This page contains links to several freely-available resources, mostly generated by other individuals. All these resources are provided "as is", without any guarantees regarding their correctness or utility.

The Phase 2 HapMap as a PLINK fileset

The HapMap genotype data (the latest is release 23) are available here as PLINK binary filesets. The SNPs are currently coded according to NCBI build 36 coordinates on the forward strand. Several versions are available, including the entire dataset (a single, very large fileset: you will need a computer with at least 2 GB of RAM to load this file).

The filtered SNP set refers to a list of SNPs that have MAF greater than 0.01 and genotyping rate greater than 0.95 in the 60 CEU founders. This fileset is probably a good starting place for imputation in samples of European descent. Filtered versions of the other HapMap panels will be made available shortly.
     Description                                                                File size   File name
     Entire HapMap (release 23, 270 individuals, 3.96 million SNPs)             120M        hapmap_r23a.zip
     CEU (release 23, 90 individuals, 3.96 million SNPs)                        59M         hapmap_CEU_r23a.zip
     YRI (release 23, 90 individuals, 3.88 million SNPs)                        65M         hapmap_YRI_r23a.zip
     JPT+CHB (release 23, 90 individuals, 3.99 million SNPs)                    58M         hapmap_JPT_CHB_r23a.zip
     CEU founders (release 23, 60 individuals, filtered 2.3 million SNPs)       31M         hapmap_CEU_r23a_filtered.zip
     YRI founders (release 23, 60 individuals, filtered 2.6 million SNPs)       38M         hapmap_YRI_r23a_filtered.zip
     JPT+CHB founders (release 23, 90 individuals, filtered 2.2 million SNPs)   33M         hapmap_JPT_CHB_r23a_filtered.zip


     Description                                                                  File size   File name
     Entire HapMap (release 22, 270 individuals, 3.96 million SNPs)               110M        hapmap_r22.zip
     CEU founders (release 22, 60 individuals, 3.96 million SNPs)                 49M         hapmap-ceu-all.zip
     CEU founders (release 22, 60 individuals, filtered 2.2 million SNPs)         29M         hapmap-ceu.zip
     CEU founders (release 22, as above, files split by chromosome, 1-22 and X)   29M         hapmap-ceu-by-chr.zip


     Description                                                       File name
     HapMap individuals with population information (FID, IID, POP)    hapmap.pop
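As a rough sketch of how a three-column FID, IID, POP file might be put to use, the following builds a keep list for one population and tallies individuals per population. The file demo.pop and its IDs are made up for illustration (they are not real HapMap IDs), and the output file names are likewise hypothetical.

```shell
# Made-up file in the FID IID POP layout (not real HapMap IDs).
cat > demo.pop <<'EOF'
F1 I1 CEU
F2 I2 YRI
F3 I3 CEU
F4 I4 JPT
EOF

# Keep list: FID and IID of every individual labelled CEU.
awk '$3 == "CEU" { print $1, $2 }' demo.pop > demo_ceu.keep

# Number of individuals per population, sorted by label, as a sanity check.
awk '{ n[$3]++ } END { for (p in n) print p, n[p] }' demo.pop | sort > demo_pop_counts.txt
```

A two-column FID/IID list like this is the layout that the --keep option reads, so a run along the following lines (fileset and output names assumed) should extract just those individuals:

     plink --bfile hapmap_r23a --keep demo_ceu.keep --make-bed --out hapmap_ceu_subset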

Teaching materials and example dataset

A tutorial can be downloaded from here; the material is similar to the online tutorial but slightly more involved. As it currently stands, it is designed to first use gPLINK to perform a set of basic tests and QC procedures and then move to standard PLINK for more in-depth analysis.

It is designed to work on a standard modern laptop computer or equivalent desktop. It was written for version 1.02 of PLINK, but should remain compatible with future releases.

     Description                                  File size   File name
     ZIP archive containing data                  15M         example.zip
     ZIP archive containing teaching materials    1.3M        teaching.zip

You are free to use, modify or distribute these files in any way you wish, although giving me appropriate credit for the materials would be appreciated.

The example.zip archive contains
     wgas1.ped              Whole-genome SNP data example PED file
     wgas1.map              Corresponding MAP file
     extra.ped              Follow-up genotyping for a particular region
     extra.map              Corresponding MAP file
     pop.cov                Population membership variable
     command-list.txt       List of all commands for 2nd part of practical
The teaching.zip archive contains a PowerPoint and a Word file:
     practical-1-slides.ppt
     practical-2-notes.doc     
These two files cover the first and second half of the tutorial respectively. The second document assumes the first half has already been completed (but also contains some introductory remarks concerning the data). I will probably update the Word document to also include the early commands covered in the PowerPoint/gPLINK part (i.e. so that the entire practical can be performed from the command line rather than using gPLINK). The list of commands (command-list.txt) is included so that people can cut-and-paste commands in, rather than type. If using DOS, it is a good idea to first increase the window width (right click on header on DOS window, Properties, Layout and increase buffer and window width to around 120 characters).

Everything should be fairly self-explanatory after looking through the PowerPoint file and Word document.

Multimarker test lists

These files, generated by Itsik Pe'er and others, facilitate the 'multi-marker predictor' approach to association testing, as described in the manuscript:
     Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D 
     & Daly MJ (2006) Evaluating and improving power in whole-genome 
     association studies using fixed marker sets. Nat Genet, 38(6): 605-6.
They are PLINK-formatted lists of multimarker tests selected for Affymetrix 500K and Illumina whole genome products, based on consideration of the CEU Phase 2 HapMap (at r-squared=0.8 threshold). One should download the appropriate file and run with the --hap option (after ensuring that any strand issues have been resolved).

Note These haplotypes are specified in terms of the +ve (positive) strand relative to the HapMap. You might need to reformat your data (using the --flip command, for instance) before you can use these files.
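One possible way to find SNPs that need flipping is to compare allele codes between your own .bim file and a HapMap .bim file. The sketch below is only an illustration: both demo files and their SNP names are made up, and the approach cannot detect flips at A/T or C/G SNPs, because those look identical on either strand. It lists SNPs whose allele pair differs from the reference but matches it after complementing both alleles.

```shell
# Made-up .bim files (columns: chromosome, SNP ID, cM, bp, allele 1, allele 2).
cat > demo_ref.bim <<'EOF'
1 rsA 0 1000 A G
1 rsB 0 2000 C T
1 rsC 0 3000 A C
EOF
cat > demo_mine.bim <<'EOF'
1 rsA 0 1000 T C
1 rsB 0 2000 T C
1 rsC 0 3000 G T
EOF

awk '
# Complement of a single base; anything else is returned unchanged.
function comp(b) { return b == "A" ? "T" : b == "T" ? "A" : b == "C" ? "G" : b == "G" ? "C" : b }
# Order-independent key for an allele pair.
function key(x, y) { return (x < y) ? (x y) : (y x) }
# First file: remember the reference allele pair for each SNP.
NR == FNR { ref[$2] = key($5, $6); next }
# Second file: print SNPs that match the reference only after complementing.
($2 in ref) && key($5, $6) != ref[$2] && key(comp($5), comp($6)) == ref[$2] { print $2 }
' demo_ref.bim demo_mine.bim > demo_flip.list
```

The resulting list (here rsA and rsC) is in the one-SNP-per-line form that --flip reads, for example (fileset names assumed):

     plink --bfile mydata --flip demo_flip.list --make-bed --out mydata_fwd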

Note These tables list all tags for every common HapMap SNP, at the given r-squared threshold. The same haplotype may therefore appear multiple times (i.e. if it tags more than 1 SNP).

Note These tables obviously assume that all tags are present in the final, post-quality-control dataset: i.e. if certain SNPs have been removed, it will be better to reselect the predictors. That is, these lists should really only be used as a first pass, for convenience.

In general, however, an easier and quite possibly better strategy is to analyse the data within an imputation context, e.g. utilising the proxy association procedures rather than using these fixed lists.

Gene sets

NOTE The gene range lists below have replaced this old gene SET file: you are advised to use the lists below rather than this file.

Here is a PLINK-format SET file, containing a genome-wide set of genes (N=18272). The co-ordinates are based on NCBI B36 assembly, dbSNP 126; a gene is arbitrarily defined as including 50kb upstream and downstream.

    Download (ZIP archive): gene-list.zip

Gene range lists

These are gene lists: files containing lists of genes, based on either hg17 or hg18 co-ordinates. The format is one gene per row, with the following fields:
   Chromosome
   Start position (bp)
   Stop position (bp)
   Gene name
These lists can be used with PLINK commands such as --make-set, --range, --gene-list, --cnv-intersect, --clump-range, etc.

These gene lists were downloaded from UCSC table browser for all RefSeq genes on July 24th 2008. Overlapping isoforms of the same gene were combined to form a single full length version of the gene. Isoforms that didn't overlap were left as duplicates of that gene.

Rather than using the gene sets (described above), we suggest using these gene lists to make gene sets on the fly (using --make-set-border, if so desired, to add a fixed kb border around each gene).

    Gene list (hg18): glist-hg18
    Gene list (hg17): glist-hg17
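To illustrate the four-column layout and what a fixed border does to each range, here is a minimal sketch that widens every gene by a given number of kb on both sides. The file demo-glist, its gene names and its positions are made up, and the awk step only mimics the idea of a border; it is not how PLINK implements --make-set-border.

```shell
# Made-up ranges in the four-column layout: chromosome, start (bp), stop (bp), gene name.
cat > demo-glist <<'EOF'
1 10000 25000 GENEA
1 300000 320000 GENEB
X 50000 90000 GENEC
EOF

border_kb=20

# Widen each range by border_kb on both sides, never going below position 1.
awk -v kb="$border_kb" '{
    start = $2 - kb * 1000
    if (start < 1) start = 1
    print $1, start, $3 + kb * 1000, $4
}' demo-glist > demo-glist-border
```

With PLINK itself the border would normally be requested at run time instead, for example (mydata is a placeholder fileset name):

     plink --bfile mydata --make-set glist-hg18 --make-set-border 20 --write-set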

Functional SNP attributes

This file contains a list of codes to indicate the functional status of SNPs. It is designed to be used in conjunction with the --annotate command.

This file was created as follows: we downloaded all data from dbSNP, build 129, and extracted lists of SNPs that are nonsense, frameshift, missense or splice-site variants. We intersected this list with the SNPs available in the Phase 2 CEU HapMap dataset, and selected lists of SNPs that strongly tagged these functional SNPs (r-sq above 0.5; MAF above 0.01). For each HapMap SNP that either is or tags a functional SNP, we created an entry in the file below. Here upper-case indicates that the SNP is itself a coding SNP in HapMap; lower-case indicates that the SNP is in strong LD with a coding variant in HapMap.
     =NONSENSE        =nonsense
     =MISSENSE        =missense
     =FRAMESHIFT      =frameshift
     =SPLICE          =splice
In future, we will post revised attribute files to include more annotations and information (e.g. a version with the rs ID of the functional SNP(s) that is tagged).

    SNP attributes: snp129.attrib.gz

To use the file with the --annotate command, for example:
    plink --annotate myresults.txt  attrib=snp129.attrib.gz
(You can use gunzip, or WinZip, to decompress this file.)
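Before annotating, it can be useful to check how many codes in an attribute file are upper-case (the SNP is itself functional) versus lower-case (the SNP tags a functional variant). The sketch below assumes each row holds a SNP ID followed by one or more =CODE tokens; that layout, the file demo.attrib and its rs IDs are assumptions made for illustration, not taken from the real file.

```shell
# Made-up entries; the layout (SNP ID, then one or more codes) is an assumption.
cat > demo.attrib <<'EOF'
rs0001 =MISSENSE
rs0002 =missense
rs0003 =nonsense =splice
rs0004 =FRAMESHIFT =missense
EOF

# Tally codes by case. LC_ALL=C keeps the [A-Z] and [a-z] ranges strictly ASCII.
LC_ALL=C awk '{
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^=[A-Z]+$/) direct++
        else if ($i ~ /^=[a-z]+$/) tagging++
    }
} END { print "direct", direct + 0; print "tagging", tagging + 0 }' demo.attrib > demo_attrib_counts.txt
```

For the compressed download, the same awk step could be fed from gunzip -c snp129.attrib.gz rather than from a file on disk, provided the real file follows the assumed layout.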
 

This document last modified Tuesday, 06-Oct-2009 19:16:52 EDT