PLINK: Whole genome data analysis toolset
Latest PLINK release is v1.03 (10-Jun-2008)

Whole genome association analysis toolset

Resources available for download

This page contains links to several freely-available resources, mostly generated by other individuals. All these resources are provided "as is", without any guarantees regarding their correctness or utility.

The Phase 2 HapMap as a PLINK fileset

The HapMap genotype data (release 22) are available here as PLINK binary filesets. The SNPs are currently coded according to NCBI build 36 coordinates on the forward strand. Several versions are available, as listed in the table below. The entire dataset is a single, very large fileset: you will need a computer with at least 2 GB of RAM to load it.

The filtered SNP set refers to a list of SNPs that have MAF greater than 0.01 and genotyping rate greater than 0.95 in the 60 CEU founders. This fileset is probably a good starting place for imputation in samples of European descent. Filtered versions of the other HapMap panels will be made available shortly.
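
The two filtering criteria described above can be illustrated with a minimal Python sketch. This is not how the files were actually generated; the genotype coding (allele counts 0/1/2, with None for a missing call), the function names and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Assumed coding: each genotype is the count of one allele (0, 1 or 2);
# None marks a missing call.
Genotypes = List[Optional[int]]


def genotyping_rate(calls: Genotypes) -> float:
    """Fraction of individuals with a non-missing genotype."""
    if not calls:
        return 0.0
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls: Genotypes) -> float:
    """Frequency of the rarer allele among the non-missing genotypes."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def filter_snps(data: Dict[str, Genotypes],
                min_maf: float = 0.01,
                min_rate: float = 0.95) -> List[str]:
    """Return the SNP IDs with MAF > min_maf and genotyping rate > min_rate."""
    return [snp for snp, calls in data.items()
            if minor_allele_frequency(calls) > min_maf
            and genotyping_rate(calls) > min_rate]


# Toy data for 20 hypothetical founders (SNP names are made up).
toy = {
    "snp_common": [0, 1, 2, 1] * 5,    # MAF 0.5, fully genotyped
    "snp_monomorphic": [0] * 20,       # MAF 0, fails the MAF threshold
    "snp_missing": [1, None] * 10,     # genotyping rate 0.5, fails the rate threshold
}
kept = filter_snps(toy)
print(kept)
```

Within PLINK itself, comparable thresholds are normally applied with the --maf and --geno options; note that --geno takes a maximum missing rate, so a genotyping rate of 0.95 corresponds to a value of 0.05.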

Thanks to Paul de Bakker for generating these files.

     Description                                                      File size   File name
     Entire HapMap (270 individuals, 3.96 million SNPs)               110M        hapmap_r22.zip
     CEU founders (60 individuals, 3.96 million SNPs)                 49M         hapmap-ceu-all.zip
     CEU founders (60 individuals, filtered 2.2 million SNPs)         29M         hapmap-ceu.zip
     CEU founders (as above, files split by chromosome, 1-22 and X)   29M         hapmap-ceu-by-chr.zip

Teaching materials and example dataset

A tutorial can be downloaded from here; the material is similar to the online tutorial but slightly more involved. As it currently stands, it is designed to first use gPLINK to perform a set of basic tests and QC procedures and then move to standard PLINK for more in-depth analysis.

It is designed to work on a standard modern laptop computer or equivalent desktop. It was written for version 1.02 of PLINK, but should remain compatible with future releases.

     Description                                 File size   File name
     ZIP archive containing data                 15M         example.zip
     ZIP archive containing teaching materials   1.3M        teaching.zip
Feel free to use, modify or distribute these files in any way you wish, although giving me appropriate credit for the materials would be appreciated.

The example.zip archive contains
     wgas1.ped              Whole-genome SNP data example PED file
     wgas1.map              Corresponding MAP file
     extra.ped              Follow-up genotyping for a particular region
     extra.map              Corresponding MAP file
     pop.cov                Population membership variable
     command-list.txt       List of all commands for 2nd part of practical
The teaching.zip archive contains a PowerPoint and a Word file:
     practical-1-slides.ppt
     practical-2-notes.doc     
These two files cover the first and second half of the tutorial respectively. The second document assumes the first half has already been completed (but also contains some introductory remarks concerning the data). I will probably update the Word document to also include the early commands covered in the PowerPoint/gPLINK part (i.e. so that the entire practical can be performed from the command line rather than using gPLINK).

The list of commands (command-list.txt) is included so that people can cut-and-paste commands rather than type them. If using DOS, it is a good idea to first increase the window width: right-click on the header of the DOS window, select Properties, then Layout, and increase the buffer and window width to around 120 characters.

Everything should be fairly self-explanatory after looking through the PowerPoint file and Word document.

Multimarker test lists

These files, generated by Itsik Pe'er and others, facilitate the 'multi-marker predictor' approach to association testing, as described in the manuscript:
     Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D 
     & Daly MJ (2006) Evaluating and improving power in whole-genome 
     association studies using fixed marker sets. Nat Genet, 38(6): 663-7.
They are PLINK-formatted lists of multimarker tests selected for Affymetrix 500K and Illumina whole genome products, based on consideration of the CEU Phase 2 HapMap (at an r-squared threshold of 0.8). To use them, download the appropriate file and run it with the --hap option, after ensuring that any strand issues have been resolved.

Note: These haplotypes are specified in terms of the +ve (positive) strand relative to the HapMap. You might need to reformat your data (using the --flip command, for instance) before you can use these files.
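
As a rough illustration of what resolving a strand issue involves, the sketch below complements the alleles of a chosen set of SNPs. It is only a Python approximation of the idea behind flipping strand, not PLINK's own implementation; the function names, the SNP names and the use of '0' as the missing-allele code are assumptions made for this example.

```python
from typing import Dict, Set, Tuple

# Watson-Crick complements; any other code (e.g. '0' for a missing allele)
# is passed through unchanged.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_allele(allele: str) -> str:
    """Return the complementary base, leaving unrecognised codes as they are."""
    return COMPLEMENT.get(allele.upper(), allele)


def flip_snps(genotypes: Dict[str, Tuple[str, str]],
              snps_to_flip: Set[str]) -> Dict[str, Tuple[str, str]]:
    """Complement both alleles for every SNP named in snps_to_flip."""
    flipped = {}
    for snp, (a1, a2) in genotypes.items():
        if snp in snps_to_flip:
            flipped[snp] = (flip_allele(a1), flip_allele(a2))
        else:
            flipped[snp] = (a1, a2)
    return flipped


# One individual's genotypes at three hypothetical SNPs.
toy = {"snp1": ("A", "G"), "snp2": ("C", "C"), "snp3": ("0", "0")}
result = flip_snps(toy, {"snp1", "snp3"})
print(result)
```

One caveat worth keeping in mind: A/T and C/G SNPs carry the same pair of allele codes on either strand, so the allele codes alone cannot reveal whether such a SNP needs flipping.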

Note: These tables list all tags for every common HapMap SNP, at the given r-squared threshold. The same haplotype may therefore appear multiple times (i.e. if it tags more than 1 SNP).

Note: These tables obviously assume that all tags are present in the final, post-quality-control dataset. If certain SNPs have been removed, it will be better to reselect the predictors; that is, these lists should really only be used as a first pass, for convenience.

In general, however, quite possibly an easier and better strategy is to analyse the data within an imputation context instead, e.g. utilising the proxy association procedures rather than using these fixed lists.

Gene sets

Here is a PLINK-format SET file containing a genome-wide set of genes (N=18272). The co-ordinates are based on the NCBI B36 assembly, dbSNP 126. A gene is arbitrarily defined as including 50 kb upstream and downstream, a window chosen without respect to linkage disequilibrium. Clearly there is room for a more intelligent, LD-informed mapping of SNPs to genes here, but this file is provided in the interim as a rough-and-ready starting point.

    Download (ZIP archive): gene-list.zip
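
The gene definition used above (gene boundaries extended by 50 kb on each side) can be sketched in Python as follows. This is an illustrative reconstruction, not the script used to build gene-list.zip; the gene and SNP names and their coordinates are invented, and the output layout (set name, one SNP ID per line, terminated by END) is the SET-file layout assumed here.

```python
from typing import Dict, List, NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # base-pair position of the gene start
    end: int     # base-pair position of the gene end


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    pos: int     # base-pair position


def snps_per_gene(genes: List[Gene], snps: List[Snp],
                  flank: int = 50_000) -> Dict[str, List[str]]:
    """Assign each SNP to every gene whose flank-extended window contains it."""
    sets: Dict[str, List[str]] = {}
    for gene in genes:
        lo, hi = gene.start - flank, gene.end + flank
        sets[gene.name] = [s.snp_id for s in snps
                           if s.chrom == gene.chrom and lo <= s.pos <= hi]
    return sets


def format_set_file(sets: Dict[str, List[str]]) -> str:
    """Render the sets as text: set name, SNP IDs, then END, with a blank line between sets."""
    lines: List[str] = []
    for name, ids in sets.items():
        lines.append(name)
        lines.extend(ids)
        lines.append("END")
        lines.append("")
    return "\n".join(lines)


# Hypothetical genes and SNPs, purely for illustration.
genes = [Gene("GENE_A", "1", 100_000, 120_000),
         Gene("GENE_B", "2", 500_000, 510_000)]
snps = [Snp("snp1", "1", 60_000),    # inside GENE_A's upstream flank
        Snp("snp2", "1", 171_000),   # just beyond GENE_A's downstream flank
        Snp("snp3", "2", 505_000),   # inside GENE_B itself
        Snp("snp4", "1", 505_000)]   # right position for GENE_B, wrong chromosome

sets = snps_per_gene(genes, snps)
set_text = format_set_file(sets)
print(set_text)
```

A simple linear scan like this is adequate for a toy example; for a genome-wide SNP list, sorting SNPs by position within each chromosome and using a binary search would be a more practical choice.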


This document last modified Wednesday, 11-Jun-2008 18:25:10 EDT