PLINK: Whole genome data analysis toolset plink...
Latest PLINK release is v1.03 (10-Jun-2008)

Whole genome association analysis toolset


SNP simulation routine

PLINK provides an interface to a deliberately simple SNP simulation routine, designed to generate large SNP datasets for population-based case/control studies. It is intended as a convenience for generating data to prototype new methods, compare the power of different approaches, and so on, rather than for producing realistic whole genome data. Critically, all simulated SNPs are unlinked and in linkage equilibrium.

Basic usage

The basic command to simulate a SNP data file is the --simulate option,
./plink --simulate wgas.sim --make-bed --out sim1

which takes as a parameter the name of a file (here wgas.sim) that describes the to-be-simulated data.

The simulation file wgas.sim is as follows:
     100000  null      0.00 1.00  1.00
     100     disease   0.00 1.00  2.00
These files can have one or more rows, where each row has exactly five fields, as follows:
     Number of SNPs in this set
     Label of this set of SNPs
     Lower bound of the allele frequency range
     Upper bound of the allele frequency range
     Odds ratio for disease
Given this file, PLINK would generate 100,000 SNPs with no association with disease (odds ratio of 1.00). Each SNP would have its own population allele frequency, drawn from a uniform distribution over the specified range (here 0.00 to 1.00). In addition, 100 further SNPs would be simulated that are associated with disease (population odds ratio of 2.00).
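As a quick sanity check before running PLINK, the five-field layout described above can be verified with standard Unix tools. The sketch below is not part of PLINK: it recreates the wgas.sim example and uses awk to confirm that every row has five fields, that the lower frequency bound does not exceed the upper, and to total the number of SNPs requested.

```shell
# Recreate the example simulation file from the text.
cat > wgas.sim <<'EOF'
100000  null      0.00 1.00  1.00
100     disease   0.00 1.00  2.00
EOF

# Check the layout and print the total number of SNPs.
# awk exits non-zero on the first malformed row.
total=$(awk '
  NF != 5 { print "row " NR ": expected 5 fields, got " NF > "/dev/stderr"; bad = 1; exit }
  $3 > $4 { print "row " NR ": lower frequency exceeds upper" > "/dev/stderr"; bad = 1; exit }
          { n += $1 }
  END     { if (bad) exit 1; print n }
' wgas.sim)
echo "SNPs specified: $total"
```

For the example file the total should agree with the "100100 SNPs in total" figure that PLINK reports in its LOG file.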

The name of each SNP would follow from the label (which must be unique), with a number appended, e.g.
     null_0
     null_1
     null_2
     ...
     disease_99
An exception is that if a set only contains a single SNP, nothing is appended to the label. This is useful in generating multiple samples from the same population, as described below.
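The naming rule can be mimicked outside PLINK to predict which SNP names a simulation file will produce, for instance when preparing a list of names in advance. This is a sketch of the rule as described here (zero-based suffix, bare label for single-SNP sets), not PLINK's own code; the file names names.sim and expected.names, and the label rs_fixed, are made up for the illustration.

```shell
# A small, hypothetical simulation file with one single-SNP set.
cat > names.sim <<'EOF'
3  null      0.00 1.00  1.00
1  rs_fixed  0.25 0.25  1.50
2  disease   0.00 1.00  2.00
EOF

# Expand each row into the SNP names implied by the naming rule.
awk '{
  if ($1 == 1) print $2                          # single-SNP set: label only
  else for (i = 0; i < $1; i++) print $2 "_" i   # otherwise label_0, label_1, ...
}' names.sim > expected.names
cat expected.names
```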

Obviously, a uniform allele frequency range is not realistic: one could instead specify a series of bins to enrich for rarer SNPs, if so desired, to build a more realistic spectrum of allele frequencies (not that the example below is meant to be more realistic).
     20000  nullA     0.00 0.05  1.00
     10000  nullB     0.05 0.10  1.00
      5000  nullC     0.10 0.20  1.00
     10000  nullD     0.20 0.99  1.00
      ... 
As well as generating the actual data, the --simulate command writes the following to the LOG file:
     Reading simulation parameters from [ wgas.sim ]
     Writing SNP population frequencies to [ plink.simfreq ]
     Read 2 sets of SNPs, specifying 100100 SNPs in total
     Simulating 100 cases and 100 controls
     Assuming a disease prevalence of 0.01
The plink.simfreq file is described below. By default, 100 cases and 100 controls are generated. This can be changed with the command-line options
     --simulate-ncases 5000
and
     --simulate-ncontrols 5000
for example. Likewise, the default disease prevalence is assumed to be 0.01. This can be changed with
     --simulate-prevalence 0.05
for example.
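These options can be combined in a single run. The sketch below uses only the flags mentioned above; the output name sim2 is an arbitrary choice, and the guard around the call is an addition so that the snippet merely prints the command where the plink binary is not present in the current directory.

```shell
ncases=5000
ncontrols=5000
prevalence=0.05

# Assemble the full command line from the settings above.
cmd="./plink --simulate wgas.sim --simulate-ncases $ncases --simulate-ncontrols $ncontrols --simulate-prevalence $prevalence --make-bed --out sim2"
echo "$cmd"

# Run only if the binary is available here.
if [ -x ./plink ]; then
  $cmd
fi
```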

In the example above, the simulated data were saved directly to a binary fileset, but this need not be the case. Because --simulate acts as a data source just like --file or --bfile, any other analysis command could have been applied instead, e.g.
./plink --simulate wgas.sim --assoc

although the simulated data themselves would then be lost, of course.

Hint: This tool only generates individuals drawn from a homogeneous population, but several --simulate runs could be combined, using PLINK commands to merge the resulting files, to build more complex scenarios, e.g. representing population stratification, allelic heterogeneity, etc.

Resimulating a sample from the same population

The --simulate command also generates the file plink.simfreq. This records, for each SNP of the two sets, null and disease from the wgas.sim example, the actual allele frequency chosen for that particular SNP when simulating the data. For example,
     1 null_0   0.1885 0.1885       1
     1 null_1   0.424675 0.424675   1
     1 null_2   0.12797 0.12797     1
     1 null_3   0.544394 0.544394   1
     1 null_4   0.938641 0.938641   1
     ....
Conveniently, this information is output in the same format as the original simulation file. Note how the lower and upper bounds of the allele frequency range coincide, so that each row specifies a single value: the first row gives a range of 0.1885 to 0.1885, effectively forcing the allele frequency of the first SNP to be 0.1885. This is useful because, to generate a new independent dataset from the same population as the first, you simply use the plink.simfreq output file as the input to a new --simulate command, see below.
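One way to confirm that a .simfreq file pins every SNP to a single frequency is to check that each row has a count of 1 and identical lower and upper bounds. The sketch below is only an illustration: it copies the five example rows shown above into a file called example.simfreq (a made-up name) rather than reading real PLINK output.

```shell
# The five example rows from the text.
cat > example.simfreq <<'EOF'
1 null_0   0.1885 0.1885       1
1 null_1   0.424675 0.424675   1
1 null_2   0.12797 0.12797     1
1 null_3   0.544394 0.544394   1
1 null_4   0.938641 0.938641   1
EOF

# Count rows that describe exactly one SNP with a fixed frequency.
pinned=$(awk '$1 == 1 && $3 == $4 { n++ } END { print n + 0 }' example.simfreq)
rows=$(awk 'END { print NR }' example.simfreq)
echo "$pinned of $rows rows fix the allele frequency to a single value"
```

Because each row describes a set of one SNP, the single-SNP naming exception applies and the names (null_0, null_1, ...) carry over unchanged to the new sample.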

Putting this together, one might imagine setting up a simple screen/replicate simulation design: first we generate the original WGAS screening data
./plink --simulate wgas.sim --make-bed --out screen

run our association test
./plink --bfile screen --assoc

and extract a list of significant SNPs (here using the Unix gawk command to filter on column 9, the p-value)
gawk ' NR>1 && $9 < 1e-3 { print $2 } ' plink.assoc > positives

and then generate and test these same SNPs in an independent sample
./plink --simulate screen.simfreq --extract positives --assoc --out replication

etc. By labeling true disease SNPs and null SNPs sensibly as above, you can tell how many true positives and false positives appear at the screening and the replication stages, e.g. using Unix bash shell scripting to summarise results:
   # Screening stage: count SNPs with p-value (column 9) below t
   t=1e-3
   s0=$(grep -F null plink.assoc | gawk -v t=$t ' $9 < t ' | wc -l)
   s1=$(grep -F disease plink.assoc | gawk -v t=$t ' $9 < t ' | wc -l)
   echo "Detected $s1 true positives and $s0 false positives in screening"

   # Replication stage: same counts, with a more liberal threshold
   t=1e-2
   s0=$(grep -F null replication.assoc | gawk -v t=$t ' $9 < t ' | wc -l)
   s1=$(grep -F disease replication.assoc | gawk -v t=$t ' $9 < t ' | wc -l)
   echo "Of these, $s1 true positives and $s0 false positives replicate"
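The same tallies can also be produced in a single awk pass. The sketch below runs on a tiny mock file, mock.assoc, whose rows are invented purely to exercise the filter (they are not PLINK output, and the header names are illustrative); the only assumptions carried over from the text are that the SNP name is in column 2 and the p-value in column 9.

```shell
# Mock association results: made-up values, header plus four SNPs.
cat > mock.assoc <<'EOF'
CHR SNP BP A1 F_A F_U A2 CHISQ P OR
1 null_0 1 D 0.20 0.19 d 0.12 0.73 1.06
1 null_1 2 D 0.45 0.30 d 11.5 0.0007 1.91
1 disease_0 3 D 0.40 0.25 d 12.9 0.0003 2.00
1 disease_1 4 D 0.31 0.27 d 0.95 0.33 1.21
EOF

t=1e-3
summary=$(awk -v t="$t" '
  NR > 1 && $9 < t {
    if ($2 ~ /^disease/) tp++     # labelled as a true disease SNP
    else if ($2 ~ /^null/) fp++   # labelled as a null SNP
  }
  END { printf "Detected %d true positives and %d false positives", tp, fp }
' mock.assoc)
echo "$summary"
```

Matching on the start of the SNP name, rather than anywhere on the line, avoids miscounting if a label happens to appear in another column.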
 
This document last modified Wednesday, 11-Jun-2008 18:14:43 EDT