SNP simulation routine
PLINK provides an interface to a very simplistic SNP
simulation routine, designed to generate large SNP datasets for
population-based case/control studies. This function is largely
intended as a convenience for generating data to
prototype new methods, compare the power of different approaches,
etc., rather than for producing realistic whole-genome data.
Critically, all simulated SNPs are unlinked and in linkage
equilibrium.
Basic usage
The basic command to simulate a SNP data file is the --simulate option,
./plink --simulate wgas.sim --make-bed --out sim1
which takes as a parameter the name of a file (here wgas.sim) that describes the to-be-simulated data.
The simulation file wgas.sim is as follows:
100000 null 0.00 1.00 1.00
100 disease 0.00 1.00 2.00
These files can have one or more rows, where each row has exactly five fields, as follows:
Number of SNPs in this set
Label of this set of SNPs
Lower bound of the allele frequency range
Upper bound of the allele frequency range
Odds ratio for disease
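A quick check before running PLINK can catch malformed rows. The sketch below writes the example file and uses awk to confirm that each row has five fields. Only the five-field rule comes from the description above; the range and odds-ratio checks are assumed sanity conditions, and check_simfile is a hypothetical helper name, not part of PLINK.

```shell
# Write the example simulation file from the text above.
cat > wgas.sim <<'EOF'
100000 null 0.00 1.00 1.00
100 disease 0.00 1.00 2.00
EOF

# check_simfile (hypothetical helper): report rows that do not look valid.
# The five-field rule is documented; the other two checks are assumptions.
check_simfile() {
  awk '
    NF != 5                     { print "row " NR ": expected 5 fields, got " NF; bad = 1; next }
    $3 < 0 || $4 > 1 || $3 > $4 { print "row " NR ": bad allele frequency range"; bad = 1 }
    $5 <= 0                     { print "row " NR ": odds ratio must be positive"; bad = 1 }
    END                         { exit bad + 0 }
  ' "$1"
}

check_simfile wgas.sim && echo "wgas.sim looks well-formed"
```

The function returns a non-zero status when any row fails, so it can guard a later ./plink --simulate call in a script.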
Given this file, PLINK would generate 100,000 SNPs with no association with disease. Each SNP would have
its own population allele frequency, drawn uniformly between, in this case, 0.00 and 1.00. In addition,
100 extra SNPs would be simulated that are associated with disease (population odds ratio of 2.00).
The name of each SNP follows from the label (which must be unique),
with a number appended, e.g.
null_0
null_1
null_2
...
disease_99
An exception is that if a set only contains a single SNP, nothing is appended to the label. This is useful in generating multiple
samples from the same population, as described below.
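The naming rule can be sketched with a few lines of awk, which may help when preparing lists of expected SNP names in advance. This is an illustration of the rule as described here, not PLINK's own code; list_snp_names and toy.sim are hypothetical names, and the toy file uses small counts so the output stays short.

```shell
# list_snp_names (hypothetical helper): print the SNP names implied by a
# simulation file, following the naming rule described above.
list_snp_names() {
  awk '{
    if ($1 == 1) print $2                          # single-SNP set: label only
    else for (i = 0; i < $1; i++) print $2 "_" i   # otherwise label_0, label_1, ...
  }' "$1"
}

# Toy input: one set of three SNPs and one single-SNP set.
printf '3 null 0.00 1.00 1.00\n1 causal 0.20 0.20 2.00\n' > toy.sim

list_snp_names toy.sim
# prints, one per line: null_0 null_1 null_2 causal
```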
Obviously, a uniform allele frequency range is not realistic: one could instead specify a series of bins
to enrich for rarer SNPs, if so desired, to build a more realistic spectrum of allele frequencies (the example
below is purely illustrative and is not itself meant to be realistic).
20000 nullA 0.00 0.05 1.00
10000 nullB 0.05 0.10 1.00
5000 nullC 0.10 0.20 1.00
10000 nullD 0.20 0.99 1.00
...
As well as generating the actual data, the --simulate command writes the following to the LOG file:
Reading simulation parameters from [ wgas.sim ]
Writing SNP population frequencies to [ plink.simfreq ]
Read 2 sets of SNPs, specifying 100100 SNPs in total
Simulating 100 cases and 100 controls
Assuming a disease prevalence of 0.01
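The set and SNP totals reported in the log can be cross-checked against the simulation file itself. The following is a minimal sketch using awk on the wgas.sim example; the output wording is this sketch's own, not PLINK's.

```shell
# Recreate the example simulation file.
cat > wgas.sim <<'EOF'
100000 null 0.00 1.00 1.00
100 disease 0.00 1.00 2.00
EOF

# Rows are SNP sets; column 1 is the number of SNPs in each set.
summary=$(awk '{ n += $1 } END { printf "%d sets, %d SNPs in total", NR, n }' wgas.sim)
echo "$summary"
# prints: 2 sets, 100100 SNPs in total
```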
The plink.simfreq file is described below. By default, 100 cases and 100 controls are generated. This
can be changed with the command-line options
--simulate-ncases 5000
and
--simulate-ncontrols 5000
for example. Likewise, the default disease prevalence is assumed to be 0.01. This can be changed with
--simulate-prevalence 0.05
for example.
In the example above, the simulated data were saved directly to a binary fileset, but this need not be the case.
Any other analysis command could have been applied instead, since --simulate acts just like --file or --bfile:
./plink --simulate wgas.sim --assoc
although the simulated data themselves would then be lost, of course.
Hint: This tool only generates individuals drawn from a homogeneous population, but one could run
--simulate several times and then use PLINK commands to merge the resulting files to specify more complex scenarios, e.g. representing
population stratification, allelic heterogeneity, etc.
Resimulating a sample from the same population
The --simulate command also generates the file plink.simfreq (named after the --out root when one is given, e.g. screen.simfreq in the example further down).
For each SNP in the two sets of the wgas.sim example (null and disease), this file records the actual allele frequency chosen for that SNP when simulating the
data. For example,
1 null_0 0.1885 0.1885 1
1 null_1 0.424675 0.424675 1
1 null_2 0.12797 0.12797 1
1 null_3 0.544394 0.544394 1
1 null_4 0.938641 0.938641 1
....
Conveniently, this information is output in the same format as the original simulation file: note how the lower and upper bounds of the allele frequency
range coincide to specify a particular value, e.g. the first row shows a range of 0.1885 to 0.1885, effectively forcing the allele
frequency for the first SNP to be 0.1885. This is useful because, to generate a new independent dataset from the same population as the
first, you would simply use the
plink.simfreq output file as input for a new --simulate command (see the screen/replicate design below).
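To confirm that a .simfreq file really pins every SNP to one frequency, the rows can be checked for a count of 1 and identical lower and upper bounds. The sketch below does this for the five example rows shown above, saved under the hypothetical name example.simfreq so that no real output file is overwritten.

```shell
# The example rows from above, saved under a hypothetical file name.
cat > example.simfreq <<'EOF'
1 null_0 0.1885 0.1885 1
1 null_1 0.424675 0.424675 1
1 null_2 0.12797 0.12797 1
1 null_3 0.544394 0.544394 1
1 null_4 0.938641 0.938641 1
EOF

# Count rows that describe exactly one SNP with lower bound == upper bound.
fixed=$(awk '$1 == 1 && $3 == $4 { n++ } END { print n + 0 }' example.simfreq)
total=$(awk 'END { print NR }' example.simfreq)
echo "$fixed of $total rows fix a single allele frequency"
# prints: 5 of 5 rows fix a single allele frequency
```

Because each row is a single-SNP set, the label is used unchanged as the SNP name, so a resimulated sample keeps the same SNP names as the first.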
Putting this together, one might imagine setting up a simple screen/replicate simulation design: first we generate the original
WGAS screening data
./plink --simulate wgas.sim --make-bed --out screen
run our association test
./plink --bfile screen --assoc
and extract a list of significant SNPs (here using the Unix gawk command, to filter on the p-value column, 9)
gawk ' NR>1 && $9 < 1e-3 { print $2 } ' plink.assoc > positives
and then generate and test these same SNPs in an independent sample
./plink --simulate screen.simfreq --extract positives --assoc --out replication
etc. By labeling true disease SNPs and null SNPs sensibly as above, you can tell how many true positives and false
positives appear at the screening and the replication stages, e.g. using Unix bash shell scripting
to summarise results:
# Screening stage: count null and disease SNPs with p-value (column 9) below t
t=1e-3
s0=$(grep -F null plink.assoc | gawk ' $9 < t ' t=$t | wc -l)
s1=$(grep -F disease plink.assoc | gawk ' $9 < t ' t=$t | wc -l)
echo "Detected $s1 true positives and $s0 false positives in screening"
# Replication stage: same counts, using a more lenient threshold
t=1e-2
s0=$(grep -F null replication.assoc | gawk ' $9 < t ' t=$t | wc -l)
s1=$(grep -F disease replication.assoc | gawk ' $9 < t ' t=$t | wc -l)
echo "Of these, $s1 true positives and $s0 false positives replicate"
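The counts above match the label anywhere on a line. A slightly stricter variant, sketched below, matches the label against the SNP name in column 2 only and skips missing p-values. count_hits and toy.assoc are hypothetical names, and the toy file is made-up illustration data (not PLINK output): only columns 2 and 9 are filled in, and the other fields are placeholders.

```shell
# count_hits (hypothetical helper): count SNPs whose name is LABEL or starts
# with LABEL_ and whose p-value (column 9) is below THRESHOLD.
count_hits() {  # usage: count_hits FILE LABEL THRESHOLD
  awk -v label="$2" -v t="$3" '
    NR > 1 && ($2 == label || index($2, label "_") == 1) &&
    $9 != "NA" && $9 + 0 < t + 0 { n++ }
    END { print n + 0 }
  ' "$1"
}

# Made-up toy results: SNP name in column 2, p-value in column 9,
# "." as a placeholder everywhere else.
cat > toy.assoc <<'EOF'
. SNP . . . . . . P .
. null_0 . . . . . . 0.5 .
. null_1 . . . . . . 0.0004 .
. disease_0 . . . . . . 0.00001 .
. disease_1 . . . . . . NA .
EOF

tp=$(count_hits toy.assoc disease 1e-3)
fp=$(count_hits toy.assoc null 1e-3)
echo "Detected $tp true positives and $fp false positives"
# prints: Detected 1 true positives and 1 false positives
```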