Multimarker haplotype tests
All tests described above are based on single SNP tests. It is also
possible to impute haplotypes based on multimarker predictors using
the standard E-M algorithm and to perform simple tests based on the
distribution of probabilistically-inferred set of haplotypes for each
individual.
As well as the autosomes, X and haploid chromosomes should be
appropriately handled. Phasing can either be based on a sample of
unrelated individuals, or certain kinds of family data. First, all
founders are phased using the E-M algorithm; then all descendants of
these founders are phased given the set of possible parental phases
and assuming random-mating. Currently it is not possible to phase
sibships without parents. The current implementation of the phasing
and haplotype testing algorithm is designed to focus on relatively small
regions of the genome, rather than to phase whole chromosomes at once.
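As a rough illustration of the E-M step for unrelated individuals, the following Python sketch estimates two-SNP haplotype frequencies from unphased genotypes. It is not PLINK's implementation: the function name em_two_snp, the 0/1/2 allele-count genotype coding and the fixed iteration count are all assumptions made for this example, and missing genotypes, X/haploid chromosomes and family data are ignored.

```python
from itertools import product

def em_two_snp(genotypes, n_iter=50):
    """Estimate 2-SNP haplotype frequencies by E-M (illustrative only).

    genotypes: list of (g1, g2) pairs, each value the count (0, 1 or 2)
    of allele '1' at SNP 1 and SNP 2. Haplotypes are tuples such as (0, 1).
    """
    if not genotypes:
        raise ValueError("need at least one genotype")
    haps = list(product((0, 1), repeat=2))
    freq = {h: 0.25 for h in haps}            # start from uniform frequencies
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            # E-step: all ordered haplotype pairs consistent with genotype g
            pairs = [(a, b) for a in haps for b in haps
                     if (a[0] + b[0], a[1] + b[1]) == tuple(g)]
            weights = [freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            if total == 0:
                continue
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total        # posterior-weighted counts
                counts[b] += w / total
        # M-step: renormalise expected counts into frequencies
        n = sum(counts.values())
        freq = {h: c / n for h, c in counts.items()}
    return freq
```

Only double heterozygotes, i.e. genotype (1, 1), are ambiguous here; every other genotype has a single consistent pair of haplotypes, so the E-step simply adds whole counts for them.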
HINT! Another approach to haplotype-testing can be
found under the page describing proxy
association. This set of methods essentially just provides a
different interface to the exact same E-M phasing and
haplotype-testing algorithms, one that is centered around a specific
reference SNP.
Specification of haplotypes to be estimated
Haplotype testing in PLINK requires that the user supply a file
listing the haplotypes to be tested. (Some precomputed lists are
given below, which might be useful in some
circumstances.) The formats of these files are described below. An
alternative is to specify a simple, sliding window of fixed haplotype
size (also described below).
The command
plink --file mydata --hap myfile.hlist
will read the file myfile.hlist, each row of which
is expected to have one of the three following formats:
1) Particular allele specified
The first format specifies a particular haplotype at a given locus. Two
example rows of this format are:
rs1001 5 0 201 1 2 TC snp1 snp2
rs1002 5 0 202 A C TTA snp1 snp3 snp4
...
The columns represent:
Col 1 : Imputed SNP name
Col 2 : Imputed SNP chromosome
Col 3 : Imputed SNP genetic distance (default: Morgan coding)
Col 4 : Imputed SNP physical position (bp units)
Col 5 : Imputed SNP allele 1 name
Col 6 : Imputed SNP allele 2 name
Col 7 : Tag SNP allele/haplotype that equals imputed SNP allele 1
Col 8+ : Tag SNP(s) [in same order as haplotype in Col 7]
Here we have explicitly specified the TC and TTA
haplotypes. For example, in the first case, all four possible haplotypes
of snp1 and snp2 (TT, CT, CC and TC) may be seen in the sample; this
row would select only the TC haplotype to be imputed, or to be
the focus of haplotype analysis. The imputed SNP, rs1001,
therefore has the following alleles:
TC/TC 1/1
TC/* 1/2
*/* 2/2
and will be positioned on chromosome 5, at base-position 201. Haplotypes
other than TC will be coded 2.
The imputed SNP details (alleles, etc) will only be used if the
--hap-impute option has been requested. For --hap-assoc
and --hap-tdt options (which consider all possible phases rather
than just imputing the most likely) these are not considered (but they are
still required in this input file).
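To make the column layout concrete, here is a small Python sketch that splits a row of this first format into named fields and recodes a pair of phased haplotypes into the imputed SNP's genotype, following the TC/TC, TC/* and */* table above. The function names and dictionary keys are hypothetical and chosen only for this illustration; PLINK does its own parsing internally.

```python
def parse_hlist_allele_row(line):
    """Split a 'particular allele' row of a haplotype list into fields."""
    f = line.split()
    if len(f) < 8:
        raise ValueError("expected at least 8 whitespace-separated columns")
    row = {
        "name": f[0],          # Col 1: imputed SNP name
        "chrom": f[1],         # Col 2: chromosome
        "cm": float(f[2]),     # Col 3: genetic distance
        "bp": int(f[3]),       # Col 4: physical position
        "allele1": f[4],       # Col 5: allele 1 name
        "allele2": f[5],       # Col 6: allele 2 name
        "haplotype": f[6],     # Col 7: tag haplotype equal to allele 1
        "tags": f[7:],         # Col 8+: tag SNPs, same order as Col 7
    }
    if len(row["haplotype"]) != len(row["tags"]):
        raise ValueError("haplotype length must match number of tag SNPs")
    return row

def recode_diplotype(hap_a, hap_b, row):
    """Return the imputed genotype for one phased pair of tag haplotypes."""
    target = row["haplotype"]
    n = (hap_a == target) + (hap_b == target)   # copies of the target: 0, 1 or 2
    a1, a2 = row["allele1"], row["allele2"]
    return (f"{a2}/{a2}", f"{a1}/{a2}", f"{a1}/{a1}")[n]

row = parse_hlist_allele_row("rs1001 5 0 201 1 2 TC snp1 snp2")
print(recode_diplotype("TC", "CT", row))
```

The length check reflects the requirement that the tag SNPs in Col 8+ appear in the same order as the alleles of the haplotype in Col 7.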
2) 'Wildcard' specification
Alternatively, all haplotypes at a given locus above the --maf
threshold can be automatically estimated by entering a line in
myfile.hlist as
follows:
* snp1 snp2 snp3
* snp1 snp2
i.e. where the first character is an asterisk *, which would,
taking just the first line for example, create all 3-SNP haplotypes for
the SNPs labelled in the MAP file as snp1, snp2 and
snp3, above
the minor allele frequency threshold. If the haplotypes were, for example,
AAC, AGG and TGG, then the following names
would be automatically assigned:
H1_AAC_
H1_AGG_
H1_TGG_
Haplotypes based on subsequent lines in the file would be labelled
H2_*_, H3_*_, etc. In this case, all two-SNP haplotypes
for snp1 and snp2 would start H2_. The
chromosome and position flags for the new haplotypes are set to equal the
first SNP of the set.
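The naming rule can be sketched in a few lines of Python. The function name, the example frequencies and the use of ">=" for the threshold comparison are assumptions made for illustration; the text above only states that haplotypes above the --maf threshold are kept.

```python
def name_wildcard_haplotypes(row_number, hap_freqs, maf=0.01):
    """Build H<row>_<haplotype>_ labels for haplotypes passing the threshold.

    row_number: 1-based position of the wildcard row in the haplotype list.
    hap_freqs:  dict mapping haplotype string -> estimated frequency.
    """
    return [f"H{row_number}_{hap}_"
            for hap, freq in hap_freqs.items()
            if freq >= maf]            # assumed comparison; see lead-in

# Hypothetical frequencies for the three-SNP example in the text
freqs = {"AAC": 0.50, "AGG": 0.30, "TGG": 0.195, "TAC": 0.005}
print(name_wildcard_haplotypes(1, freqs))
```

With these made-up frequencies the rare TAC haplotype falls below the threshold, so only the three labels listed above are produced.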
3) 'Named wildcard' specification
This format is identical to the previous wildcard specification,
except a name can be given to the haplotype. This uses ** instead of
* to start a row; the second entry is then interpreted as the name
of the haplotype locus rather than the first SNP. For example:
** BLOCK1 snp1 snp2 snp3
** BLOCK2 snp6 snp7
The only difference is that BLOCK1 and BLOCK2 names will be used
in the output instead of H1 and H2 being assigned automatically.
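Because the three row formats are distinguished only by their first field, a reader's own pre-processing script could tell them apart as in the Python sketch below. The function name and the returned dictionary layout are hypothetical; this is a convenience for checking a haplotype list before running PLINK, not a description of PLINK's parser.

```python
def classify_hlist_row(line):
    """Classify one haplotype-list row as allele, wildcard or named wildcard."""
    fields = line.split()
    if not fields:
        raise ValueError("empty row")
    if fields[0] == "**":
        if len(fields) < 3:
            raise ValueError("named wildcard needs a name and at least one SNP")
        # Second entry is the locus name, the rest are SNPs
        return {"kind": "named_wildcard", "name": fields[1], "snps": fields[2:]}
    if fields[0] == "*":
        if len(fields) < 2:
            raise ValueError("wildcard needs at least one SNP")
        # No name given: PLINK assigns H1, H2, ... automatically
        return {"kind": "wildcard", "name": None, "snps": fields[1:]}
    if len(fields) < 8:
        raise ValueError("allele row needs at least 8 columns")
    # Particular-allele format: tag SNPs start at column 8
    return {"kind": "allele", "name": fields[0], "snps": fields[7:]}

for text in ("* snp1 snp2 snp3", "** BLOCK1 snp1 snp2 snp3",
             "rs1001 5 0 201 1 2 TC snp1 snp2"):
    print(classify_hlist_row(text))
```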
4) Sliding window specification
Instead of specifying a haplotype file with the --hap option,
you can use the --hap-window option to specify all haplotypes in
sliding windows of a fixed number of SNPs (shifting 1 SNP at a time).
For example, the command
plink --bfile mydata --hap-window 3 --hap-assoc
forms all 3-SNP haplotypes across the entire dataset
(respecting chromosome boundaries, however). In this case
the windows will be automatically named WIN1, WIN2, etc.
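The windowing logic can be pictured with the following Python sketch, which slides a fixed-size window one SNP at a time over a marker list and drops any window that would span two chromosomes. The function name and the input layout are assumptions, as is the choice to number only the retained windows consecutively; PLINK's exact numbering may differ.

```python
def sliding_windows(markers, size):
    """Return (name, [snps]) for each fixed-size window within a chromosome.

    markers: list of (chromosome, snp_name) tuples in map order.
    size:    number of SNPs per window.
    """
    windows = []
    for start in range(len(markers) - size + 1):
        chunk = markers[start:start + size]
        # Respect chromosome boundaries: all SNPs must share one chromosome
        if len({chrom for chrom, _ in chunk}) == 1:
            name = f"WIN{len(windows) + 1}"
            windows.append((name, [snp for _, snp in chunk]))
    return windows

# Hypothetical marker list spanning two chromosomes
markers = [("1", "snp1"), ("1", "snp2"), ("1", "snp3"), ("1", "snp4"),
           ("2", "snp5"), ("2", "snp6"), ("2", "snp7")]
for name, snps in sliding_windows(markers, 3):
    print(name, snps)
```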
Precomputed lists of multimarker tests
Below are links to some PLINK-formatted lists of multimarker tests
selected for Affymetrix 500K and Illumina whole genome products, based
on consideration of the CEU Phase 2 HapMap (at r-squared=0.8
threshold). One should download the appropriate file and run with
the --hap option (after ensuring that any strand issues have
been resolved). These files were generated by Itsik Pe'er and others,
as described in this manuscript:
Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D
& Daly MJ (2006) Evaluating and improving power in whole-genome
association studies using fixed marker sets. Nat Genet, 38(6): 663-7.
These tables list all tags for every common HapMap SNP, at the given
r-squared threshold. The same haplotype may therefore appear multiple
times (i.e. if it tags more than 1 SNP). The haplotypes are specified
in terms of the + (positive) strand relative to the HapMap. You might
need to reformat your data (using the
--flip command, for instance) before you can use these files.
Note These tables obviously assume that all tags are
present in the final, post-quality-control dataset: i.e. if certain
SNPs have been removed, it will be better to reselect the
predictors -- that is, these lists should really only be used as a
first pass, for convenience.
Estimating haplotype frequencies
To obtain the haplotype frequencies for all haplotypes in each window,
use the option:
plink --file mydata --hap myfile.hlist --hap-freq
which will generate the file
plink.freq.hap
which contains the fields (no header)
LOCUS Haplotype locus / window name
HAPLOTYPE Haplotype identifier
F Frequency in sample (founders)
Testing for haplotype-based case/control and quantitative trait association
In a population-based sample of unrelated individuals, case/control and quantitative
traits can be analysed for haplotype associations, using the option, for example,
plink --file mydata --hap myfile.hlist --hap-assoc
which will generate haplotype-specific tests (1df) for both disease and
quantitative traits; for disease traits only, an omnibus association statistic
will also be computed. This option generates the file
plink.assoc.hap
which contains the following fields:
LOCUS Haplotype locus / window name
HAPLOTYPE Haplotype identifier / "OMNIBUS"
F_A Frequency in cases
F_U Frequency in controls
CHISQ Test for association
DF Degrees of freedom
P Asymptotic p-value
SNPS SNPs forming the haplotype
or
plink.qassoc.hap
which contains the following fields:
LOCUS Haplotype locus / window name
HAPLOTYPE Haplotype identifier
NANAL Number of individuals in analysis
BETA Regression coefficient
RSQ Proportion variance explained
STAT Test statistic (T)
P Asymptotic p-value
SNPS SNPs forming the haplotype
In all cases, the tests are based on the expected number of haplotypes
each individual has (which might be fractional). The case/control
omnibus test is a H-1 degree of freedom test, if there are H
haplotypes.
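The idea of a fractional, expected haplotype count can be illustrated in Python. The sketch below sums posterior-weighted copies of a haplotype over an individual's possible phases, then applies an ordinary Pearson chi-square to a 2x2 table of "this haplotype versus all others" in cases and controls. The function names are hypothetical, and the Pearson formula is a stand-in chosen for illustration; it is not claimed to be the exact statistic PLINK computes.

```python
def expected_count(phases, hap):
    """Expected copies of `hap` carried by one individual.

    phases: list of (hap1, hap2, posterior) tuples for that individual.
    """
    return sum(p * ((h1 == hap) + (h2 == hap)) for h1, h2, p in phases)

def hap_chisq_1df(case_count, case_total, ctrl_count, ctrl_total):
    """Pearson chi-square (1 df) for one haplotype versus all others.

    Counts may be fractional because they are sums of expected counts.
    """
    a, b = case_count, case_total - case_count
    c, d = ctrl_count, ctrl_total - ctrl_count
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return float("nan")
    return n * (a * d - b * c) ** 2 / denom

# One individual with two possible phases (hypothetical posteriors)
phases = [("AC", "GT", 0.6), ("AT", "GC", 0.4)]
print(expected_count(phases, "AC"))
print(hap_chisq_1df(30.0, 100.0, 20.0, 100.0))
```

Here case_total and ctrl_total are total haplotype counts (two per diploid individual), so the haplotype-specific counts can never exceed them.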
Haplotype-based TDT association test
If family data (parent-offspring trios) are being analysed, use the option
plink --file mydata --hap myfile.hlist --hap-tdt
to test for TDT haplotype-specific association. This
option generates the file
plink.tdt.hap
which contains the following fields:
LOCUS Haplotype locus / window name
HAPLOTYPE Haplotype identifier / "OMNIBUS"
T Number of transmitted haplotypes
U Number of untransmitted haplotypes
CHISQ Test for association
P Asymptotic p-value
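For orientation, the classical TDT statistic compares transmitted and untransmitted counts as (T - U)^2 / (T + U), referred to a chi-square distribution with 1 degree of freedom. The Python sketch below computes that textbook form and its asymptotic p-value; the function names are hypothetical, and it is an assumption that the CHISQ column for a single haplotype corresponds to this formula.

```python
import math

def tdt_chisq(t, u):
    """Textbook TDT statistic from transmitted (t) and untransmitted (u) counts."""
    if t + u == 0:
        return float("nan")
    return (t - u) ** 2 / (t + u)

def chisq_1df_pvalue(x):
    """Upper-tail p-value of a 1-df chi-square, via the normal distribution."""
    return math.erfc(math.sqrt(x / 2.0))

stat = tdt_chisq(30, 10)
print(stat, chisq_1df_pvalue(stat))
```

The p-value uses the identity that a 1-df chi-square variable is the square of a standard normal variable, so no statistics library is needed.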
Imputing multimarker haplotypes
The --hap-impute option creates two new files. For example, the command
plink --file mydata --hap myfile.hlist --hap-impute
will generate the files:
plink.impute.ped
plink.impute.map
based on the most likely E-M phase reconstructed haplotypes. One could then
simply treat the most likely haplotype assignments as SNPs and use all the
standard analytic options of PLINK, e.g. --assoc.
Warning This represents a quick and dirty
approach to haplotype testing. Depending on how accurately
the haplotypes have been imputed (i.e. the range of maximum posterior
probabilities per individual) some bias will be introduced into
subsequent tests based on these 'SNPs'. Typically, as long as cases
and controls are phased together, as they are here, this bias is
likely to be quite small and so should not substantively impact
results (unpublished simulation results, SMP). Furthermore, exact
methods can be used to refine the association for the putative hits
discovered by this approach.
NOTE Future versions will allow for a binary PED
file to be created from the --hap-impute command. You do
not need to specify --recode when using
--hap-impute.
Tabulating individuals' haplotype phases
To obtain a summary of all possible haplotype phases and the corresponding
posterior probabilities (i.e. given genotype data), use the command:
plink --file mydata --hap myfile.hlist --hap-phase
which will generate the file
plink.phase-*
where * is the name of the 'window' (i.e. the row of the
haplotype list file). That is, if the haplotype list contains
multiple rows, then multiple phase files will be generated.
These files contain the fields, where each row is one possible
haplotype phase for one individual:
FID Family ID
IID Individual ID
PH Phase number for that individual (0-based)
HAP1 First haplotype, H1
HAP2 Second haplotype, H2
POSTPROB P(H1,H2 | G )
BEST 1 if most likely phase for that individual
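A phase file of this kind is straightforward to post-process. The Python sketch below pulls out the most likely phase for each individual, together with its posterior probability, which is one way to gauge how confidently haplotypes were assigned. It assumes the file is whitespace-delimited with a header row carrying the field names listed above; that layout, the function name and the sample rows are all assumptions for illustration.

```python
def best_phases(lines):
    """Map (FID, IID) -> (HAP1, HAP2, POSTPROB) for rows flagged BEST = 1.

    lines: iterable of text lines, the first being a header row
           (assumed layout; check against your own plink.phase-* files).
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    best = {}
    for fields in body:
        rec = dict(zip(header, fields))
        if rec["BEST"] == "1":
            key = (rec["FID"], rec["IID"])
            best[key] = (rec["HAP1"], rec["HAP2"], float(rec["POSTPROB"]))
    return best

# Hypothetical file contents for two individuals
sample = """FID IID PH HAP1 HAP2 POSTPROB BEST
F1 I1 0 AC GT 0.9 1
F1 I1 1 AT GC 0.1 0
F2 I1 0 AC AC 1.0 1
"""
for key, value in best_phases(sample.splitlines()).items():
    print(key, value)
```

To process a real file, the same function could be given an open file handle, for example best_phases(open(path)), where path points at one of the generated phase files.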