Example 1 : A quantitative trait measured in 300 individuals; 7 SNPs within a gene
This example is intended to represent a candidate gene study -- a
small number of SNPs, 7, has been typed with a small region, 20kb, of
moderate to strong LD. The approach here is to combine all 7 SNPs in
initial haplotype analyses. The examples also illustrate single-marker
analyses, and tests of specific combinations of subsets of SNPs.
You can download the PED file (candidate.ped), the DAT file (candidate.dat) and the MAP file (candidate.map).
So the results indicate a possible haplotypic association (a p-value
around 0.01). The output gives the number of individuals, with and
without parents, and the number of these that are informative. The
estimated haplotypes are then listed, along with a frequency, and four
regression coefficients: under the alternate and null models; for each
model, the Between and Within components are listed. Because this
sample contains only unrelated individuals, the default model is to
equate between and within components -- these numbers are therefore
identical. The numbers in [square brackets] indicate which parameters
have been constrained. The first parameter is always constrained to
equal zero. The numbers at the bottom of the colums are minus twice
the log-likelihood under the alternate and null; the difference
between these gives the likelihood ratio test (LRT) which is
chi-square; the degrees of freedom and associated p-value is also
given.
For binary traits, the regression coefficients from the logistic
regression can be converted to odds ratios by taking the exponent. In
the case of quantitaive variables, they are simply linear regressions
coefficients.
For quantitaive traits, the proportion of variance accounted for by
the locus is given; the first number represents the variance accounted
for under the alternate, the second number, under the null.
The proportion of haplotypes covered is high in this case. A value of
0.98 indicates that 2% of all haplotypes have been excluded (i.e. they
are too rare). By default, haplotypes with a frequency of less than 1%
are excluded from analysis; we might want to consider only haplotypes
with a frequency of 2% or more, given the moderate sample size. This
is done by using the --at option (absolute threshold). This
should decrease the number of degrees of freedom in a omnibus test of
haplotypic association; also, if rare haplotypes are only seen in a
handful of individuals, the parameter estimates associated with them
may be unreliable.
Also note that in the example below we have fixed the mean and variance of
the trait to population values. This can be useful in selected
samples, to increase power; even in unselected samples, fixing the
mean and variance for these nuissance parameters will speed up
estimation and make convergence at a global minimum more likely. In
this case, we know that the population mean and variance are 0 and 1
respectively:
So by excluding the rare haplotypes, the overall p-value is now more
significant.
To perform a single marker analysis, we can use the --alt
option, i.e. to specify which marker(s) are to be included in the
alternate model. Omitting the --alt implies that all SNPs are
to be included in the alternate model. Omitting --null
implies that no SNPs are to be included in the null model. In this
case, we are testing the second SNP:
whap --file candidate --mean 0 --var 1 --alt 2
WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000.
300 of 300 individuals are informative
Hap Freq Alt(B) Alt(W) Null(B) Null(W)
--- ----- ------ ------ ------- -------
1 0.700 0.000 0.000 [1] 0.000 0.000 [1]
2 0.300 -0.267 -0.267 [2] 0.000 0.000 [1]
--- ----- ------ -------
841.794 850.363
Locus % variance = 0.030 0.000
Proportion of haplotypes covered = 1.000
LRT = 8.569
df = 1
p = 0.00342
A convenient way to obtain all single marker results is by
performing a sliding window analysis with a window size of 1 SNP (the
--alt option now describes the window, rather than
the specific markers):
WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000.
Results of sliding window analysis
Global permutation tests
------------------------
P_MAX = 8.569 p = 1.0000000000
P_SUM = 25.990 p = 1.0000000000
Local permutation tests
-----------------------
>> SNP1 616 P_1= 0.013 p= 1.0000000000
>> SNP2 2717 P_2= 8.569 p= 1.0000000000
>> SNP5 10414 P_3= 7.839 p= 1.0000000000
>> SNP6 12981 P_4= 4.211 p= 1.0000000000
>> SNP7 15269 P_5= 0.580 p= 1.0000000000
>> SNP8 17526 P_6= 4.707 p= 1.0000000000
>> SNP9 19667 P_7= 0.071 p= 1.0000000000
The results are presented differently when in sliding window
mode. Also note that the p-values are based on permutation testing --
in this case, we did not specify any permutations, and so the p-values
will be all 1. The first column of numbers (starting 616)
gives the positions of the SNPs, as specified in the MAP file. The
second column of numbers give the test statistics (i.e. note that the
second value gives the same 8.569 value as in the single marker
analysis illustrated above).
Below we will show how to specify permutation tests, in which case
these p-values would be meaningful. We can obtain p-values and
parameter estimates for each SNP by using the --allres
command -- this will print the full results for each window
as well as the summary.
Note that the majority of the output is sent to STDOUT, which can be
redirected to a file: e.g.
whap --file example > results_file
This is how whap is intended to be used. Certain output is
directed to STDERR, so a progress report should still print to the
screen. This stream can be redirected also: for example, bash *nix
users can enter
whap --file example > results_file 2> /dev/null
WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000.
Window position : [1]
Hap Freq Alt(B) Alt(W) Null(B) Null(W)
--- ----- ------ ------ ------- -------
2 0.903 0.000 0.000 [1] 0.000 0.000 [1]
1 0.097 -0.016 -0.016 [2] 0.000 0.000 [1]
--- ----- ------ -------
850.350 850.363
Locus % variance = 0.000 0.000
Proportion of haplotypes covered = 1.000
LRT = 0.013
df = 1
p = 0.91
...
... etc
...
Looking at the single marker results, we may want to focus in on
certain SNPs to form haplotypes, rather than using all 7. Let's only
look at the 2nd, 3rd and 4th SNPs, to form 3-SNP haplotypes. Note the
use of commas (no spaces) to delimit the SNPs specified in the
alternate model:
To drop a marker from this 3-SNP haplotype set, e.g. to test the
effect of the 2nd SNP conditional on the haplotype background formed
by the inclusion of the 3rd and 4th.
So despite seeing that the second SNP had a significant single marker
result, this disappears when its effect is estimated conditional on
this haplotype background. This is consistent with the single SNP
effect being due to linkage disequilibrium with these other SNPs.
Finally, it is possible to manually set up parameter constraints. For
instance, say that a previous study found one specific haplotype to
be associated with this trait: we may then wish to test that haplotype
against all others in a 1 df test. This can be achieved using the
--constrain option:
Note how the command line set of constraints matches with the numbers
in [square brackets] in the output. The first coefficient, fixed to
zero, must always be specified as 1 under both the alternate and null.
To set up a more complex set of constraints, e.g. to group all
haplotypes with 12 at the 3rd and 4th SNPs, and test this
group against all other haplotypes: (It is up to the user to ensure
that the null model is nested in the alternate.)
Note that adding the command --hs will perform all
possible 1 degree of freedom haplotype-specific tests, i.e. in place
of the omnibus test. When used in conjunction with the
--window option, the effect of --hs is
slightly modified: only the most significant haplotype-specific
association is recorded for each window position: the significance of
this "largest specific haplotype" is tested empirically,
i.e. --perm is required, by comparing the largest observed
value to the largest value in each permuted dataset.
That's enought for this example. Next lets consider a different
scenario -- a sample of affected individuals, with measured parental
genotypes typed on a large number of markers.
Example 2 : A qualitative trait measured in 129 TDT trios; 103 SNPs across a large region
The sample conists of 129 affeced-offspring parent trios,
with 103 SNPs genotyped on 5q31. We shall perform a single marker
analysis of all 103 SNPs, and then an 8-marker sliding window
haplotype analysis. Download the PED file
chr5q31.ped, the DAT file chr5q31.dat
and MAP file chr5q31.map.
As highlighted in the conditional analysis section, affected-only TDT trio designs must be analysed with the
conditional framework, i.e. modelling genotype conditional on
trait. It is also necessary to fix the population prevalence, and to
specify that only within-family effects are to be estimated. Let's say
the population prevalence is 1%: these options are specified as
whap --file chr5q31 --alt 1 --cond --prev 0.01 --model w
which performs a TDT-type analysis of the first SNP in the sample. To
extend this to the entire region (and perform some region-wide
empirical significance testing) simply modify the command to
Two global empirical significance values are presented: P_MAX
and P_SUM. The first compares the maximum observed score in
the observed data against each maximum of the 103 scores in permuted
datasets. The proportion of times the permuted maximum exceeds the
observed maximum gives the significance value (actually the formula
(R+1)/(N+1) is used). A similar logic applies for the second global
significance value, except
it is based on the sum of the 103 scores, rather than the
maximum. Note that 0.0019960080 equals 1/501 -- it would be necessary
to perform more permutations to obtained more significant possible
p-values.
Note that although these maximum and summed p-values give an
indication of region wide significance, and so correct for the
multiple testing inherent in large genomic association scans, the
individual SNP p-values are not individually corrected for mulitple
testing.
Repeating the analysis, but now with an 8-marker sliding window, and
looking only at common haplotypes (those with a frequency above 5%) we
obtain the following results:
Again the global measures are significant. A good way to visualise
these results is to plot the base-position column against the
-log10(p-value).
Note that a value of -1 for the primary test indicates that no windows
were able to span these markers, i.e. given that windows skip gaps
greater than 100kb, and so the p-values are 'nan'.
Created by Shaun Purcell;
Last updated by Lori Thomas: March 2006