whap

Haplotype-based association analysis package

Shaun Purcell, Massachusetts General Hospital, Boston, MA, USA
Pak Sham, Hong Kong University, Hong Kong
                                               
Background
Download
Examples
Conditional test tutorial
Usage
Warnings
Future developments
Licence
Citation
Contact
 

Examples

Example 1 : A quantitative trait measured in 300 individuals; 7 SNPs within a gene


 
This example is intended to represent a candidate gene study -- a small number of SNPs, 7, has been typed with a small region, 20kb, of moderate to strong LD. The approach here is to combine all 7 SNPs in initial haplotype analyses. The examples also illustrate single-marker analyses, and tests of specific combinations of subsets of SNPs.

You can download the PED file (candidate.ped), the DAT file (candidate.dat) and the MAP file (candidate.map).


 
To run the basic analysis with default settings:
 

whap --file candidate
WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: 
292 of 300 individuals are informative
          
Hap       Freq    Alt(B)          Alt(W)          Null(B)         Null(W) 
---       -----   ------          ------          -------         ------- 
2122221   0.305   0.000           0.000   [1]     0.000           0.000   [1]     
2112121   0.165   -0.234          -0.234  [2]     0.000           0.000   [1]     
2221211   0.119   -0.435          -0.435  [3]     0.000           0.000   [1]     
2212222   0.112   -0.434          -0.434  [4]     0.000           0.000   [1]     
2122222   0.109   0.085           0.085   [5]     0.000           0.000   [1]     
1112121   0.097   -0.197          -0.197  [6]     0.000           0.000   [1]     
2222221   0.040   0.136           0.136   [7]     0.000           0.000   [1]     
2212221   0.028   -0.478          -0.478  [8]     0.000           0.000   [1]     
2112222   0.014   -0.212          -0.212  [9]     0.000           0.000   [1]     
2121211   0.012   -0.299          -0.299  [10]    0.000           0.000   [1]     
---       -----                   ------                          ------- 
                                  807.074                         828.496 

Locus % variance =		  0.076	                          0.000
Proportion of haplotypes covered = 0.980
LRT = 21.422
df  = 9
p   = 0.0109


 
So the results indicate a possible haplotypic association (a p-value around 0.01). The output gives the number of individuals, with and without parents, and the number of these that are informative. The estimated haplotypes are then listed, along with a frequency, and four regression coefficients: under the alternate and null models; for each model, the Between and Within components are listed. Because this sample contains only unrelated individuals, the default model is to equate between and within components -- these numbers are therefore identical. The numbers in [square brackets] indicate which parameters have been constrained. The first parameter is always constrained to equal zero. The numbers at the bottom of the colums are minus twice the log-likelihood under the alternate and null; the difference between these gives the likelihood ratio test (LRT) which is chi-square; the degrees of freedom and associated p-value is also given.


 
For binary traits, the regression coefficients from the logistic regression can be converted to odds ratios by taking the exponent. In the case of quantitaive variables, they are simply linear regressions coefficients. For quantitaive traits, the proportion of variance accounted for by the locus is given; the first number represents the variance accounted for under the alternate, the second number, under the null.


 
The proportion of haplotypes covered is high in this case. A value of 0.98 indicates that 2% of all haplotypes have been excluded (i.e. they are too rare). By default, haplotypes with a frequency of less than 1% are excluded from analysis; we might want to consider only haplotypes with a frequency of 2% or more, given the moderate sample size. This is done by using the --at option (absolute threshold). This should decrease the number of degrees of freedom in a omnibus test of haplotypic association; also, if rare haplotypes are only seen in a handful of individuals, the parameter estimates associated with them may be unreliable.


 
Also note that in the example below we have fixed the mean and variance of the trait to population values. This can be useful in selected samples, to increase power; even in unselected samples, fixing the mean and variance for these nuissance parameters will speed up estimation and make convergence at a global minimum more likely. In this case, we know that the population mean and variance are 0 and 1 respectively:


 
whap --file candidate --at 2 --mean 0 --var 1
WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000.
275 of 300 individuals are informative
          
Hap       Freq    Alt(B)          Alt(W)          Null(B)         Null(W) 
---       -----   ------          ------          -------         ------- 
2122221   0.313   0.000           0.000   [1]     0.000           0.000   [1]     
2112121   0.169   -0.249          -0.249  [2]     0.000           0.000   [1]     
2221211   0.122   -0.417          -0.417  [3]     0.000           0.000   [1]     
2212222   0.115   -0.419          -0.419  [4]     0.000           0.000   [1]     
2122222   0.112   0.044           0.044   [5]     0.000           0.000   [1]     
1112121   0.099   -0.213          -0.213  [6]     0.000           0.000   [1]     
2222221   0.041   0.115           0.115   [7]     0.000           0.000   [1]     
2212221   0.029   -0.662          -0.662  [8]     0.000           0.000   [1]     
---       -----                   ------                          ------- 
                                  766.078                         787.673 

Locus % variance =                0.080	                          0.000
Proportion of haplotypes covered = 0.955
LRT = 21.595
df  = 7
p   = 0.00298

So by excluding the rare haplotypes, the overall p-value is now more significant.


 
To perform a single marker analysis, we can use the --alt option, i.e. to specify which marker(s) are to be included in the alternate model. Omitting the --alt implies that all SNPs are to be included in the alternate model. Omitting --null implies that no SNPs are to be included in the null model. In this case, we are testing the second SNP:


 
whap --file candidate --mean 0 --var 1 --alt 2

WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000. 
300 of 300 individuals are informative
          
Hap Freq    Alt(B)          Alt(W)          Null(B)         Null(W) 
--- -----   ------          ------          -------         ------- 
1   0.700   0.000           0.000   [1]     0.000           0.000   [1]     
2   0.300   -0.267          -0.267  [2]     0.000           0.000   [1]     
--- -----                   ------                          ------- 
                            841.794                         850.363 

Locus % variance = 0.030	0.000
Proportion of haplotypes covered = 1.000
LRT = 8.569
df  = 1
p   = 0.00342


 
A convenient way to obtain all single marker results is by performing a sliding window analysis with a window size of 1 SNP (the --alt option now describes the window, rather than the specific markers):


 
whap --file candidate --mean 0 --var 1 --window --alt 1

WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu

300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000. 
          
Results of sliding window analysis

Global permutation tests
------------------------
P_MAX = 8.569	 p = 1.0000000000
P_SUM = 25.990	 p = 1.0000000000

Local permutation tests
-----------------------
>> SNP1 616   P_1= 0.013	 p= 1.0000000000         		
>> SNP2 2717  P_2= 8.569	 p= 1.0000000000         		
>> SNP5 10414 P_3= 7.839         p= 1.0000000000         		
>> SNP6 12981 P_4= 4.211         p= 1.0000000000         		
>> SNP7 15269 P_5= 0.580         p= 1.0000000000         		
>> SNP8 17526 P_6= 4.707         p= 1.0000000000         		
>> SNP9 19667 P_7= 0.071         p= 1.0000000000         		


 
The results are presented differently when in sliding window mode. Also note that the p-values are based on permutation testing -- in this case, we did not specify any permutations, and so the p-values will be all 1. The first column of numbers (starting 616) gives the positions of the SNPs, as specified in the MAP file. The second column of numbers give the test statistics (i.e. note that the second value gives the same 8.569 value as in the single marker analysis illustrated above).


 
Below we will show how to specify permutation tests, in which case these p-values would be meaningful. We can obtain p-values and parameter estimates for each SNP by using the --allres command -- this will print the full results for each window as well as the summary.


 
Note that the majority of the output is sent to STDOUT, which can be redirected to a file: e.g.
whap --file example > results_file

 
This is how whap is intended to be used. Certain output is directed to STDERR, so a progress report should still print to the screen. This stream can be redirected also: for example, bash *nix users can enter
whap --file example > results_file 2> /dev/null


 
whap --file candidate --mean 0 --var 1 --window --alt 1 --allres

WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu

300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000. 


Window position :  [1]

Hap Freq    Alt(B)          Alt(W)          Null(B)         Null(W) 
--- -----   ------          ------          -------         ------- 
2   0.903   0.000           0.000   [1]     0.000           0.000   [1]     
1   0.097   -0.016          -0.016  [2]     0.000           0.000   [1]     
--- -----                   ------                          ------- 
                            850.350                         850.363 

Locus % variance = 0.000	0.000
Proportion of haplotypes covered = 1.000
LRT = 0.013
df  = 1
p   = 0.91


...
... etc
...


 
Looking at the single marker results, we may want to focus in on certain SNPs to form haplotypes, rather than using all 7. Let's only look at the 2nd, 3rd and 4th SNPs, to form 3-SNP haplotypes. Note the use of commas (no spaces) to delimit the SNPs specified in the alternate model:


 
whap --file candidate --mean 0 --var 1 --alt 2,3,4 --at 2

WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000. 
293 of 300 individuals are informative
          
Hap   Freq    Alt(B)          Alt(W)          Null(B)         Null(W) 
---   -----   ------          ------          -------         ------- 
122   0.417   0.000           0.000   [1]     0.000           0.000   [1]     
112   0.278   -0.246          -0.246  [2]     0.000           0.000   [1]     
212   0.146   -0.478          -0.478  [3]     0.000           0.000   [1]     
221   0.122   -0.390          -0.390  [4]     0.000           0.000   [1]     
222   0.036   0.120           0.120   [5]     0.000           0.000   [1]     
---   -----                   ------                          ------- 
                              812.944                         832.755 

Locus % variance = 0.073	0.000
Proportion of haplotypes covered = 0.985
LRT = 19.811
df  = 4
p   = 0.000544


 
To drop a marker from this 3-SNP haplotype set, e.g. to test the effect of the 2nd SNP conditional on the haplotype background formed by the inclusion of the 3rd and 4th.


 
whap --file candidate --mean 0 --var 1 --alt 2,3,4 --null 3,4 --at 2

WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000. 
293 of 300 individuals are informative
          
Hap   Freq    Alt(B)          Alt(W)          Null(B)         Null(W) 
---   -----   ------          ------          -------         ------- 
122   0.417   0.000           0.000   [1]     0.000           0.000   [1]     
112   0.278   -0.246          -0.246  [2]     -0.335          -0.335  [2]     
212   0.146   -0.478          -0.478  [3]     -0.335          -0.335  [2]     
221   0.122   -0.390          -0.390  [4]     -0.389          -0.389  [3]     
222   0.036   0.119           0.119   [5]     0.000           0.000   [1]     
---   -----                   ------                          ------- 
                              812.944                         815.607 

Locus % variance = 0.073	0.060
Proportion of haplotypes covered = 0.985
LRT = 2.662
df  = 2
p   = 0.264


 
So despite seeing that the second SNP had a significant single marker result, this disappears when its effect is estimated conditional on this haplotype background. This is consistent with the single SNP effect being due to linkage disequilibrium with these other SNPs.


 
Finally, it is possible to manually set up parameter constraints. For instance, say that a previous study found one specific haplotype to be associated with this trait: we may then wish to test that haplotype against all others in a 1 df test. This can be achieved using the --constrain option:
 

whap --file candidate --mean 0 --var 1 --at 2 --constrain 1,1,2,1,1,1,1,1/1,1,1,1,1,1,1,1

WHAP! | v2.03 | 05/09/03 | S. Purcell, P. Sham | purcell@wi.mit.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait: Mean fixed to 0.000. Variance fixed to 1.000. 
275 of 300 individuals are informative
          
Hap       Freq    Alt(B)          Alt(W)          Null(B)         Null(W) 
---       -----   ------          ------          -------         ------- 
2122221   0.313   0.000           0.000   [1]     0.000           0.000   [1]     
2112121   0.169   0.000           0.000   [1]     0.000           0.000   [1]     
2221211   0.122   -0.265          -0.265  [2]     0.000           0.000   [1]     
2212222   0.115   0.000           0.000   [1]     0.000           0.000   [1]     
2122222   0.112   0.000           0.000   [1]     0.000           0.000   [1]     
1112121   0.099   0.000           0.000   [1]     0.000           0.000   [1]     
2222221   0.041   0.000           0.000   [1]     0.000           0.000   [1]     
2212221   0.029   0.000           0.000   [1]     0.000           0.000   [1]     
---       -----                   ------                          ------- 
                                  783.930                         787.673 

Locus % variance = 0.015	0.000
Proportion of haplotypes covered = 0.955
LRT = 3.743
df  = 1
p   = 0.053

Note how the command line set of constraints matches with the numbers in [square brackets] in the output. The first coefficient, fixed to zero, must always be specified as 1 under both the alternate and null.


 
To set up a more complex set of constraints, e.g. to group all haplotypes with 12 at the 3rd and 4th SNPs, and test this group against all other haplotypes: (It is up to the user to ensure that the null model is nested in the alternate.)


 
whap --file candidate --mean 0 --var 1 --at 2 --constrain 1,2,1,2,1,2,1,2/1,1,1,1,1,1,1,1

Hap       Freq    Alt(B)          Alt(W)          Null(B)         Null(W) 
---       -----   ------          ------          -------         ------- 
2122221   0.313   0.000           0.000   [1]     0.000           0.000   [1]     
2112121   0.169   -0.254          -0.254  [2]     0.000           0.000   [1]     
2221211   0.122   0.000           0.000   [1]     0.000           0.000   [1]     
2212222   0.115   -0.254          -0.254  [2]     0.000           0.000   [1]     
2122222   0.112   0.000           0.000   [1]     0.000           0.000   [1]     
1112121   0.099   -0.254          -0.254  [2]     0.000           0.000   [1]     
2222221   0.041   0.000           0.000   [1]     0.000           0.000   [1]     
2212221   0.029   -0.254          -0.254  [2]     0.000           0.000   [1]     
---       -----                   ------                          ------- 
                                  779.126                         787.673 

Locus % variance = 0.031	0.000
Proportion of haplotypes covered = 0.955
LRT = 8.547
df  = 1
p   = 0.00346


 
Note that adding the command --hs will perform all possible 1 degree of freedom haplotype-specific tests, i.e. in place of the omnibus test. When used in conjunction with the --window option, the effect of --hs is slightly modified: only the most significant haplotype-specific association is recorded for each window position: the significance of this "largest specific haplotype" is tested empirically, i.e. --perm is required, by comparing the largest observed value to the largest value in each permuted dataset.


 
That's enought for this example. Next lets consider a different scenario -- a sample of affected individuals, with measured parental genotypes typed on a large number of markers.

Example 2 : A qualitative trait measured in 129 TDT trios; 103 SNPs across a large region

These data are taken from the Rioux et al and Daly et al Nature Genetics articles.
 
The sample conists of 129 affeced-offspring parent trios, with 103 SNPs genotyped on 5q31. We shall perform a single marker analysis of all 103 SNPs, and then an 8-marker sliding window haplotype analysis. Download the PED file chr5q31.ped, the DAT file chr5q31.dat and MAP file chr5q31.map.


 
As highlighted in the conditional analysis section, affected-only TDT trio designs must be analysed with the conditional framework, i.e. modelling genotype conditional on trait. It is also necessary to fix the population prevalence, and to specify that only within-family effects are to be estimated. Let's say the population prevalence is 1%: these options are specified as
whap --file chr5q31 --alt 1 --cond --prev 0.01 --model w
which performs a TDT-type analysis of the first SNP in the sample. To extend this to the entire region (and perform some region-wide empirical significance testing) simply modify the command to

whap --file chr5q31 --window --alt 1 --cond --prev 0.01 --model w --wperm 500
Results of sliding window analysis

Global permutation tests
------------------------
P_MAX = 23.290	 p = 0.0019960080
P_SUM = 681.487	 p = 0.0019960080

Local permutation tests
-----------------------
>> IGR1118a_1 274044 P_1= 4.988		 p= 0.0459081836	 		
>> IGR1119a_1 274541 P_2= 5.677		 p= 0.0159680639	 		
>> IGR1143a_1 286593 P_3= 8.268		 p= 0.0059880240	 		
>> IGR1144a_1 287261 P_4= 5.417		 p= 0.0219560878	 		
>> IGR1169a_2 299755 P_5= 3.569		 p= 0.0958083832	 		
...
...
...
>> GENS020ex3_3 877571 P_100= 1.591	 p= 0.2115768463	 		
>> GENS020ex3_2 877671 P_101= 3.139	 p= 0.0898203593	 		
>> GENS020ex3_1 877809 P_102= 9.297	 p= 0.0019960080	 		
>> GENS020ex1_1 890710 P_103= 2.566	 p= 0.0958083832	 		

Two global empirical significance values are presented: P_MAX and P_SUM. The first compares the maximum observed score in the observed data against each maximum of the 103 scores in permuted datasets. The proportion of times the permuted maximum exceeds the observed maximum gives the significance value (actually the formula (R+1)/(N+1) is used). A similar logic applies for the second global significance value, except it is based on the sum of the 103 scores, rather than the maximum. Note that 0.0019960080 equals 1/501 -- it would be necessary to perform more permutations to obtained more significant possible p-values.


 
Note that although these maximum and summed p-values give an indication of region wide significance, and so correct for the multiple testing inherent in large genomic association scans, the individual SNP p-values are not individually corrected for mulitple testing.


 
Repeating the analysis, but now with an 8-marker sliding window, and looking only at common haplotypes (those with a frequency above 5%) we obtain the following results:

whap --file chr5q31 --window --alt 1,2,3,4,5,6,7,8 --cond --prev 0.01 --model w --wperm 500 --at 5
Results of sliding window analysis

Global permutation tests
------------------------
P_MAX = 21.950	         p = 0.001996
P_SUM = 1261.208	 p = 0.001996

Local permutation tests
-----------------------
IGR1118a_1  274044  P_1=   4.053   p=  0.0538922156  
IGR1119a_1  274541  P_2=   5.574   p=  0.0299401198  
IGR1143a_1  286593  P_3=   5.340   p=  0.0179640719  
IGR1144a_1  287261  P_4=   4.778   p=  0.0319361277  
IGR1169a_2  299755  P_5=   4.504   p=  0.0279441118  
IGR1218a_2  324341  P_6=   4.158   p=  0.0319361277  
...
...
...

Again the global measures are significant. A good way to visualise these results is to plot the base-position column against the -log10(p-value).


Note that a value of -1 for the primary test indicates that no windows were able to span these markers, i.e. given that windows skip gaps greater than 100kb, and so the p-values are 'nan'.



Created by Shaun Purcell; Last updated by Lori Thomas: March 2006