Usage
Input file formats

The basic usage is

    whap --file data

where there are 3 required files in the Merlin/QTDT format: in this case, data.ped, data.dat and data.map. An example partial PED file, data.ped, is shown here:

    FAM1 1 0 0 1 0 4 3 2 2 1 4
    FAM1 2 0 0 2 0 3 3 2 4 1 1
    FAM1 3 1 2 1 2 3 3 2 2 1 4
    FAM2 1 0 0 1 0 4 4 2 2 4 4
    FAM2 2 0 0 2 0 3 4 2 2 0 0
    FAM2 3 1 2 1 2 4 4 2 2 4 4

which represents two parent-offspring trios, measured on a qualitative trait and three SNPs. The first five columns are mandatory: family ID, individual ID, paternal and maternal ID (founders coded "0 0") and sex (1=male, 2=female). Case-control samples must be entered in this same pedigree file format (i.e. each family only has a single member). Genotype data can be coded [1,2], [1,2,3,4] or [A,C,G,T]. The default missing value for genotype data is 0. The default value for missing phenotypic data is -9.

Note: currently only unrelated individuals and parent-offspring trios can be processed. If an individual has only a single parent in the sample, both parents must still be entered (code the absent parent as missing at all genotypes). Likewise, no siblings, or other kinds of relatives, should be in the PED file. Failure to comply with this will lead to unpredictable results.

The DAT file describes the columns of the PED file, in order. In this case:

    A disease
    M SNP1
    M SNP2
    M SNP3

The codes used in a DAT file are as follows:

    A   Affection status (coded 0=unknown 1=absent 2=present)
    B   Disease status (same as Affection status, but coded 0=absent 1=present)
    T   Quantitative trait
    C   Covariate
    M   Marker
    S   Skip this marker
    X   Skip this trait

Only the first trait (whether qualitative or quantitative) found in the DAT file will be analysed by whap.

The MAP file gives the chromosome, marker name, relative marker position in cM and base-position for each SNP listed in the DAT file.
    1 SNP1 0.00 100003
    1 SNP2 0.00 100234
    1 SNP3 0.00 102066

The MAP file must have the same number of markers as specified in the DAT file, in the same order. Even if a marker is skipped in the DAT file, it must be present in the MAP file. Note that in this example, all SNPs have been placed at the same genetic distance on chromosome 1. This rules out the possibility of considering recombinant haplotypes in parent-offspring data, and will be the default option (see the Future developments section). The base-position is used in the sliding window analysis, where large gaps between markers will be skipped.

Support for multiallelic markers

whap can analyse multiallelic markers: the usat program is first used to downcode multiallelic markers into biallelic markers. For example, a multiallelic marker with four alleles will be downcoded as four SNPs. whap knows that the four SNPs in fact represent a single locus and performs all subsequent analyses accordingly. For example, consider the file

    1 1 0 0 1 1 1 2 3 4 A C 221 221
    2 1 0 0 2 1 1 2 4 4 B D 218 221
    3 1 0 0 1 1 2 1 0 0 C C 208 225
    4 1 0 0 1 1 1 2 4 4 D D 221 221
    5 1 0 0 2 2 1 2 4 4 B B 208 208
    6 1 0 0 1 2 2 2 3 4 0 0 225 225
    7 1 0 0 1 2 1 1 4 4 C C 208 221
    8 1 0 0 2 2 1 1 3 4 A B 0 0
    9 1 0 0 1 2 1 2 4 3 C D 198 208

which has the corresponding DAT file

    A aff
    M loc1
    M loc2
    M loc3
    M loc4

and MAP file

    1 loc1 0 100
    1 loc2 0 101
    1 loc3 0 102
    1 loc4 0 103

The usat program is used to downcode these files:

    usat --file mydata

which produces four new files: downcode.ped, downcode.dat, downcode.map and downcode.all. Alternatively, you can choose the file name with the --out option, e.g.:

    usat --file mydata --out new

will produce files new.ped, new.dat, etc.
Let's have a look at the resulting files. First, downcode.ped:

    1 1 0 0 1 1 1 2 3 4 2 1 1 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1
    2 1 0 0 2 1 1 2 4 4 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 1 1 1
    3 1 0 0 1 1 2 1 0 0 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 2 1 1
    4 1 0 0 1 1 1 2 4 4 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1
    5 1 0 0 2 2 1 2 4 4 1 1 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 1
    6 1 0 0 1 2 2 2 3 4 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 1 1
    7 1 0 0 1 2 1 1 4 4 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1
    8 1 0 0 2 2 1 1 3 4 2 1 1 1 1 2 1 1 0 0 0 0 0 0 0 0 0 0
    9 1 0 0 1 2 1 2 4 3 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1

As can be seen, the multiallelic markers have been downcoded into a larger number of SNPs. Any biallelic markers in the original file are left as biallelic markers. These changes are reflected in the DAT file

    A aff
    M loc1
    M loc2
    M loc3_[A]
    M loc3_[C]
    M loc3_[B]
    M loc3_[D]
    M loc4_[221]
    M loc4_[218]
    M loc4_[208]
    M loc4_[225]
    M loc4_[198]

and also in the MAP file

    1 loc1 0 100
    1 loc2 0 101
    1 loc3_[A] 0 102
    1 loc3_[C] 0 102
    1 loc3_[B] 0 102
    1 loc3_[D] 0 102
    1 loc4_[221] 0 103
    1 loc4_[218] 0 103
    1 loc4_[208] 0 103
    1 loc4_[225] 0 103
    1 loc4_[198] 0 103

Note how the marker names have been changed: each allele becomes a SNP, and the name of the allele is appended as _[name].

Finally, the file downcode.all is generated. This file is used by whap: it indicates which SNPs go together to make up the multiallelic markers. In this case we have:

    1 1 4 5

which indicates that the first two markers are SNPs (these are coded '1' and not '2' because the value gives the number of downcoded markers that describe the locus; for SNPs, this will only be a single one). Multiallelic markers will generate 3 or more downcoded SNPs, depending on the number of alleles they possess. Here we see that the third locus has four alleles and the fourth locus has five alleles. Therefore, in total we have 1+1+4+5=11 downcoded SNPs, which corresponds to the number of new SNPs in the PED, DAT and MAP files.
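The downcoding illustrated by these files can be sketched in a few lines of Python. This is not the usat source code: the rule used here (alleles taken in order of first appearance, each allele copy coded 2 if it matches the allele in question and 1 otherwise, missing genotypes left as 0 0) is an assumption inferred from the example files, and the function names are hypothetical.

```python
def allele_order(genotypes, missing="0"):
    """List the distinct alleles in order of first appearance."""
    order = []
    for pair in genotypes:
        for allele in pair:
            if allele != missing and allele not in order:
                order.append(allele)
    return order


def downcode(pair, alleles, missing="0"):
    """Recode one multiallelic genotype as one biallelic genotype per allele."""
    if missing in pair:
        # Assumption: a missing genotype stays missing at every downcoded SNP
        return [(missing, missing)] * len(alleles)
    return [tuple("2" if a == allele else "1" for a in pair) for allele in alleles]


# loc3 genotypes of the nine individuals in the example PED file
loc3 = [("A", "C"), ("B", "D"), ("C", "C"), ("D", "D"), ("B", "B"),
        ("0", "0"), ("C", "C"), ("A", "B"), ("C", "D")]
alleles = allele_order(loc3)          # order used for the loc3_[name] SNPs
first = downcode(loc3[0], alleles)    # individual 1, genotype A C

# Entries of the ALL file: number of downcoded SNPs per original locus
all_counts = [1, 1, 4, 5]
total_snps = sum(all_counts)          # should match the marker count in the DAT file
```

Under these assumptions the allele order and the recoded genotypes agree with the loc3 columns of downcode.ped shown above.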
The user need not be concerned with the details of the ALL file. However, if whap is run using the --usat option, then an ALL file (with the root filename as specified in the --file option) must be present in the current directory. Having downcoded the data, running whap with the --usat option as follows will take account of the multiallelic markers. In this case, the 3rd locus would be analysed:

    whap --file downcode --usat --alt 3
WHAP! | v2.05 | 16/Aug/04 | S. Purcell, P. Sham | purcell@wi.mit.edu
9 individuals w/out parents. 0 individuals with parents. Binary trait:
9 of 9 individuals/trios are informative
Hap Freq Alt(B) Alt(W) Null(B) Null(W)
--- ----- ------ ------ ------- -------
[C] 0.375 0.000 0.000 [1] 0.000 0.000 [1]
[B] 0.250 1.072 1.072 [2] 0.000 0.000 [1]
[D] 0.250 -1.250 -1.250 [3] 0.000 0.000 [1]
[A] 0.125 -0.886 -0.886 [4] 0.000 0.000 [1]
--- ----- ------ -------
10.397 12.365
Proportion of haplotypes covered = 1.000
LRT = 1.968
df = 3
p = 0.579
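The last lines of this output can be checked by hand. The sketch below assumes that the two column totals (10.397 and 12.365) are -2 log-likelihoods of the alternate and null models and that the p-value is taken from a chi-square distribution with the stated degrees of freedom; the output is consistent with both assumptions, but neither is stated explicitly here. The helper chi2_sf is a hypothetical name and uses only the Python standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum of Poisson-type terms
        terms = (half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * sum(terms)
    # Odd df: erfc term plus a finite sum over half-integer gamma values
    terms = (half ** (k - 0.5) / math.gamma(k + 0.5)
             for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


alt_fit, null_fit = 10.397, 12.365   # column totals from the output above
lrt = null_fit - alt_fit             # likelihood ratio test statistic
p_value = chi2_sf(lrt, df=3)         # df = 3: four haplotypes, one is the reference
print(f"LRT = {lrt:.3f}, p = {p_value:.3f}")
```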
Running the same whap command without the --usat
option on the downcode.* files would instead have performed a
test of the A allele of the third locus (as this is the 3rd
position in downcode.map). That is, one might desire to
perform a series of allele-specific tests over the entire dataset: e.g.
    whap --file downcode --window --perm 500

    >> loc1 100 P_1= 0.281 p= 0.9261477046
    >> loc2 101 P_2= 0.555 p= 0.4790419162
    >> loc3_[A] 102 P_3= -0.000 p= 0.9820359281
    >> loc3_[C] 102 P_4= 0.000 p= 0.8303393214
    >> loc3_[B] 102 P_5= 1.075 p= 0.2275449102
    >> loc3_[D] 102 P_6= 1.042 p= 0.4630738523
    >> loc4_[221] 103 P_7= 3.222 p= 0.1596806387
    >> loc4_[218] 103 P_8= 1.522 p= 0.2015968064
    >> loc4_[208] 103 P_9= 2.638 p= 0.1596806387
    >> loc4_[225] 103 P_10= 0.266 p= 0.6846307385
    >> loc4_[198] 103 P_11= 1.510 p= 0.5429141717

whereas running the same analysis with the --usat option would produce a locus-specific set of tests:

    whap --file downcode --window --usat --perm 500

    >> loc1 100 P_1= 0.281 p= 0.7285429142
    >> loc2 101 P_2= 0.555 p= 0.3712574850
    >> loc3 102 P_3= 1.968 p= 0.7265469062
    >> loc4 102 P_4= 4.953 p= 0.6447105788

NOTE: running the original test.ped through whap would produce an error: whap can only analyse SNPs (even though those SNPs may represent downcoded multiallelic markers).

Basic commands

To perform a single marker analysis of the 3rd SNP:

    whap --file data --alt 3

To test the haplotypes formed by only the first two markers from the data set in data.ped (there must be corresponding files data.dat and data.map), use the --alt option. Note that the marker numbers must be comma-separated and not have any spaces between them:

    whap --file data --alt 1,2

Note that omitting the --alt command implicitly specifies that all SNPs should be included to form haplotypes, i.e. in this case it is equivalent to

    --alt 1,2,3

To test the effect of dropping the first SNP whilst controlling for the second SNP (i.e. imposing a set of equality constraints under the null, across haplotypes that are identical at position 2), add the --null option.
Note: alt refers to the alternate hypothesis; null refers to the nested null submodel against which the alternate is compared:

    whap --file data --alt 1,2 --null 2

Note that omitting the --null command implicitly specifies that no SNPs should be included to form haplotypes under the null.

To manually constrain specific haplotypes to have equal estimated coefficients use the --constrain option. You must know in advance how many haplotypes are formed given the --alt option (and any frequency thresholds, etc., that are applied; see below). Imagine that, in this case, there are four haplotypes (as found, e.g., by first running the program without the --constrain option). The list before the forward slash specifies the parameter list for the four haplotypes under the alternate; the parameter list after the forward slash is for the null. Note that the numbers after the --alt and --null options refer to SNP numbers, so the length of the list indicates how many SNPs are used to build haplotypes; in contrast, the numbers after a --constrain option refer to parameter numbers, so the length of each list must match the number of haplotypes inferred from the data.

    whap --file data --alt 1,2 --constrain 1,1,2,2/1,1,1,1

Note that the first haplotype always has a regression coefficient fixed to 0 (to identify the model). The first item in both the null and alternate constraint lists must therefore be a 1. In the above example, we test whether or not the first two haplotypes as a group are significantly different from the last two haplotypes as a group. In the next example, we test the third haplotype versus all others:

    whap --file data --alt 1,2 --constrain 1,1,2,1/1,1,1,1

In the following example, we test whether the effects of the first three haplotypes are homogeneous (i.e. a non-significant likelihood ratio test would be evidence that they were):
    whap --file data --alt 1,2 --constrain 1,2,3,4/1,1,1,2

The user is responsible for ensuring that the null model is nested within the alternate model, so that a valid likelihood ratio test can be performed. The --constrain and --null options cannot be specified together.

For case/control data, add the option --cc-freqs to obtain a table of expected haplotype counts stratified by case versus control status.

Individuals are not automatically removed based on missing genotype data. Rather, the number of informative individuals in the output refers to individuals included in the final analysis. Individuals will be removed if they have too many possible haplotype phase assignments (i.e. the phasing failed to adequately resolve haplotype phase, so the individual would contribute little information for association) or if none of the possible haplotype phases contain common haplotypes (i.e. above the --ap or --at thresholds). The number does not reflect the amount of missing genotype data in an exact way, nor whether trios are informative for within-family association (i.e. having heterozygous parents).

Confidence intervals

To request confidence intervals for parameter estimates, add the --ci option along with the desired confidence level. For example, for 95% confidence intervals:

    whap --file candidate --ci 0.95

This will perform an omnibus test of haplotype association on this dataset; in this case, we will estimate one parameter per haplotype, except for the reference haplotype. The output in this case is:
WHAP! | v2.10 | 17/Oct/06 | S. Purcell, P. Sham | spurcell@pngu.mgh.harvard.edu
300 individuals w/out parents. 0 individuals with parents. Continuous trait:
292 of 300 individuals/trios are informative
Hap Freq Alt(B) Alt(W) Null(B) Null(W)
--- ----- ------ ------ ------- -------
2122221 0.305 0.000 0.000 [1] 0.000 0.000 [1]
2112121 0.165 -0.234 -0.234 [2] 0.000 0.000 [1]
2221211 0.119 -0.435 -0.435 [3] 0.000 0.000 [1]
2212222 0.112 -0.434 -0.434 [4] 0.000 0.000 [1]
2122222 0.109 0.085 0.085 [5] 0.000 0.000 [1]
1112121 0.097 -0.197 -0.197 [6] 0.000 0.000 [1]
2222221 0.040 0.136 0.136 [7] 0.000 0.000 [1]
2212221 0.028 -0.479 -0.479 [8] 0.000 0.000 [1]
2112222 0.014 -0.212 -0.212 [9] 0.000 0.000 [1]
2121211 0.012 -0.299 -0.299 [10] 0.000 0.000 [1]
--- ----- ------ -------
807.074 828.496
Proportion of haplotypes covered = 0.980
LRT = 21.422
df = 9
p = 0.0109
[1] = 0.330 ( 0.067, 0.593 )
[2] = 0.964 ( 0.883, 1.044 )
[3] = -0.234 ( -0.474, 0.006 )
[4] = -0.435 ( -0.719, -0.152 )
[5] = -0.434 ( -0.707, -0.162 )
[6] = 0.085 ( -0.195, 0.366 )
[7] = -0.197 ( -0.497, 0.103 )
[8] = 0.136 ( -0.290, 0.562 )
[9] = -0.479 ( -0.982, 0.024 )
[10] = -0.212 ( -0.856, 0.431 )
[11] = -0.299 ( -1.008, 0.411 )
The final part of the output contains the confidence intervals for all estimated parameters. In this case
of a quantitative trait, the mean and variance were also estimated, so these are the first two values
(0.330 and 0.964) with their respective confidence intervals. The first haplotypic effect is therefore in the
third row (in this case -0.234), which corresponds to the estimate for the 2112121 haplotype (i.e. the
second most common haplotype, as the most common haplotype is fixed to be the reference haplotype). The
following estimates correspond to the remaining haplotypes, in the same order as shown in the first part
of the output. The numbers in brackets are the lower and upper bounds of the confidence interval. Note
that for binary traits, or if the mean and/or variance is fixed, there may be fewer than two parameters
before the haplotype effects in the confidence interval output.
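The row-to-parameter mapping described above can be made explicit with a small sketch. It assumes the layout of this particular run (a quantitative trait, so the mean and variance come first, followed by the non-reference haplotypes in table order); the regular expression and the variable names are hypothetical and only match the output format shown here.

```python
import re

ci_output = """
[1] = 0.330 ( 0.067, 0.593 )
[2] = 0.964 ( 0.883, 1.044 )
[3] = -0.234 ( -0.474, 0.006 )
[4] = -0.435 ( -0.719, -0.152 )
[5] = -0.434 ( -0.707, -0.162 )
[6] = 0.085 ( -0.195, 0.366 )
[7] = -0.197 ( -0.497, 0.103 )
[8] = 0.136 ( -0.290, 0.562 )
[9] = -0.479 ( -0.982, 0.024 )
[10] = -0.212 ( -0.856, 0.431 )
[11] = -0.299 ( -1.008, 0.411 )
"""

# Non-reference haplotypes, in the order of the first part of the output
haplotypes = ["2112121", "2221211", "2212222", "2122222", "1112121",
              "2222221", "2212221", "2112222", "2121211"]
labels = ["mean", "variance"] + haplotypes

row = re.compile(r"\[(\d+)\]\s*=\s*(-?\d+\.\d+)\s*\(\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)\s*\)")
estimates = {}
for index, est, lower, upper in row.findall(ci_output):
    estimates[labels[int(index) - 1]] = (float(est), float(lower), float(upper))

# Haplotype effects whose 95% interval does not include zero
excludes_zero = [h for h in haplotypes
                 if estimates[h][1] > 0 or estimates[h][2] < 0]
print(excludes_zero)
```

For a binary trait, or when the mean or variance is fixed, the leading labels would need to be adjusted accordingly.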
Models

The default setting is to equate between and within family effects. Use the --model option to alter this. The options are t, b and w for total, between and within effects. Total implies estimating both between and within effects but constraining them to be equal (this is the default, so specifying it is superfluous):

    whap --file data --alt 1,2 --model t

If one has parental genotype data, a within-only test can be constructed:

    whap --file data --alt 1,2 --model w

Both between and within components can be estimated but not constrained to be equal by using the bw option. In the example below, the forward slash separates the models under the alternate and the null. In this case, we are not dropping any markers (i.e. the --null option specifies the same markers as the --alt option) but the model is changing (i.e. we are testing the impact of equating between and within components). A significant result would therefore be consistent with population stratification.

    whap --file data --alt 1,2 --null 1,2 --model bw/t

An alternate way to formulate the within family test is to drop the within component in the presence of the between component:

    whap --file data --alt 1,2 --null 1,2 --model bw/w

For TDT affected-only scenarios (see the conditional analysis section below) --model w is to be used (as there will be no between family variation). Finally, to specify a between-only test of association:

    whap --file data --alt 1,2 --model b

Permutation tests

Empirical p-values can be easily generated by Monte-Carlo methods: use the --perm option followed by the number of permutations:

    whap --file data --alt 1,2 --perm 5000

The --perm option permutes the trait values in the sample, thereby breaking down the phenotype-haplotype relationship but keeping both haplotype and phenotype distributions individually intact. In the case of an affected-only analysis (e.g. TDT), however, there will be no phenotypic variation.
An alternative test to use when parental genotypes are present is the --wperm option, which permutes the sign of the within components (c.f. permuting which are the transmitted and untransmitted haplotypes). The --model w option must be used in conjunction with the --wperm option if there is no phenotypic variation and parental genotypes are present.

    whap --file data --alt 1,2 --model w --wperm 5000

It is possible to use both --perm and --wperm together.

Rare haplotypes

By default, haplotypes with less than 1 percent sample frequency are ignored during analysis. This can be changed by setting --at to a different value, for example 5 percent:

    whap --file data --alt 1,2 --at 5

The variant form --ap takes the threshold as a proportion rather than a percentage, which allows fractional percentages to be specified: i.e. --at 5 is equivalent to --ap 0.05 (but saves some typing). So to include all haplotypes with frequency above half a percent in the sample (e.g. if one has a large sample) use:

    whap --file data --alt 1,2 --ap 0.005

Conditional analysis

To model genotype conditional on trait (instead of trait conditional on genotype, the standard approach) use the --cond option. For quantitative traits, a conditional approach is more robust in selected and mildly non-normal samples. The mean and variance should also be fixed to population values when performing conditional analysis:

    whap --file data --alt 1,2 --cond --mean 0 --var 1

A common application of the conditional test would be for trio TDT affected-only samples, i.e. there is no phenotypic variation, so the trait cannot be the dependent variable. For TDT samples, the population prevalence must also be specified, a within-family test must be used, and any permutations should permute the within-family genotypic component, rather than the trait value. That is:

    whap --file data --alt 1,2 --cond --prev 0.01 --model w --wperm 500

In other words, the four options --cond, --prev, --model w and --wperm must all be in place for TDT analyses.
Otherwise, models will fail to converge, or produce spurious results.

Secondary analysis

The secondary analysis looks for pairwise associations between haplotype similarity defined genetically and the similarity of the MLE regression coefficients under the alternate hypothesis. Simply add the --sec option to perform the secondary test.

    whap --file data --alt 1,2,3,4 --sec --cond --prev 0.01 --wperm 500

The secondary test will work better if all haplotypes are reasonably common (or if the sample is large, so that the absolute copy number is sufficient). If the secondary test cannot be performed, a 'nan' code will result. If within-family effects are estimated under the alternate, these are used for the secondary analysis in preference to between family effects. A permutation test (either --perm or --wperm) must be used to evaluate the significance of the secondary test. Empirical p-values are given for both global and local secondary tests (in sliding window mode, only global secondary tests are considered). As well as being given individually, the secondary p-values are combined with the primary p-value (again, not in sliding window mode).

Sliding window analysis

To cover a large region with a simple sliding window approach, add the --window option to the command line. In this case, the --alt model must contain the value 1, and it is interpreted as a set of offsets to slide across all markers. So the command:

    whap --file data --alt 1,2,3,4 --window --cond --prev 0.01 --wperm 500 --sec

performs a 4-marker sliding window analysis (in this case, adopting a conditional approach and also performing secondary tests). That is, the first window is 1,2,3,4; the second window is 2,3,4,5; the third window is 3,4,5,6; etc. To change the increment step use

    --winstep 4

which would give 1,2,3,4; then 5,6,7,8; then 9,10,11,12; etc. At each position, the results of any covering windows are averaged to produce a final statistic.
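The way the --alt offsets and the --winstep increment generate windows can be illustrated with a short sketch. This only enumerates marker numbers; it is not whap's implementation, and it assumes that sliding stops when the last offset would pass the final marker. It also ignores the skipping of large base-position gaps. The function name and the marker counts are hypothetical.

```python
def sliding_windows(offsets, n_markers, step=1):
    """Yield the marker numbers of each window as the offsets slide along."""
    if 1 not in offsets:
        raise ValueError("the offset list must contain 1")
    shift = 0
    while max(offsets) + shift <= n_markers:
        yield [marker + shift for marker in offsets]
        shift += step


# --alt 1,2,3,4 --window on a hypothetical 8-marker data set
default_step = list(sliding_windows([1, 2, 3, 4], n_markers=8))

# --alt 1,2,3,4 --window --winstep 4 on a hypothetical 12-marker data set
step_of_four = list(sliding_windows([1, 2, 3, 4], n_markers=12, step=4))

for window in step_of_four:
    print(",".join(str(marker) for marker in window))
```

With a step of 1, each interior marker is covered by several windows, which is why the per-position statistic is an average over covering windows.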
The averaged statistic at each position is tested by permutation only (global significance values will also be given for the maximum and summed statistics). In the above case, the test is a simple 'omnibus' test of all haplotypes in each 4-marker region. This will almost always be what is required. It is possible to construct different tests, however: e.g.

    whap --file data --alt 1,3,5 --null 1,5 --window --perm 5000

In this example, the first window drops SNP 3 from SNPs 1, 3 and 5; the next drops SNP 4 from SNPs 2, 4 and 6; the next drops SNP 5 from SNPs 3, 5 and 7; etc.

Use the --allres command to get more output when performing a sliding window analysis. Note that certain progress indicators are sent to STDERR (whilst all results are always sent to STDOUT). Therefore, it usually makes sense to redirect the output:

    whap --file data --alt 1 --window --perm 500 > my_results_file.txt

Note: the above command is a quick way to obtain all single marker results (as well as a measure of significance for the maximum result and summed results).

Haplotype-specific analysis

The default analysis mode performs an omnibus test of all haplotype effects, i.e. if there are H haplotypes, a single test with H-1 degrees of freedom. To perform haplotype-specific analyses, i.e. H separate 1 degree of freedom tests, each of a specific haplotype versus all other haplotypes, use the --hap-specific option (called --largest in earlier versions of whap):

    whap --file data --alt 1,2,3,4 --hap-specific

The above form, i.e. not in sliding window mode, will automatically perform all H analyses, setting the constraints for each as required. The output is somewhat verbose: a more minimalist table of results can be obtained by using --hs instead of --hap-specific.

The behaviour of --hap-specific is slightly different in --window mode: instead of calculating the omnibus test statistic at each window position, the largest of the specific haplotype tests is taken as the statistic for that position.
Significance is assessed empirically, i.e. the largest specific haplotype score will also be taken from each permuted dataset.

    whap --file data --alt 1,2,3,4 --window --perm 500 --hap-specific

Dominant and recessive models

It is possible to fit dominant and recessive models only when performing haplotype-specific tests. For example,

    whap --file data --alt 1,2,3,4 --hs --dom

or

    whap --file data --alt 1,2,3,4 --hap-specific --rec

These options force a coding of {0,1,1} or {0,0,1} for the genotypes / haplotype pairs aa, Aa, AA, instead of the default dosage model of {0,1,2}.

It is important to note that if the --dom flag is set, the first haplotype tested will actually be tested with a recessive model, while all the other haplotypes will be tested with a dominant model. Likewise, if the --rec flag is used, then all haplotypes will be tested assuming a recessive model, except for the first one, which will be tested with a dominant model. The reason for this is that whap always treats the first haplotype as the reference category: therefore, the test of the first haplotype is actually testing all haplotypes that aren't haplotype 1 against the null of no association (in contrast, the test of haplotype 2 is simply a test of haplotype 2 against the null of no association). To say that haplotype 1 has a dominant effect is the same as saying that the set of all haplotypes that aren't 1 has a recessive effect. The sign of the first haplotype test is reversed automatically, such that under the normal multiplicative model the user does not need to worry about this complication; for dominant or recessive models, the coefficient is reversed automatically, but the model is not. A warning is therefore printed to indicate whether a dominant or recessive model applies to each haplotype.

Covariates and moderator variables

Covariates are specified in the DAT file by a 'C' code, and will be ignored by default.
To include a covariate effect in a model, use either the --cov or the --alt-cov option. The first form estimates the covariate effects under both the alternate and null models.

    whap --file data --alt 1,2 --cov

The second form estimates the covariate effects only under the alternate model. For example,

    whap --file data --alt 1,2 --alt-cov

In other words, the first form provides a test of haplotype effects whilst adjusting for the covariate(s). The second form provides a joint test of the covariate(s) and the haplotype effects. It is also possible to test the effect of the covariate by itself: for example,

    whap --file data --alt 1 --null 1 --alt-cov

Unlike phenotypes and genotypes, covariates cannot be given a missing value code. With missing covariate data, imputing a value for the missing values, e.g. the mean, is usually best.

It is important to note that estimating covariate effects may be restricted when adopting a conditional approach to analysis. If there are significant haplotypic effects, then the covariate parameter(s) will be identified and able to be estimated; otherwise, the parameter values are likely to be incorrect (and the test will have no power). This is because the covariate will not have any direct effect on haplotype frequencies (i.e. we are modelling L(G|X,C)): only if the trait and haplotypes are associated will the covariate coefficients be identified. In practice: when performing conditional analysis, only include covariates in simple scenarios where a main effect of the haplotype has already been established.

By adding the --gxe or --alt-gxe option (note: these automatically imply --cov and --alt-cov respectively) a further set of coefficients is estimated (one for each haplotype). These coefficients represent the potential moderating effect of the covariate on the haplotype effect. A similar note applies with conditional analysis.
Note: the population mean and variance of the covariate should be entered in the DAT file, as two numbers on the same line on which the 'C' is declared. For example

    A cancer
    C smoking 0.2 0.16
    M SNP1
    M SNP2
    M SNP3

The covariate can either be a quantitative variable, or a binary variable. In the example above, the prevalence of smoking may be 20%. If the covariate is coded 0/1 for non-smoker/smoker, the mean will be 0.2 and the variance will be 0.2*0.8 = 0.16.

To perform a scan allowing for potential epistatic effects, for example, it is possible to code the presence or absence of a specific risk allele/genotype/haplotype elsewhere in the genome as a covariate, coded 0/1. A command such as the following could then be used.

    whap --file data --window --alt 1 --alt-gxe --allres

NOTE: Any covariate should be uncorrelated with the haplotypes at the test locus: this is less critical with case/control analysis but very important for TDT-type data, when the conditional L(G|X,E) is used.

Mixed samples, extended pedigrees

Combined case/control and TDT trios samples

It is possible to combine unrelated and trio data in a single sample. Note that unrelated individuals will only contribute to the between family component of variation. A common scenario might involve combining a case/control and TDT affected-only offspring sample. The best approach is to use --model t in conjunction with the extra option --no_trio_b, which will ensure that the trios only contribute to the within family component. Note that it will still be necessary to perform a conditional analysis and fix the prevalence in these conditions. It will also be necessary to specify both --perm and --wperm (note: the number of permutations must be entered for both, and the same number should be used for each).
A test of whether or not case/control and trio samples are contributing equivalent evidence can then be constructed by replacing the --model t (the default) with --model bw/t.

Extended pedigrees

One imperfect approach to extended pedigree data is to split large families up into all possible parent-offspring trios (whether or not the offspring is affected) and perform the analysis as if they were independent, using --model w so that each test is conditional on parental genotypes. Alternatively, a single trio per family could be randomly selected.

Miscellaneous options

Fixing nuisance parameters

It is often desirable and sometimes essential to fix the trait parameters to known population values. For quantitative traits, the mean and total trait variance can be fixed (note: this greatly aids model convergence). If the trait is standard normal in the general population:

    whap --file data --alt 1,2 --mean 0 --var 1

This approach is particularly useful when samples have been selected. (In this case, use the prior-to-selection population values for the mean and variance.) For qualitative traits, only the population prevalence is specified:

    whap --file data --alt 1,2 --prev 0.02

When performing a conditional TDT affected-only analysis, the prevalence must be specified. So long as the disease is rare in the general population, misspecifying the prevalence will have little or no impact on the test statistics.

Input/Output options

To get detailed output on the convergence of the model use the --verbose option. To get a detailed dump of the haplotype reconstruction for individuals and trios, use the --dump option. To get even more output, use the --debug option (not recommended). As mentioned above, the --allres option gives more output in sliding window mode. It also gives more output when permutation tests are being used.
To set the codes for missing genotypes or missing phenotypes, use the following options:

    whap --file data --alt 1,2 --missing-geno X --missing-pheno -9

Misc options

Ambiguity of E-M inferred phasing

By default, whap considers all possible haplotype assignments as determined by the E-M algorithm. For example, for an individual with three-locus genotype [1/2, 1/2, 2/2], there may be 2 possible states with frequencies:

    Pat/Mat   P(H|G)
    112/222   0.98
    212/122   0.02

To specify the level of relative ambiguity in haplotype assignment that is tolerated, use the --th option, which takes a value between 0 and 1: this represents the fraction of the probability of the most likely phase assignment below which possible assignments will be ignored (in other words, a haplotype may be common, but if in any one individual it is very unlikely, this unlikely phase will be ignored). The default is 0, meaning that all phases with a nonzero probability are considered. To look only at the most likely haplotypes, set this option to 1. For example,

    whap --file data --alt 1,2 --th 1

would only consider the 112/222 phase. Note that if more than one phase is 'equally most likely' (e.g. two phases at 0.50 and 0.50) then both will be included even with --th 1.

Optimisation parameters

The maximum likelihood estimation procedure involves a round of simulated annealing to obtain starting values followed by downhill simplex optimisation. It is possible to perform repeated simulated annealing / simplex cycles with the --repeat option (default is 5 repeats). Using it in combination with the --verbose option will indicate how this changes the analysis.
    whap --file data --alt 1,2 --repeat 5

Other options set values that probably shouldn't be changed. These are (with their default value shown, where given):

    --eps 1e-8     alters the small value (epsilon) used in calculating the Hessian matrix (involved in secondary analysis)
    --tol 1e-6     sets the tolerance of the simplex procedure
    --iter 20000   sets the maximum number of iterations the simplex is allowed
    --SAtemp 10    sets the starting temperature of the simulated annealing cycles
    --SAcycle      sets the number of simulated annealing cycles
    --SAdecline    sets the rate of temperature decline over the simulated annealing cycles
    --SAiter 25    sets the number of iterations to be performed at each temperature cycle

Created by Shaun Purcell; Last updated by Lori Thomas: March 2006