A model-fitting implementation of the DeFries-Fulker model for selected twin data

Shaun Purcell and Pak Sham


Introduction | means.gawk | prepare.gawk | df.mx | df-sl.mx | (G)AWK | Mx
Introduction
These scripts are intended to facilitate a model-fitting approach to the DeFries-Fulker (DF) model of twin analysis, outlined in Purcell & Sham. Two gawk scripts (1) calculate the zygosity-specific proband and population means and (2) transform the dataset based on these means. The Mx scripts implement the basic DF model and the sex-limited model including DZ-O twins.

means.gawk
Usage : gawk -f means.gawk t=2 d=1 data1.raw to specify a threshold of 2 or above. To select probands as scoring below t, d=-1. This will create a file called prepdata.dat which is used by prepare.gawk. If you wish to specify your own proband and population means a priori, you can edit the file prepdata.dat: the file should contain four numbers, one per line, which are the MZ population mean, the DZ population mean, the MZ proband mean, the DZ proband mean.

As an example, consider the following dataset (ID, zygosity, twin 1, twin 2)
1  1  5 4
2  1  6 7
3  1  7 8
4  1  8 9
5  1  9 6
6  1  4 5
7  2  4 7
8  2  5 6
9  2  6 9
10 2  7 4
11 2  8 5
12 2  9 8
The output of running the means.gawk script is shown below
Total N ind.    :  24
Grand mean      :  6.5
 
MZ ind N        :  12
MZ mean         :  6.5
 
DZ ind N        :  12
DZ mean         :  6.5
 
Ind > t         :  8
Proband mean    :  8.5
Cotwin mean     :  7.25
 
MZ > t          :  4
MZ proband mean :  8.5
MZ cotwin mean  :  7.5
 
DZ > t          :  4
DZ proband mean :  8.5
DZ cotwin mean  :  7
 
Estimate of rMZ :  0.5
Estimate of rDZ :  0.25
Estimate of h2g :  0.5
which gives a simple estimate of heritability from the DF model. Also, the file prepdata.dat will have been written to the local directory containing the means used by the next script.

prepare.gawk
Usage : gawk -f prepare.gawk t=2 d=1 data1.raw > data1.trz to create a transformed dataset called data1.trz.

To continue the example given above, running the prepare.gawk script on the above data, given the same threshold is specified will produce the transformed datafile that the Mx script requires.
1  1  -0.75  -1.25  0 0
2  1  -0.25   0.25  0 0
3  1  0.25    0.75  0 1
4  1  0.75    1.25  1 1
5  1  1.25   -0.25  1 0
6  1  -1.25  -0.75  0 0
7  2  -1.25   0.25  0 0
8  2  -0.75  -0.25  0 0
9  2  -0.25   1.25  0 1
10 2  0.25   -1.25  0 0
11 2  0.75   -0.75  1 0
12 2  1.25    0.75  1 1

df.mx
Usage : mx < df.mx > results1 to perform DF analysis under Unix/Linux. Edit df.mx to constrain the genetic parameter and re-run the script in order to determine the difference in -2 log-likelihood of the models. If we were to run the Mx script on our example dataset (taking care to make sure the correct datafile name is specified) we obtain the following output (extracted)
  MATRIX Z
 This is a computed FULL matrix of order    3 by    1
  [=A_C_I-(A+C)]
               1
 1    5.0000E-01
 2   -2.3615E-06
 3    5.0000E-01
  
 Your model has    4 estimated parameters and     48 Observed statistics
  
 -2 times log-likelihood of data >>>    58.098
 Degrees of freedom >>>>>>>>>>>>>>>>        44
In other words, the Z matrix contains the parameter estimates, which agree with the earlier estimate - heritability is 50%, the remaining variance can be ascribed to nonshared environmental factors. In such a small sample, the power to say that the heritability estimate is significantly nonzero will be low. Fixing the A matrix by editing the Mx script, the results are now :
  MATRIX Z
 This is a computed FULL matrix of order    3 by    1
  [=A_C_I-(A+C)]
          1
 1   0.0000
 2   0.3750
 3   0.6250
  
 Your model has    3 estimated parameters and     48 Observed statistics
  
 -2 times log-likelihood of data >>>    58.360
 Degrees of freedom >>>>>>>>>>>>>>>>        45
For 1 degree of freedom difference (45 - 44 = 1) we have a difference in -2 times the log-likelihood of only 58.360 - 58.098 = 0.262. This illustrates why twin studies require more than a dozen pairs.

df_sl.mx
Usage : mx < df_sl.mx > results1 to perform DF analysis. Edit df_sl.mx to perform the tests outlined in Purcell & Sham.

gawk
The gawk or awk utility should be available on any Unix/Linux operating system. To download a version of gawk for MS-DOS follow the link above.

Mx
Mx is the model-fitting package used to implement the DF model. It can be freely downloaded for various platforms from the link above.


Purcell S & Sham PC (submitted). A model-fitting implementation of the DeFries-Fulker model for selected twin data.
Please note: Users of certain older or newer version of MS-DOS may encounter problems with long filenames -- rename the `.gawk` extensions to `.gwk`. Users of Windows XP may also encounter many problems...
Page created by Shaun Purcell.