|
L-POP : Detection of population substructure
Shaun Purcell & Pak Sham, Jan 2001.
SGDP Research Centre, Institute of Psychiatry, London, UK.
|
L-POP detects population stratification in samples of
unrelated individuals for whom a number of unlinked genotypes have
been measured.
Using a latent class analysis model, population strata are
defined such that unlinked markers are in Hardy-Weinberg equilibrium
and linkage equilibrium within classes.
L-POP calculates the posterior probabilities for each
individual belonging to each of K classes, which can be used
to correct for spurious effects of population
stratification in tests of association.
Download the appropriate L-POP version:
- *nix C/C++ source version lpop.tar.gz
(use tar -zxvf to extract; cd to the 'lpop' folder and type make)
- Windows (DOS) version lpop-dos-1-2.zip
(use Winzip to extract)
Please note: L-POP is a developing program: please
contact me to
report any bugs or problems.
To run L-POP use the following syntax
lpop < param
where param is a parameter file describing the data and
model. To redirect output to a file (recommended) type
lpop < param > results
where results can be any file name. It is often useful
to output more than one set of results to the same file: in this case,
use the redirect-and-append operator
lpop < param >> results
The param file contains information on
- the number of classes in the model
- whether there is admixture between classes
- the format of the genotype data file
- commands to choose additional options and extras
The first line of a param file is the title of the job.
Comments are made # like this # (note: there needs to be a
space between the # and adjacent characters).
The basic keywords in a param file are shown in the table
below.
| COMMAND |
Argument(s) |
Default |
Description |
| DATAFILE |
String |
data.raw |
Specify the file containing the data |
| CLASS |
Positive Integer |
2 |
Set number of ancestral classes |
| MISSING |
Integer |
-9 |
Specify missing allele code used in datafile |
Data format commands
| COMMAND |
Argument(s) |
Default |
Description |
| ALLELE |
Positive Nonzero Integer |
5 |
Specify the maximum number of alleles at a locus in the datafile |
| STRUCTURE |
|
OFF |
Turn ON structure datafile format |
| HAPLOID |
|
OFF |
Datafile contains haploid (as opposed to diploid) organisms |
| PHENO |
Positive Integer |
0 |
Number of phenotypes in datafile |
| FIX2CLASS |
|
OFF |
Turn ON fix-to-class option (datafile must contain a fix-to-class
column in this case) |
| XCHR |
int |
-1 |
Value for missing allele in the case of X chromosome data |
Model option commands
| COMMAND |
Argument(s) |
Default |
Description |
| A1AUTO |
Positive Integer |
0 |
Set number of admixed ancestral classes |
| A1LEVEL |
Positive Nonzero Integer |
2 |
Set admixture resolution |
| A1CLASS |
Two class numbers |
{no default} |
Specify a Type I admixture class between two other classes |
| A2CLASS |
Two class numbers |
{no default} |
Specify a Type 2 admixture class between two other classes |
| HWD |
|
OFF |
Turn ON option to relax HW-E constraint within classes |
Output format commands
| COMMAND |
Argument(s) |
Default |
Description |
| TAG |
String |
|
Set TAG to be used in tracking output
|
| VERBOSE1 |
|
OFF |
Turn ON ouput of genotypic configuration and allele frequencies |
| VERBOSE2 |
|
OFF |
Turn ON ouput of E-M convergence information |
| VERBOSE3 |
|
OFF |
Turn ON ouput of E-M iteration information |
| MIN |
|
OFF |
Turn ON minimalist ouput |
| DISTANCE |
|
OFF |
Turn ON distance analysis mode |
| ENTROPY |
|
OFF |
Turn ON output of individual entropy measure |
| LOCINFO |
|
OFF |
Turn ON output of locus informativeness |
| CSLOCINFO |
|
OFF |
Turn ON output of class-specific locus informativeness |
Miscellaneous commands
| COMMAND |
Argument(s) |
Default |
Description |
| RAND |
Integer |
12345678 |
Set the random number generator seed (to clock time if RAND is 0) |
| REPEAT |
Positive Nonzero Integer |
10 |
Number of times to attempt E-M convergence |
| TOL |
|
0.0001 |
Set tolerance for E-M convergence |
The DATAFILE keyword points to the file containing the
genotype data, which can be in one of two formats: L-POP or
STRUCTURE. For both formats, each line must conform to the
following standard:
ID (Fix-to-class column) (Phenotypes) Genotype data
where the items in parentheses are optional. Diploid data in
L-POP format, for two individuals on three loci with no
phenotypes:
ID1 1 2 3 1 1 2
ID2 2 2 1 4 1 1
Same data in structure format:
ID1 1 3 1
ID1 2 1 2
ID2 2 1 1
ID2 2 4 1
Haploid data (L-POP (one individual per line) and
structure (one chromosome per line) formats are equivalent in
this case) for two individuals on three loci with one phenotype:
ID1 2.344 2 2 1
ID2 1.762 1 2 1
Values can be separated either by whitespace or tabs. Phenotype
values are currently not used by L-POP.
Fix-to-class option
If a fix-to-class column is included in the data (requiring the
fix-to-class variable in the param file to be set,
the values in the second column have the following meanings:
-1 ignore this individual (as if all genotype data missing)
0 estimate posterior class probabilities as normal
1 to K fix posterior class probability to 1 for this class
(and so 0 for all other classes)
Note that if phenotype, ID or fix-to-class information is specified in
the structure format for diploid individuals, this
information must be repeated identically on both lines. As an example:
ID1 0 1 2 3 1 1 2
ID2 0 2 2 1 4 1 1
ID3 -1 2 1 1 4 2 1
ID4 2 1 1 4 4 2 2
In this case, the FIX2CLASS keyword will have appeared in the
param file. Individual ID1 and ID2 will
be treated as usual; individual ID3 will be treated as
missing; individual ID4 is assigned to class 2 (so
CLASS must be set to at least 2).
A trivial example
Consider the following datafile, data.raw
ID1 1 1 1 1 1 1 1 1 1 1
ID2 1 1 1 1 1 1 1 1 1 1
ID3 2 2 2 2 2 2 2 2 2 2
ID3 2 2 2 2 2 2 2 2 2 2
ID5 1 2 1 2 1 2 1 2 1 2
ID6 0 0 0 0 0 0 0 0 0 0
A param file might look like
A trivial example
DATAFILE data.raw
CLASS 2
MISSING 0
VERBOSE1
RAND 7263564
To run L-POP on these data and save the results in a file
named results1 type lpop < param > results1.
The results generated are discussed in the next section.
The output generated from the above example will look something like
++ L-POP Version 1.2 -- Shaun Purcell & Pak Sham -- Nov. 2001 ++
O:: Parameter details
O:: title = A trivial example
O:: datafile = data.raw
O:: datafile format = L-POP
O:: number of individuals = 6
O:: number of loci = 5
O:: max number of alleles = 5
O:: missing allele value = 0
O:: X chromosome value = -1
O:: number of phenotypes = 0
O:: organism type = Diploid
O:: fix-to-class = OFF
O:: number of pure classes = 2
O:: number of admix1 classes = 0
O:: admix1 level = 1
O:: number of admix2 classes = 0
O:: total number of classes = 2
O:: group individuals = OFF
O:: number of E-M repeats = 10
O:: E-M tolerance = 0.0001
O:: verbosity level 1 = ON
O:: verbosity level 2 = OFF
O:: verbosity level 3 = OFF
O:: output tag = <none>
O:: random number seed = 7263564
X:: Cls(N) Type P1 P2
C:: 1 Pure 1.0000 0.0000
C:: 2 Pure 0.0000 1.0000
X:: Genotypic configuration information
G:: 1 N=1 1/1 1/1 1/1 1/1 1/1
G:: 2 N=1 1/1 1/1 1/1 1/1 1/1
G:: 3 N=1 2/2 2/2 2/2 2/2 2/2
G:: 4 N=1 2/2 2/2 2/2 2/2 2/2
G:: 5 N=1 1/2 1/2 1/2 1/2 1/2
G:: 6 N=1 0/0 0/0 0/0 0/0 0/0
X:: Allele frequency information
A:: 1 P(0)= 2(0.1667) P(1)= 5(0.4167) P(2)= 5(0.4167)
A:: 2 P(0)= 2(0.1667) P(1)= 5(0.4167) P(2)= 5(0.4167)
A:: 3 P(0)= 2(0.1667) P(1)= 5(0.4167) P(2)= 5(0.4167)
A:: 4 P(0)= 2(0.1667) P(1)= 5(0.4167) P(2)= 5(0.4167)
A:: 5 P(0)= 2(0.1667) P(1)= 5(0.4167) P(2)= 5(0.4167)
X:: Fit statistics
K:: ML 26.8323
K:: N(P) 21
K:: AIC 68.8323
K:: iter 12
X:: Prior latent class probabilities P(C)
P:: 1 0.4000
P:: 2 0.6000
X:: Posterior class probabilities P(C|R)
X:: ID Est(k) h(1|R) h(2|R)
I:: ID1 1 1.0000 0.0000
I:: ID2 1 1.0000 0.0000
I:: ID3 2 0.0000 1.0000
I:: ID4 2 0.0000 1.0000
I:: ID5 2 0.0000 1.0000
I:: ID6 2 0.4000 0.6000
X:: Class-specific allele frequencies P(R|C)
M:: l=1 j=1 p(A=1|j)= 1.0000 p(A=2|j)= 0.0000
M:: l=1 j=2 p(A=1|j)= 0.1667 p(A=2|j)= 0.8333
M:: l=2 j=1 p(A=1|j)= 1.0000 p(A=2|j)= 0.0000
M:: l=2 j=2 p(A=1|j)= 0.1667 p(A=2|j)= 0.8333
M:: l=3 j=1 p(A=1|j)= 1.0000 p(A=2|j)= 0.0000
M:: l=3 j=2 p(A=1|j)= 0.1667 p(A=2|j)= 0.8333
M:: l=4 j=1 p(A=1|j)= 1.0000 p(A=2|j)= 0.0000
M:: l=4 j=2 p(A=1|j)= 0.1667 p(A=2|j)= 0.8333
M:: l=5 j=1 p(A=1|j)= 1.0000 p(A=2|j)= 0.0000
M:: l=5 j=2 p(A=1|j)= 0.1667 p(A=2|j)= 0.8333
D:: Inter-class genetic distance matrix
D:: 1 2
D:: 1 0.0000 0.7692
D:: 2 0.7692 0.0000
L-POP output is designed to be efficiently handled with
commonly-available text processing utilities such as grep
Each line of a results file begins with an identifier and a
job TAG. For example, if TAG is set to 'JobA'
in the param file:
X:JobA: ...information...
M:JobA: ...more information...
Here the M and X characters indicate the
type of information that line contains. The codes are as
follows:
O: Line contains options in effect (set by the
param file)
X: A header line, typically describing the information
in subsequent lines.
C: Class structure information (definitions of
pure and admixed classes)
G: The original genotypic data entered, which can
be useful for spotting formatting errors, etc. (from
VERBOSE1 option)
A: Allele frequencies and missing data report for
each locus, calculated from the entire sample
K: Log-likelihood, AIC, degrees of freedom and number of iterations.
P: Prior class probabilities P(C) for K classes.
k: As K: but for each attempted E-M
convergence (i.e. from these the lowest AIC will be printed in
K: (from VERBOSE2 option)
I: The posterior class probabilities for each
individual, for solution with K classes. Also contains the number of
the class with the highest posterior probability,
Est(k).
M: Class-specific allele frequencies.
D: Matrix of inter-class genetic distance.
The utility grep can be used to easily extract the relevant
results from the output file. For example, to view only the posterior
class probabilities (listed with the I: line prefix) we would type
grep I: results1
if the output file was called results1, which might give the
following
I:: ID1 1 1.0000 0.0000
I:: ID2 1 1.0000 0.0000
I:: ID3 2 0.0000 1.0000
I:: ID4 2 0.0000 1.0000
I:: ID5 2 0.0000 1.0000
I:: ID6 2 0.4000 0.6000
Of course, it would be natural to want to combine these posterior
class probabilities with the rest of the data: the following commands
on a workstation with the standard Unix/Linux text-processing
utilities, would achieve the following:
grep I: results1 | gawk ' { print $4, $5 } ' | paste data.raw - > newdata.raw
where data.raw is the original datafile, and
newdata.raw is created containing also the posterior
probabilities:
ID1 1 1 1 1 1 1 1 1 1 1 1.0000 0.0000
ID2 1 1 1 1 1 1 1 1 1 1 1.0000 0.0000
ID3 2 2 2 2 2 2 2 2 2 2 0.0000 1.0000
ID4 2 2 2 2 2 2 2 2 2 2 0.0000 1.0000
ID5 1 2 1 2 1 2 1 2 1 2 0.0000 1.0000
ID6 0 0 0 0 0 0 0 0 0 0 0.4000 0.6000
To run the same data but assuming only a single class, the following
param file might used
A trivial example : only one class
DATAFILE data.raw
CLASS 1
MISSING 0
VERBOSE1
RAND 7263564
TAG cl1
The TAG will be useful for idenitifying this solution later.
After running L-POP under several different models
(e.g. different K, different admixture patterns, different
Hardy-Weinberg assumptions, etc) it is possible to compare solutions
using the CompSol utility. CompSol is designed to
extract solutions based on best-fit class membershpi (i.e
Est(k) from the posterior probabilities output) and make
comparisons. The TAG option can be used to keep track of the
different models ran. Say for three different models (with 1, 2 and 3
classes respectively, specified in the param files
param1, param2 and param3) the
TAGs are cl1, cl2 and cl3. The
results from these analyses are concatenated in a single file, by
either
lpop < param1 > all_results
lpop < param2 >> all_results
lpop < param3 >> all_results
or something like
lpop < param1 > results1
lpop < param2 > results2
lpop < param3 > results3
cat results* > all_results
The grep utility is particularly useful now: for example, to
compare model-fit between the three solutions,
grep 'AIC' all_results
will give something like
K:cl1: AIC 82.3832
K:cl2: AIC 68.8323
K:cl3: AIC 81.4768
from which one would conclude the cl2 provides the best
fit to the data (i.e. lowest AIC value).
To compare solutions in terms of individual class assignment, the
shell script extractsol followed by compsol.
The command extractsol < all_results
scans the output file containing
multiple model results and creates a simple file containing individual
class assignment information. The actual commands run by the
extractsol are
grep I: - | gawk ' { print $1, $3 } ' | cut -d":" --output-delimiter=" " -f2,3 | grep
-v Est | grep -v class
the end result of which is, for previous example,
cl1 1
cl1 1
cl1 1
cl1 1
cl1 1
cl1 1
cl2 1
cl2 1
cl2 2
cl2 2
cl2 2
cl2 2
cl3 2
cl3 2
cl3 1
cl3 1
cl3 3
cl3 1
i.e. a list for the three models, which class each individual has been
assigned to. Piping this information into compsol by typing
extractsol < all_results | compsol
will produce the comparison of different solutions:
COMPSOL : Compare L-POP solutions
3 solutions and 6 individuals.
Solution "cl1" has 1 classes.
Solution "cl2" has 2 classes.
Solution "cl3" has 3 classes.
cl2 1 2
cl1 ------- -------
1 | 2 4
cl3 2 1 3
cl1 ------- ------- -------
1 | 2 3 1
cl3 2 1 3
cl2 ------- ------- -------
1 | 2 0 0
2 | 0 3 1
RAND coefficients
cl1 cl2 cl3
------- ------- -------
cl1 | 1.0000 0.4667 0.2667
cl2 | 0.4667 1.0000 0.8000
cl3 | 0.2667 0.8000 1.0000
RAND (adj) coefficients
cl1 cl2 cl3
------- ------- -------
cl1 | nan 0.0000 0.0000
cl2 | 0.0000 1.0000 0.5872
cl3 | 0.0000 0.5872 1.0000
The top three tables are contingency tables giving the frequency of
individuals cross-classified by the two solutions. For example, the
third table comparing solutions cl2 and cl3
indicates that class 2 in cl3 is equivalent to class 1 in
cl2: the difference between the two solutions is the
splitting of class 2 in cl2 to produce class 1 and 3 in
cl3.
The RAND coefficients are measure of solution similarity. The adjusted
RAND index is probably the more reliable (note: it is not defined for
solutions with only a single class).
grep, gawk and other useful utilities for MS-DOS are
available on the web for free. In particular, the Cygwin project
(http://www.cygwin.com/) provides a comprehensive Unix-like
environment for MS Windows. Also http://garbo.uwasa.fi/pc/unix.html
may be of use.
ToDo List
- option to fix class prior probs to zero for certain admixture
classes.
- option to restart model with solution from previous.
- calculate Wright's Fst and these other statistics.