L-POP

L-POP : Detection of population substructure

Shaun Purcell & Pak Sham, Jan 2001.
SGDP Research Centre, Institute of Psychiatry, London, UK.

 
Overview | Download | Usage | Input | Output | CompSol | Links

 
Overview
 
L-POP detects population stratification in samples of unrelated individuals for whom a number of unlinked genotypes have been measured.

Using a latent class analysis model, population strata are defined such that unlinked markers are in Hardy-Weinberg equilibrium and linkage equilibrium within classes.

L-POP calculates the posterior probabilities for each individual belonging to each of K classes, which can be used to correct for spurious effects of population stratification in tests of association.
 
Download
 
Download the appropriate L-POP version: Please note: L-POP is a developing program: please contact me to report any bugs or problems.

Usage
 
To run L-POP use the following syntax
lpop < param
where param is a parameter file describing the data and model. To redirect output to a file (recommended) type
lpop < param > results
where results can be any file name. It is often useful to output more than one set of results to the same file: in this case, use the redirect-and-append operator
lpop < param >> results
Input
 
The param file contains information on The first line of a param file is the title of the job. Comments are made # like this # (note: there needs to be a space between the # and adjacent characters).
The basic keywords in a param file are shown in the table below.

COMMAND Argument(s) Default Description
DATAFILE String data.raw Specify the file containing the data
CLASS Positive Integer 2 Set number of ancestral classes
MISSING Integer -9 Specify missing allele code used in datafile

Data format commands

COMMAND Argument(s) Default Description
ALLELE Positive Nonzero Integer 5 Specify the maximum number of alleles at a locus in the datafile
STRUCTURE   OFF Turn ON structure datafile format
HAPLOID   OFF Datafile contains haploid (as opposed to diploid) organisms
PHENO Positive Integer 0 Number of phenotypes in datafile
FIX2CLASS   OFF Turn ON fix-to-class option (datafile must contain a fix-to-class column in this case)
XCHR int -1 Value for missing allele in the case of X chromosome data

Model option commands

COMMAND Argument(s) Default Description
A1AUTO Positive Integer 0 Set number of admixed ancestral classes
A1LEVEL Positive Nonzero Integer 2 Set admixture resolution
A1CLASS Two class numbers {no default} Specify a Type I admixture class between two other classes
A2CLASS Two class numbers {no default} Specify a Type 2 admixture class between two other classes
HWD   OFF Turn ON option to relax HW-E constraint within classes

Output format commands

COMMAND Argument(s) Default Description
TAG String   Set TAG to be used in tracking output
VERBOSE1   OFF Turn ON ouput of genotypic configuration and allele frequencies
VERBOSE2   OFF Turn ON ouput of E-M convergence information
VERBOSE3   OFF Turn ON ouput of E-M iteration information
MIN   OFF Turn ON minimalist ouput
DISTANCE   OFF Turn ON distance analysis mode
ENTROPY   OFF Turn ON output of individual entropy measure
LOCINFO   OFF Turn ON output of locus informativeness
CSLOCINFO   OFF Turn ON output of class-specific locus informativeness

Miscellaneous commands

COMMAND Argument(s) Default Description
RAND Integer 12345678 Set the random number generator seed (to clock time if RAND is 0)
REPEAT Positive Nonzero Integer 10 Number of times to attempt E-M convergence
TOL   0.0001 Set tolerance for E-M convergence

The DATAFILE keyword points to the file containing the genotype data, which can be in one of two formats: L-POP or STRUCTURE. For both formats, each line must conform to the following standard:
ID (Fix-to-class column) (Phenotypes) Genotype data
where the items in parentheses are optional. Diploid data in L-POP format, for two individuals on three loci with no phenotypes:
ID1   1 2   3 1   1 2    
ID2   2 2   1 4   1 1
Same data in structure format:
ID1   1  3  1    
ID1   2  1  2
ID2   2  1  1
ID2   2  4  1 
Haploid data (L-POP (one individual per line) and structure (one chromosome per line) formats are equivalent in this case) for two individuals on three loci with one phenotype:
ID1  2.344   2   2   1     
ID2  1.762   1   2   1 
Values can be separated either by whitespace or tabs. Phenotype values are currently not used by L-POP.

Fix-to-class option

If a fix-to-class column is included in the data (requiring the fix-to-class variable in the param file to be set, the values in the second column have the following meanings:
     -1      ignore this individual (as if all genotype data missing)
     0       estimate posterior class probabilities as normal
     1 to K  fix posterior class probability to 1 for this class
               (and so 0 for all other classes) 
Note that if phenotype, ID or fix-to-class information is specified in the structure format for diploid individuals, this information must be repeated identically on both lines. As an example:
ID1  0  1 2   3 1   1 2    
ID2  0  2 2   1 4   1 1
ID3 -1  2 1   1 4   2 1
ID4  2  1 1   4 4   2 2
In this case, the FIX2CLASS keyword will have appeared in the param file. Individual ID1 and ID2 will be treated as usual; individual ID3 will be treated as missing; individual ID4 is assigned to class 2 (so CLASS must be set to at least 2).

A trivial example

Consider the following datafile, data.raw
ID1  1 1  1 1  1 1  1 1  1 1
ID2  1 1  1 1  1 1  1 1  1 1
ID3  2 2  2 2  2 2  2 2  2 2
ID3  2 2  2 2  2 2  2 2  2 2
ID5  1 2  1 2  1 2  1 2  1 2
ID6  0 0  0 0  0 0  0 0  0 0 
A param file might look like
A trivial example 

DATAFILE data.raw
CLASS  2
MISSING  0
VERBOSE1
RAND 7263564
To run L-POP on these data and save the results in a file named results1 type lpop < param > results1. The results generated are discussed in the next section.
 
Output
 
The output generated from the above example will look something like
++ L-POP Version 1.2 -- Shaun Purcell & Pak Sham -- Nov. 2001 ++ 

O:: Parameter details
O:: title                      = A trivial example 
O:: datafile                   = data.raw
O:: datafile format            = L-POP
O:: number of individuals      = 6
O:: number of loci             = 5
O:: max number of alleles      = 5
O:: missing allele value       = 0
O:: X chromosome value         = -1
O:: number of phenotypes       = 0
O:: organism type              = Diploid
O:: fix-to-class               = OFF
O:: number of pure classes     = 2
O:: number of admix1 classes   = 0
O:: admix1 level               = 1
O:: number of admix2 classes   = 0
O:: total number of classes    = 2
O:: group individuals          = OFF
O:: number of E-M repeats      = 10
O:: E-M tolerance              = 0.0001
O:: verbosity level 1          = ON
O:: verbosity level 2          = OFF
O:: verbosity level 3          = OFF
O:: output tag                 = <none>
O:: random number seed         = 7263564

X::	Cls(N)	Type	P1	P2
C::	1	Pure	1.0000	0.0000
C::	2	Pure	0.0000	1.0000

X:: Genotypic configuration information
G:: 1 N=1   1/1 1/1 1/1 1/1 1/1 
G:: 2 N=1   1/1 1/1 1/1 1/1 1/1 
G:: 3 N=1   2/2 2/2 2/2 2/2 2/2 
G:: 4 N=1   2/2 2/2 2/2 2/2 2/2 
G:: 5 N=1   1/2 1/2 1/2 1/2 1/2 
G:: 6 N=1   0/0 0/0 0/0 0/0 0/0 

X:: Allele frequency information
A:: 1	P(0)= 2(0.1667)	P(1)= 5(0.4167)	P(2)= 5(0.4167)	
A:: 2	P(0)= 2(0.1667)	P(1)= 5(0.4167)	P(2)= 5(0.4167)	
A:: 3	P(0)= 2(0.1667)	P(1)= 5(0.4167)	P(2)= 5(0.4167)	
A:: 4	P(0)= 2(0.1667)	P(1)= 5(0.4167)	P(2)= 5(0.4167)	
A:: 5	P(0)= 2(0.1667)	P(1)= 5(0.4167)	P(2)= 5(0.4167)	


X:: Fit statistics
K::	ML	26.8323	
K::	N(P)	21	
K::	AIC	68.8323	
K::	iter	12	

X:: Prior latent class probabilities P(C)
P::	1	0.4000
P::	2	0.6000

X:: Posterior class probabilities P(C|R)
X::	ID	Est(k)	h(1|R)	h(2|R)
I::	ID1	1	1.0000	0.0000
I::	ID2	1	1.0000	0.0000
I::	ID3	2	0.0000	1.0000
I::	ID4	2	0.0000	1.0000
I::	ID5	2	0.0000	1.0000
I::	ID6	2	0.4000	0.6000

X:: Class-specific allele frequencies P(R|C)
M::	l=1	j=1	p(A=1|j)= 1.0000	p(A=2|j)= 0.0000
M::	l=1	j=2	p(A=1|j)= 0.1667	p(A=2|j)= 0.8333

M::	l=2	j=1	p(A=1|j)= 1.0000	p(A=2|j)= 0.0000
M::	l=2	j=2	p(A=1|j)= 0.1667	p(A=2|j)= 0.8333

M::	l=3	j=1	p(A=1|j)= 1.0000	p(A=2|j)= 0.0000
M::	l=3	j=2	p(A=1|j)= 0.1667	p(A=2|j)= 0.8333

M::	l=4	j=1	p(A=1|j)= 1.0000	p(A=2|j)= 0.0000
M::	l=4	j=2	p(A=1|j)= 0.1667	p(A=2|j)= 0.8333

M::	l=5	j=1	p(A=1|j)= 1.0000	p(A=2|j)= 0.0000
M::	l=5	j=2	p(A=1|j)= 0.1667	p(A=2|j)= 0.8333


D:: Inter-class genetic distance matrix
D::		1	2
D::	1	0.0000	0.7692
D::	2	0.7692	0.0000
L-POP output is designed to be efficiently handled with commonly-available text processing utilities such as grep Each line of a results file begins with an identifier and a job TAG. For example, if TAG is set to 'JobA' in the param file:
X:JobA: ...information...
M:JobA: ...more information...
Here the M and X characters indicate the type of information that line contains. The codes are as follows:

O: Line contains options in effect (set by the param file)
X: A header line, typically describing the information in subsequent lines.
C: Class structure information (definitions of pure and admixed classes)
G: The original genotypic data entered, which can be useful for spotting formatting errors, etc. (from VERBOSE1 option)
A: Allele frequencies and missing data report for each locus, calculated from the entire sample
K: Log-likelihood, AIC, degrees of freedom and number of iterations.
P: Prior class probabilities P(C) for K classes.
k: As K: but for each attempted E-M convergence (i.e. from these the lowest AIC will be printed in K: (from VERBOSE2 option)
I: The posterior class probabilities for each individual, for solution with K classes. Also contains the number of the class with the highest posterior probability, Est(k).
M: Class-specific allele frequencies.
D: Matrix of inter-class genetic distance. The utility grep can be used to easily extract the relevant results from the output file. For example, to view only the posterior class probabilities (listed with the I: line prefix) we would type
grep I: results1
if the output file was called results1, which might give the following
I::	ID1	1	1.0000	0.0000
I::	ID2	1	1.0000	0.0000
I::	ID3	2	0.0000	1.0000
I::	ID4	2	0.0000	1.0000
I::	ID5	2	0.0000	1.0000
I::	ID6	2	0.4000	0.6000
Of course, it would be natural to want to combine these posterior class probabilities with the rest of the data: the following commands on a workstation with the standard Unix/Linux text-processing utilities, would achieve the following:
grep I: results1 | gawk ' { print $4, $5 } ' | paste data.raw - > newdata.raw
where data.raw is the original datafile, and newdata.raw is created containing also the posterior probabilities:
ID1  1 1  1 1  1 1  1 1  1 1	1.0000 0.0000
ID2  1 1  1 1  1 1  1 1  1 1	1.0000 0.0000
ID3  2 2  2 2  2 2  2 2  2 2	0.0000 1.0000
ID4  2 2  2 2  2 2  2 2  2 2	0.0000 1.0000
ID5  1 2  1 2  1 2  1 2  1 2	0.0000 1.0000
ID6  0 0  0 0  0 0  0 0  0 0	0.4000 0.6000
To run the same data but assuming only a single class, the following param file might used
A trivial example : only one class

DATAFILE data.raw
CLASS  1
MISSING  0
VERBOSE1
RAND 7263564
TAG cl1
The TAG will be useful for idenitifying this solution later.
 
CompSol
 
After running L-POP under several different models (e.g. different K, different admixture patterns, different Hardy-Weinberg assumptions, etc) it is possible to compare solutions using the CompSol utility. CompSol is designed to extract solutions based on best-fit class membershpi (i.e Est(k) from the posterior probabilities output) and make comparisons. The TAG option can be used to keep track of the different models ran. Say for three different models (with 1, 2 and 3 classes respectively, specified in the param files param1, param2 and param3) the TAGs are cl1, cl2 and cl3. The results from these analyses are concatenated in a single file, by either
lpop < param1 > all_results
lpop < param2 >> all_results
lpop < param3 >> all_results
or something like
lpop < param1 > results1
lpop < param2 > results2
lpop < param3 > results3
cat results* > all_results
The grep utility is particularly useful now: for example, to compare model-fit between the three solutions,
grep 'AIC' all_results
will give something like
K:cl1:	AIC	82.3832	
K:cl2:	AIC	68.8323	
K:cl3:	AIC	81.4768	
from which one would conclude the cl2 provides the best fit to the data (i.e. lowest AIC value).

To compare solutions in terms of individual class assignment, the shell script extractsol followed by compsol. The command
extractsol < all_results
scans the output file containing multiple model results and creates a simple file containing individual class assignment information. The actual commands run by the extractsol are
grep I: - | gawk ' { print $1, $3 }  ' | cut -d":" --output-delimiter=" " -f2,3 | grep 
 -v Est | grep -v class 
the end result of which is, for previous example,
cl1  1
cl1  1
cl1  1
cl1  1
cl1  1
cl1  1
cl2  1
cl2  1
cl2  2
cl2  2
cl2  2
cl2  2
cl3  2
cl3  2
cl3  1
cl3  1
cl3  3
cl3  1
i.e. a list for the three models, which class each individual has been assigned to. Piping this information into compsol by typing
extractsol < all_results | compsol 
will produce the comparison of different solutions:
COMPSOL : Compare L-POP solutions 

3 solutions and 6 individuals.
	Solution "cl1" has 1 classes.
	Solution "cl2" has 2 classes.
	Solution "cl3" has 3 classes.

cl2	1	2	
cl1	-------	-------	
1 |	2	4

cl3	2	1	3	
cl1	-------	-------	-------	
1 |	2	3	1

cl3	2	1	3	
cl2	-------	-------	-------	
1 |	2	0	0
2 |	0	3	1

RAND coefficients
	cl1	cl2	cl3	
	-------	-------	-------	
cl1 |	1.0000	0.4667	0.2667
cl2 |	0.4667	1.0000	0.8000
cl3 |	0.2667	0.8000	1.0000

RAND (adj) coefficients
	cl1	cl2	cl3	
	-------	-------	-------	
cl1 |	nan	0.0000	0.0000
cl2 |	0.0000	1.0000	0.5872
cl3 |	0.0000	0.5872	1.0000
The top three tables are contingency tables giving the frequency of individuals cross-classified by the two solutions. For example, the third table comparing solutions cl2 and cl3 indicates that class 2 in cl3 is equivalent to class 1 in cl2: the difference between the two solutions is the splitting of class 2 in cl2 to produce class 1 and 3 in cl3.

The RAND coefficients are measure of solution similarity. The adjusted RAND index is probably the more reliable (note: it is not defined for solutions with only a single class).
 
Links & Misc.
 
grep, gawk and other useful utilities for MS-DOS are available on the web for free. In particular, the Cygwin project (http://www.cygwin.com/) provides a comprehensive Unix-like environment for MS Windows. Also http://garbo.uwasa.fi/pc/unix.html may be of use.

ToDo List