PLINK: Whole genome data analysis toolset
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset

Basic usage / data formats

PLINK is a command line program written in C/C++. All commands involve typing plink at the command prompt (e.g. DOS window or Unix terminal) followed by a number of options (all starting with --option) to specify the data files / methods to be used. All results are written to files with various extensions. The name of the file is by default plink.ext where .ext will change depending on the content of the file. Often these files will be large: using a package such as R is suggested for visualising and tabulating output. The majority of output files are in a standard plain text 'rectangular' format, with one header row and a fixed number of columns per line. A complete list of all options and output file types is given in the reference section.

Running PLINK

PLINK is a command-line program: clicking on an icon will get you nowhere: please consult these notes on downloading and installing PLINK. Open up a command prompt or terminal window and perform all analyses by typing commands as described below.
plink --file mydata

where we expect two files: in this case, mydata.ped and mydata.map.

When PLINK starts it will attempt to contact the web, to check whether there is a more up-to-date version available or not. After checking, PLINK writes a file called .pversion to the working directory and uses this cached information for the rest of the day. This option can be disabled with the --noweb option on the command line. When using PLINK on a machine with no, or a very slow, web connection, it may be desirable to turn this feature off. This feature is turned on by default so that users are aware of new versions that may contain important new features or bug fixes. If your current version of PLINK is out of date, then a warning message will be displayed, suggesting that you download and install the current version. (This is the only reason the web connection is made -- no other data is transmitted to the server.) If the current version is up-to-date, you will see something like the following:
     Web-based version check ( --noweb to skip )
     Connecting to web...  OK, v1.04 is current
whereas, if the current version is not up-to-date, you will see something like the following:
     Web-based version check ( --noweb to skip )
     Connecting to web...
 
               *** UPDATE REQUIRED ***
 
             This version        : 1.03
             Most recent version : 1.04
 
     Please upgrade your version of PLINK as soon as possible!
       (visit the above website for free download)
 
     Old versions of PLINK (<1.04) contain bugs fixed in 1.04
The web-based version check will also produce a warning if a command used was found to have an issue discovered since that version was released (the warning will contain a link to a web page describing the issue).

To re-run a previous job, use the --rerun option, which takes a PLINK LOG file as the parameter. This option will scan the LOG file, extract the previous PLINK commands and re-execute them. If new commands are added to the command line, they will also be included; if the command also appeared in the original file, any parameters will be taken from the newer version. For example, if the original command was
plink --file mydata --pheno pheno.raw --assoc --maf 0.05 --out run1

then the command
plink --rerun run1.log --maf 0.1

would repeat the analysis but with the new minor allele frequency threshold of 0.1, not 0.05. Note that commands in the old LOG file can be overwritten but not removed with the rerun command.

Note By default, the --out statement would also be copied, and so the new output would overwrite any old results (i.e. with the run1 fileroot). It is often a good idea to also add a new --out command, therefore:
plink --rerun run1.log --maf 0.1 --out run2

For very long and complex commands, --rerun can save typing and help reduce mistakes.
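The override rule for --rerun can be illustrated with a short Python sketch. This is not how PLINK itself is implemented: parse_options and merge_options are hypothetical helpers, and the sketch assumes that every option starts with -- and is followed by zero or more parameters.

```python
def parse_options(tokens):
    """Group a flat token list into {option: [parameters]}, keeping order."""
    options = {}
    current = None
    for tok in tokens:
        if tok.startswith("--"):
            current = tok
            options[current] = []
        elif current is not None:
            options[current].append(tok)
    return options


def merge_options(old_tokens, new_tokens):
    """New options replace same-named old ones; old options are never removed."""
    merged = parse_options(old_tokens)
    merged.update(parse_options(new_tokens))
    return merged


old = "--file mydata --pheno pheno.raw --assoc --maf 0.05 --out run1".split()
new = "--maf 0.1".split()
merged = merge_options(old, new)
print(merged["--maf"])   # ['0.1']
print(merged["--out"])   # ['run1']
```

The old --out value survives the merge, which is why adding a new --out to the rerun command is usually a good idea.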

HINT MS-DOS only allows command lines to be 127 characters in length -- sometimes, PLINK command lines can grow longer than this. In this case, use the --script option, where the remaining options will be read from a text file. For example,
plink --script myscript1.txt

where the file myscript1.txt is a plain text file containing
--ped ..\data\version1\50K\allsamples.ped 
--map ..\data\allmapfiles\finalversion\autosomal.map 
--out ..\results\working\sample-missingness-v1.22
--from rs66537222
--to rs8837323 
--geno 0.25
--maf 0.02
--missing

would be the same as typing all these options in at the command line (note that the commands do not need to be all on the same line now). Another advantage of using script files is that it aids attempts at making one's research reproducible.

PED files

As well as the --file command described above, PED and MAP files can be specified separately, if they have different names:
plink --ped mydata.ped --map autosomal.map

 

Note Loading a large file (100K+ SNPs) can take a while (which is why we suggest converting to binary format). PLINK will give an error message in most circumstances when something has gone wrong.

The PED file is a white-space (space or tab) delimited file: the first six columns are mandatory:
     Family ID
     Individual ID
     Paternal ID
     Maternal ID
     Sex (1=male; 2=female; other=unknown)
     Phenotype
The IDs are alphanumeric: the combination of family and individual ID should uniquely identify a person. A PED file must have 1 and only 1 phenotype in the sixth column. The phenotype can be either a quantitative trait or an affection status column: PLINK will automatically detect which type (i.e. based on whether a value other than 0, 1, 2 or the missing phenotype code is observed).
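As an illustration of the column layout and of the phenotype detection rule, here is a minimal Python sketch. The function names are hypothetical, and the sketch assumes the default missing phenotype code of -9; it is not PLINK's own parser.

```python
def parse_ped_line(line):
    """Split one PED line into the six mandatory fields plus allele tokens."""
    fields = line.split()          # any run of spaces or tabs is a delimiter
    fid, iid, pat, mat, sex, pheno = fields[:6]
    return {"fid": fid, "iid": iid, "pat": pat, "mat": mat,
            "sex": sex, "pheno": pheno, "alleles": fields[6:]}


def phenotype_type(values, missing="-9"):
    """'case/control' if only 0, 1, 2 or the missing code occur, else 'quantitative'."""
    allowed = {"0", "1", "2", missing}
    return "case/control" if all(v in allowed for v in values) else "quantitative"


rows = [parse_ped_line(line) for line in (
    "FAM001  1  0 0  1  2  A A  G G  A C",
    "FAM001  2  0 0  1  2  A A  A G  0 0",
)]
print(phenotype_type([r["pheno"] for r in rows]))   # case/control
print(phenotype_type(["2.394", "-9", "1.7"]))       # quantitative
```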

NOTE Quantitative traits with decimal points must be coded with a period/full-stop character and not a comma, i.e. 2.394 not 2,394

If an individual's sex is unknown, then any character other than 1 or 2 can be used. When new files are created (PED, FAM, or other which contain sex) then the original coding will be preserved. However, these individuals will be dropped from any analyses (i.e. phenotype set to missing also) and an error message will arise if an analysis that uses family information is requested and an individual of 'unknown' sex is specified as a father or mother.

HINT To disable the automatic setting of the phenotype to missing if the individual has an ambiguous sex code, add the --allow-no-sex option. When using a data generation command (e.g. --make-bed, --recode, etc) as opposed to an analysis command, then by default the phenotype is not set to missing if sex is missing. This behaviour can be changed by adding the flag --must-have-sex.

HINT You can add a comment to a PED or MAP file by starting the line with a # character. The rest of that line will be ignored. Therefore, do not start any family ID with this character.

Affection status, by default, should be coded:
    -9 missing 
     0 missing
     1 unaffected
     2 affected
If your file is coded 0/1 to represent unaffected/affected, then use the --1 flag:
plink --file mydata --1

which will specify a disease phenotype coded:
     -9 missing
      0 unaffected
      1 affected
The missing phenotype value for quantitative traits is, by default, -9 (this can also be used for disease traits as well as 0). It can be reset by including the --missing-phenotype option:
plink --file mydata --missing-phenotype 99

Other phenotypes can be swapped in by using the --pheno (and possibly --mpheno) option, which specifies that an alternate phenotype is to be used, as described below.

Genotypes (column 7 onwards) should also be white-space delimited; they can be any character (e.g. 1,2,3,4 or A,C,G,T or anything else) except 0 which is, by default, the missing genotype character. All markers should be biallelic. All SNPs (whether haploid or not) must have two alleles specified. Either both alleles should be missing (i.e. 0) or neither. No header row should be given. For example, here are two individuals typed for 3 SNPs (one row = one person):
     FAM001  1  0 0  1  2  A A  G G  A C 
     FAM001  2  0 0  1  2  A A  A G  0 0 
     ...
The default missing genotype character can be changed with the --missing-genotype option, for example:
plink --file mydata --missing-genotype N

NOTE Different values to the missing phenotype or genotype code can be specified for output datasets created, with --output-missing-phenotype and --output-missing-genotype.
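The rule that either both alleles of a genotype are missing or neither is can be checked with a few lines of Python. This is only a sketch: genotype_pairs is a hypothetical helper, and it takes the allele tokens from column 7 onwards of a PED line.

```python
def genotype_pairs(alleles, missing="0"):
    """Pair up allele tokens (two per SNP) and reject half-missing genotypes."""
    if len(alleles) % 2 != 0:
        raise ValueError("odd number of allele tokens")
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    for index, (a1, a2) in enumerate(pairs, start=1):
        if (a1 == missing) != (a2 == missing):
            raise ValueError(f"SNP {index}: only one allele is missing")
    return pairs


print(genotype_pairs("A A  A G  0 0".split()))
# [('A', 'A'), ('A', 'G'), ('0', '0')]
```

Passing missing="N" mirrors the effect of --missing-genotype N on which token is treated as missing.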

Different PED file formats: missing fields
Sometimes data arrive in a number of different formats: for example, where the genotype information just has a single ID column followed by all the SNP data, with the other family and phenotype information residing in a separate file. Rather than have to recreate new files, it is sometimes possible to read in such files directly. The standard behavior of PLINK when reading a PED file with --file or --ped can be modified to allow for the fact that one or more of the normally obligatory 6 fields are missing:
--no-fid

indicates there is no Family ID column: here the first field is taken to be individual ID, and the family ID is automatically set to be the same as the individual ID (i.e. obviously, all individuals would be treated as unrelated). In other files that require family and individual ID (e.g. alternate phenotype file and cluster files, for which this flag has no effect), the individual ID would need to be entered also as the family ID therefore.
--no-parents

indicates that there are no paternal and maternal ID codes; all individuals would be assumed to be founders in this case
--no-sex

indicates that there is no sex field; all individuals are set to have a missing sex code (which also sets that individual's phenotype to missing unless the --allow-no-sex option is also used)
--no-pheno

indicates that there is no phenotype field; all individuals are set to missing unless an alternate phenotype file is specified.

It is possible to use these flags together, so using all of them would specify the most simple kind of file mentioned above: a single, unique ID code followed by all genotype data.

IMPORTANT These options only work for the basic PED file (i.e. specified by --file or --ped). They do not work for transposed files, when merging in a file with --merge, or with binary filesets or covariate, cluster or alternate phenotype files.

If the genotype codes in a PED file are in the form AG rather than A G, for example, such that every genotype is exactly two characters long, then the flag
./plink --file mydata --compound-genotypes

can be added. Note that this only works for input for PED files (not TPED or LGEN files, and not for any output options, e.g. --recode, etc).

Note To load the PED file from the standard input stream instead of a file, use the - symbol as the file name, e.g.
perl retrieve_data.pl | ./plink --ped - --map mymap.map --make-bed

The MAP file still needs to be a normal file; this currently only works for --ped files.

MAP files

By default, each line of the MAP file describes a single marker and must contain exactly 4 columns:
     chromosome (1-22, X, Y or 0 if unplaced)
     rs# or snp identifier
     Genetic distance (morgans)
     Base-pair position (bp units)
Genetic distance can be specified in centimorgans with the --cm flag. Alternatively, you can use a MAP file with the genetic distance excluded by adding the flag --map3, i.e.
plink --file mydata --map3

In this case, the three columns are expected to be
     chromosome (1-22, X, Y or 0 if unplaced)
     rs# or snp identifier
     Base-pair position (bp units)
Base-pair positions are expected to correspond to positive integers within the range of typical human chromosome sizes.

Note Most analyses do not require a genetic map to be specified in any case; specifying a genetic (cM) map is most crucial for a set of analyses that look for shared segments between individuals. For basic association testing, the genetic distance column can be set at 0.

SNP identifiers can contain any characters except spaces or tabs; you should also avoid * symbols in names.

To exclude a SNP from analysis, set the 4th column (physical base-pair position) to any negative value (this will only work for MAP files, not for binary BIM files).

     1  rs123456  0  1234555
     1  rs234567  0  1237793
     1  rs224534  0  -1237697        <-- exclude this SNP
     1  rs233556  0  1337456
     ...
The MAP file must therefore contain as many markers as are in the PED file. The markers in the PED file do not need to be in genomic order, but the order of the markers in the MAP file must match the order of the markers in the PED file.
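A minimal Python sketch of reading a MAP file, including the 3-column variant and the negative-position exclusion rule, might look as follows. read_map is a hypothetical helper written for illustration and makes no claim to match PLINK's internal handling.

```python
def read_map(lines, map3=False):
    """Parse MAP lines; keep[i] is False when the base-pair position is negative."""
    markers, keep = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if map3:
            chrom, snp, bp = fields       # genetic distance column omitted
            dist = "0"
        else:
            chrom, snp, dist, bp = fields
        markers.append((chrom, snp, float(dist), int(bp)))
        keep.append(int(bp) >= 0)
    return markers, keep


map_lines = [
    "1  rs123456  0  1234555",
    "1  rs234567  0  1237793",
    "1  rs224534  0  -1237697",
    "1  rs233556  0  1337456",
]
markers, keep = read_map(map_lines)
print([m[1] for m, k in zip(markers, keep) if k])
# ['rs123456', 'rs234567', 'rs233556']
```

Excluded markers stay in the list, because the PED file still carries a genotype column pair for each of them.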
Chromosome codes
The autosomes should be coded 1 through 22. The following other codes can be used to specify other chromosome types:
     X    X chromosome                    -> 23
     Y    Y chromosome                    -> 24
     XY   Pseudo-autosomal region of X    -> 25
     MT   Mitochondrial                   -> 26
The numbers on the right represent PLINK's internal numeric coding of these chromosomes: these will appear in all output rather than the original chromosome codes.

For haploid chromosomes, genotypes should be specified as homozygotes: for most analyses, PLINK will treat these appropriately. For example, consider the following example PED file, containing two males (1 and 2) and two females (3 and 4):
     1 1 0 0 1   1   A A    A A    A A    A A    A A
     2 1 0 0 1   1   A C    A C    A C    A C    A C
     3 1 0 0 2   1   A A    A A    A A    A A    A A
     4 1 0 0 2   1   A C    A C    A C    A C    A C
and MAP file
     1    snp1   0   1000
     X    snp2   0   1000
     Y    snp3   0   1000
     XY   snp4   0   1000
     MT   snp5   0   1000
Generating frequencies for these SNPs,
plink --file test --freq

we see plink.frq is
      CHR          SNP   A1   A2          MAF       NM
        1         snp1    C    A         0.25        8
       23         snp2    C    A          0.2        5
       24         snp3    C    A            0        1
       25         snp4    C    A         0.25        8
       26         snp5    C    A            0        2
There are several things to note. First, the numeric chromosome codes are used in the output to represent X, Y, XY and MT. Second, haploid chromosomes are only counted once (i.e. male X and Y chromosome SNPs and all MT SNPs). Third, several genotypes have been set to missing if they are not valid (female Y genotype, heterozygous haploid chromosome). The NM field represents the number of non-missing alleles for each SNP -- this is because invalid genotypes are automatically set to missing.
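The counting rules behind this table can be reproduced with a short Python sketch. It is an illustration only: counted_alleles and frequency_table are hypothetical names, individuals of unknown sex are not handled specially, and the code is not taken from PLINK.

```python
CHROM_CODES = {"X": 23, "Y": 24, "XY": 25, "MT": 26}


def counted_alleles(chrom, sex, a1, a2):
    """Alleles of one genotype that contribute to the frequency count."""
    if a1 == "0" or a2 == "0":
        return []                          # missing genotype
    if chrom == "Y" and sex != "1":
        return []                          # Y genotype in a non-male: invalid
    haploid = chrom == "MT" or (chrom in ("X", "Y") and sex == "1")
    if haploid:
        # Haploid genotypes count once; heterozygous calls are invalid.
        return [a1] if a1 == a2 else []
    return [a1, a2]


def frequency_table(ped_lines, chroms):
    """Return one (numeric chromosome, MAF, NM) tuple per SNP."""
    table = []
    for j, chrom in enumerate(chroms):
        alleles = []
        for line in ped_lines:
            f = line.split()
            alleles += counted_alleles(chrom, f[4], f[6 + 2 * j], f[7 + 2 * j])
        nm = len(alleles)
        counts = sorted(alleles.count(a) for a in set(alleles))
        minor = counts[0] if len(counts) > 1 else 0
        maf = minor / nm if nm else float("nan")
        code = CHROM_CODES[chrom] if chrom in CHROM_CODES else int(chrom)
        table.append((code, maf, nm))
    return table


ped = [
    "1 1 0 0 1 1 A A A A A A A A A A",
    "2 1 0 0 1 1 A C A C A C A C A C",
    "3 1 0 0 2 1 A A A A A A A A A A",
    "4 1 0 0 2 1 A C A C A C A C A C",
]
freq = frequency_table(ped, ["1", "X", "Y", "XY", "MT"])
for row in freq:
    print(row)
# (1, 0.25, 8)
# (23, 0.2, 5)
# (24, 0.0, 1)
# (25, 0.25, 8)
# (26, 0.0, 2)
```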

We can see which genotypes have been set to missing by running the --recode command; however, usually PLINK preserves all genotypes when generating a new file (i.e. if one is just reformatting a file, say from text to binary format, it is not necessarily desirable to change any of the content; as above, summary statistic and analysis commands do set these genotypes missing automatically still). However, if we also add the --set-hh-missing flag, any invalid genotypes will be set to missing in the new file:
plink --file test --recode --set-hh-missing

which creates the new PED file plink.recode.ped
     1 1 0 0 1 1 A A A A A A A A A A
     2 1 0 0 1 1 C A 0 0 0 0 C A 0 0
     3 1 0 0 2 1 A A A A 0 0 A A A A
     4 1 0 0 2 1 C A C A 0 0 C A 0 0
In other words, the actual alleles that PLINK pays attention to are enclosed in square brackets below; all other alleles are ignored.
     1 1 0 0 1   1   [A A]    [A] A    [A] A    [A A]    [A] A
     2 1 0 0 1   1   [A C]     A C      A C     [A C]     A C
     3 1 0 0 2   1   [A A]    [A A]     A A     [A A]    [A] A
     4 1 0 0 2   1   [A C]    [A C]     A C     [A C]     A C
Allele codes
By default, the minor allele is coded A1 and the major allele is coded A2 (this is used in many output files, e.g. from --freq or --assoc). By default this is based on all founders (unless --nonfounders is added) with sex-codes specified (unless --allow-no-sex is added). This coding is applied after any other filters have been applied. It is sometimes desirable to prevent this automatic flipping of A1 and A2 alleles, by use of the --keep-allele-order option. For example, if one wishes to dump the genotype counts by use of the --model command, for two groups of individuals (using the --filter command), this ensures that the same A1 allele will always be used in pop-1-genotypes.model as in pop-2-genotypes.model (which can facilitate downstream processing of these files, for instance).
plink --bfile mydata --filter pop.dat POP1 --model --keep-allele-order --out pop-1-genotypes
plink --bfile mydata --filter pop.dat POP2 --model --keep-allele-order --out pop-2-genotypes
That is, for any SNP that happens to have a different minor allele in POP1 versus POP2, the output in the two .model files will still line up in an easy manner.

Transposed filesets

Another possible file format is the transposed fileset, which contains two text files: one (TPED) containing SNP and genotype information, where one row is a SNP; one (TFAM) containing individual and family information, where one row is an individual.

The first 4 columns of a TPED file are the same as a standard 4-column MAP file. Then all genotypes are listed for all individuals for each particular SNP on each line. The TFAM file is just the first six columns of a standard PED file. In other words, we have just taken the standard PED/MAP file format, but swapped all the genotype information between files, after rotating it 90 degrees. For example, the PED/MAP fileset
     <---- normal.ped ---->                  <--- normal.map --->
     1 1 0 0 1  1  A A  G T                  1  snp1   0  5000650
     2 1 0 0 1  1  A C  T G                  1  snp2   0  5000830
     3 1 0 0 1  1  C C  G G
     4 1 0 0 1  2  A C  T T
     5 1 0 0 1  2  C C  G T
     6 1 0 0 1  2  C C  T T
would be represented as TPED/TFAM files:
     <------------- trans.tped ------------->      <- trans.tfam ->
     1 snp1 0 5000650 A A A C C C A C C C C C      1  1  0  0  1  1
     1 snp2 0 5000830 G T G T G G T T G T T T      2  1  0  0  1  1
                                                   3  1  0  0  1  1
                                                   4  1  0  0  1  2
                                                   5  1  0  0  1  2
                                                   6  1  0  0  1  2
This kind of format can be convenient to work with when there are very many more SNPs than individuals (i.e. WGAS data). In this case, the TPED file will be very long (as opposed to the PED file being very wide).
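The relationship between the two layouts can be sketched in Python as a plain transposition. ped_to_tped is a hypothetical helper, not the code behind PLINK's --transpose option. It keeps the allele order found in the PED file, so the second person's snp2 genotype comes out as T G rather than G T; both describe the same genotype.

```python
def ped_to_tped(ped_lines, map_lines):
    """Turn PED/MAP lines into TPED lines (one per SNP) and TFAM lines."""
    people = [line.split() for line in ped_lines]
    tfam = [" ".join(p[:6]) for p in people]
    tped = []
    for j, map_line in enumerate(map_lines):
        genotypes = []
        for p in people:
            genotypes += p[6 + 2 * j: 8 + 2 * j]   # the two alleles for SNP j
        tped.append(" ".join(map_line.split() + genotypes))
    return tped, tfam


normal_ped = [
    "1 1 0 0 1  1  A A  G T",
    "2 1 0 0 1  1  A C  T G",
    "3 1 0 0 1  1  C C  G G",
    "4 1 0 0 1  2  A C  T T",
    "5 1 0 0 1  2  C C  G T",
    "6 1 0 0 1  2  C C  T T",
]
normal_map = ["1  snp1   0  5000650", "1  snp2   0  5000830"]
tped, tfam = ped_to_tped(normal_ped, normal_map)
print(tped[0])   # 1 snp1 0 5000650 A A A C C C A C C C C C
print(tfam[3])   # 4 1 0 0 1 2
```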

To read a transposed fileset, use the command
plink --tfile mydata

which implies mydata.tped and mydata.tfam exist; alternatively, if the files are differently named, they can be individually, fully specified:
plink --tped mydata.tped --tfam pedinfo.txt

HINT You can generate transposed filesets with the --transpose option, described in the data management section

Long-format filesets

Another possible file format is the long-format fileset, which contains three text files:
  • a LGEN file containing genotypes (5 columns, one row per genotype)
  • a MAP file containing SNPs (4 columns, one row per SNP)
  • a FAM file containing individuals (6 columns, one row per person)
The MAP and FAM/PED files are described elsewhere on this page. Consider the following example: A MAP file test.map
     1 snp2 0 2
     2 snp4 0 4
     1 snp1 0 1
     1 snp3 0 3
     5 snp5 0 1
as described above. A FAM file test.fam
     1 1 0 0 1 2
     2 1 0 0 2 2
     2 2 0 0 1 1
     9 1 1 2 0 0
as described below. Finally, an LGEN file, test.lgen
     1 1 snp1 A A
     1 1 snp2 A C
     1 1 snp3 0 0
     2 1 snp1 A A
     2 1 snp2 A C
     2 1 snp3 0 0
     2 1 snp4 A A
     2 2 snp1 A A
     2 2 snp2 A C
     2 2 snp3 0 0
     2 2 snp4 A A
The columns in the LGEN file are
     family ID
     individual ID
     snp ID
     allele 1 of this genotype
     allele 2 of this genotype
Not all entries need to be present in the LGEN file (e.g. snp5, person 9/1, or snp4 for person 1/1). These genotypes will be set to missing internally. The order also need not be the same in the LGEN file as for the MAP or FAM files. If a genotype is listed more than once, the final version of it will be used.

An LGEN file can be reformatted as a standard PED file using the following command:
plink --lfile test --recode

which creates these two files: a PED file, plink.recode.ped
     1 1 0 0 1  2   A A  A C  0 0  0 0  0 0
     2 1 0 0 2  2   A A  A C  0 0  A A  0 0
     2 2 0 0 1  1   A A  A C  0 0  A A  0 0
     9 1 1 2 0  0   0 0  0 0  0 0  0 0  0 0
and the MAP file, plink.recode.map (note: it has been put in genomic order)
     1       snp1    0       1
     1       snp2    0       2
     1       snp3    0       3
     2       snp4    0       4
     5       snp5    0       1
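The conversion can be mimicked with a small Python sketch, which may help when checking an LGEN file by hand. lgen_to_ped is a hypothetical helper, and it assumes numeric chromosome codes so that markers can be sorted into genomic order; it is not PLINK's implementation.

```python
def lgen_to_ped(fam_lines, map_lines, lgen_lines, missing="0"):
    """Build PED lines from long-format input; unlisted genotypes are missing."""
    markers = [m.split() for m in map_lines]
    markers.sort(key=lambda m: (int(m[0]), int(m[3])))   # chromosome, then position
    calls = {}
    for line in lgen_lines:
        fid, iid, snp, a1, a2 = line.split()
        calls[(fid, iid, snp)] = (a1, a2)    # a later duplicate replaces an earlier one
    ped = []
    for line in fam_lines:
        person = line.split()
        row = person[:]
        for m in markers:
            row += calls.get((person[0], person[1], m[1]), (missing, missing))
        ped.append(" ".join(row))
    return ped, markers


test_map = ["1 snp2 0 2", "2 snp4 0 4", "1 snp1 0 1", "1 snp3 0 3", "5 snp5 0 1"]
test_fam = ["1 1 0 0 1 2", "2 1 0 0 2 2", "2 2 0 0 1 1", "9 1 1 2 0 0"]
test_lgen = [
    "1 1 snp1 A A", "1 1 snp2 A C", "1 1 snp3 0 0",
    "2 1 snp1 A A", "2 1 snp2 A C", "2 1 snp3 0 0", "2 1 snp4 A A",
    "2 2 snp1 A A", "2 2 snp2 A C", "2 2 snp3 0 0", "2 2 snp4 A A",
]
ped_out, map_out = lgen_to_ped(test_fam, test_map, test_lgen)
print(ped_out[0])                  # 1 1 0 0 1 2 A A A C 0 0 0 0 0 0
print([m[1] for m in map_out])     # ['snp1', 'snp2', 'snp3', 'snp4', 'snp5']
```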

NOTE All individuals must be uniquely identified by the combination of the family and individual IDs.

To read a long-format fileset, use the command
plink --lfile mydata

which implies mydata.lgen, mydata.map and mydata.fam exist.

NOTE Currently, you cannot output a fileset in this format in PLINK.
Additional options for long-format files
If the LGEN file has specific allele codes, but as TG instead of T G (i.e. no spaces between the two alleles), add the flag
     --compound-genotypes
It is possible to specify the reference allele with the --reference command when using long-format file input. This might be appropriate, for example, if the data file contains calls for rare variants from a resequencing study. In this case, the majority of alleles will be the reference, and so need not be repeated here. For example, consider this FAM file f1.fam
    1 1 0 0 1 1
    2 1 0 0 1 1
    3 1 0 0 1 1
    4 1 0 0 1 1
    5 1 0 0 1 1
    6 1 0 0 1 1
and MAP file f1.map
    1       rs0001    0       1000001
    1       rs0002    0       1000002
    1       rs0003    0       1000003
and LGEN file f1.lgen
    1 1 rs0001 C C
    2 1 rs0001 0 0
    6 1 rs0003 C C
    1 1 rs0002 G T
    4 1 rs0002 T T
    5 1 rs0002 G T
then
plink --lfile f1 --recode

would yield a file plink.ped that is as follows:
     1 1 0 0 1 1  C C  G T  0 0
     2 1 0 0 1 1  0 0  0 0  0 0
     3 1 0 0 1 1  0 0  0 0  0 0
     4 1 0 0 1 1  0 0  T T  0 0
     5 1 0 0 1 1  0 0  G T  0 0
     6 1 0 0 1 1  0 0  0 0  C C     
If the reference allele for each variant was set, e.g. with the following command
plink --lfile f1 --reference ref.txt --recode

and the file ref.txt is
    rs0001 A
    rs0002 G
    rs0009 T
then the output plink.ped will instead read:
     1 1 0 0 1 1  C C  T G  0 0
     2 1 0 0 1 1  0 0  G G  0 0
     3 1 0 0 1 1  A A  G G  0 0
     4 1 0 0 1 1  A A  T T  0 0
     5 1 0 0 1 1  A A  T G  0 0
     6 1 0 0 1 1  A A  G G  C C
That is, the non-specified genotypes for the first two SNPs are now homozygous for the reference allele. Note: the word reference is used in the context of the human genome reference allele, rather than for the calculation of an odds ratio. The command to set the latter is --reference-allele {file}

Also note in this example, that a) when an individual is set as explicitly missing in the LGEN file, they stay missing, b) that when a reference allele is not set, then non-specified genotypes are missing (e.g. the third SNP, rs0003), c) that SNPs in the reference file that are not present in the dataset (e.g. rs0009) are ignored.
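The three rules above can be condensed into a short Python sketch. fill_reference is a hypothetical helper, people are identified by family ID only for brevity, and called genotypes keep the allele order given in the LGEN file.

```python
def fill_reference(people, snps, calls, reference, missing="0"):
    """Return {person: [genotype per SNP]} after applying reference alleles."""
    result = {}
    for person in people:
        row = []
        for snp in snps:
            if (person, snp) in calls:
                row.append(calls[(person, snp)])        # explicit call, even if missing
            elif snp in reference:
                row.append((reference[snp], reference[snp]))
            else:
                row.append((missing, missing))          # no reference allele known
        result[person] = row
    return result


people = ["1", "2", "3", "4", "5", "6"]
snps = ["rs0001", "rs0002", "rs0003"]
calls = {
    ("1", "rs0001"): ("C", "C"), ("2", "rs0001"): ("0", "0"),
    ("6", "rs0003"): ("C", "C"), ("1", "rs0002"): ("G", "T"),
    ("4", "rs0002"): ("T", "T"), ("5", "rs0002"): ("G", "T"),
}
reference = {"rs0001": "A", "rs0002": "G", "rs0009": "T"}   # rs0009 is not in the data
filled = fill_reference(people, snps, calls, reference)
print(filled["2"])   # [('0', '0'), ('G', 'G'), ('0', '0')]
print(filled["6"])   # [('A', 'A'), ('G', 'G'), ('C', 'C')]
```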

When reading a long-format file, the command
     --allele-count
when specified along with --reference allows the data to be in the form of the number of non-reference alleles. For example, if the input LGEN file were
    1 1 rs0001 0
    2 1 rs0001 1 
    3 1 rs0001 2 
    4 1 rs0001 -1  
    5 1 rs0001 9 
    6 1 rs0001 X 
this should translate into the first three individuals having the reference homozygote (0 non-reference alleles), the heterozygote (1 non-reference allele) and the non-reference homozygote (2 non-reference alleles). The final three individuals (FID 4 to 6) are all set to missing: this just indicates that any value other than a 0, 1 or 2 under this scheme is set to a missing genotype. If the reference file only contains a single allele for that SNP, then the non-reference allele is coded as whatever is in the reference allele plus a v character appended, e.g. just considering this one SNP:
      1 1 0 0 1 1   A  A
      2 1 0 0 1 1   A  Av
      3 1 0 0 1 1   Av Av
      4 1 0 0 1 1   0  0
      5 1 0 0 1 1   0  0
      6 1 0 0 1 1   0  0
However, if the reference file contains two alleles, then the second is taken to be the non-reference allele, e.g. if ref.txt is
   rs0001 A  G
then the output will read
     1 1 0 0 1 1 A A
     2 1 0 0 1 1 A G
     3 1 0 0 1 1 G G
     4 1 0 0 1 1 0 0
     5 1 0 0 1 1 0 0
     6 1 0 0 1 1 0 0
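The decoding of allele counts can be written as a small lookup in Python. This is a sketch of the rule as described here, with decode_allele_count as a hypothetical name.

```python
def decode_allele_count(code, ref, alt=None, missing="0"):
    """Map a non-reference allele count (0, 1 or 2) to a genotype."""
    if alt is None:
        alt = ref + "v"              # no second allele given: append 'v'
    table = {"0": (ref, ref), "1": (ref, alt), "2": (alt, alt)}
    return table.get(code, (missing, missing))   # anything else is missing


codes = ["0", "1", "2", "-1", "9", "X"]
print([decode_allele_count(c, "A") for c in codes][:3])
# [('A', 'A'), ('A', 'Av'), ('Av', 'Av')]
print([decode_allele_count(c, "A", "G") for c in codes][:3])
# [('A', 'A'), ('A', 'G'), ('G', 'G')]
```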

Binary PED files

To save space and time, you can make a binary ped file (*.bed). This will store the pedigree/phenotype information in a separate file (*.fam) and create an extended MAP file (*.bim) (which contains information about the allele names, which would otherwise be lost in the BED file). To create these files use the command:
plink --file mydata --make-bed

which creates (by default)
     plink.bed      ( binary file, genotype information )
     plink.fam      ( first six columns of mydata.ped ) 
     plink.bim      ( extended MAP file: two extra cols = allele names)
The .fam and .bim files are still plain text files: these can be viewed with a standard text editor. Do not try to view the .bed file however: it is a compressed file and you'll only see lots of strange characters on the screen...

NOTE Do not make any changes to any of these three files; e.g. setting the position to a negative value will not work to exclude a SNP for binary files

You can specify a different output root file name (i.e. different to "plink") by using the --out option:
plink --file mydata --out mydata --make-bed

which will create
     mydata.bed
     mydata.fam
     mydata.bim
To subsequently load a binary file, just use --bfile instead of --file
plink --bfile mydata

When creating a binary ped file, the MAF and missingness filters are set to include everybody and all SNPs. If you want to change these, use --maf, --geno, etc, to manually specify these options: for example,
plink --file mydata --make-bed --maf 0.02 --geno 0.1

More information... If you want to write your own software that uses the BED file format, please follow this link for more information of the specification.

Alternate phenotype files

To specify an alternate phenotype for analysis, i.e. other than the one in the *.ped file (or, if using a binary fileset, the *.fam file), use the --pheno option:
plink --file mydata --pheno pheno.txt

where pheno.txt is a file that contains 3 columns (one row per individual):
     Family ID
     Individual ID
     Phenotype
The original PED file must still contain a phenotype in column 6 (even if this is a dummy phenotype, e.g. all missing), unless the --no-pheno flag is given.

If an individual is in the original file but not listed in the alternate phenotype file, that person's phenotype will be set to missing. If a person is in the alternate phenotype file but not in the original file, that entry will be ignored. The order of the alternate phenotype file need not be the same as for the original file. If the phenotype file contains more than one phenotype, then use the --mpheno N option to specify the Nth phenotype is the one to be used:
plink --file mydata --pheno pheno2.txt --mpheno 4

where pheno2.txt contains 5 different phenotypes (i.e. 7 columns in total), this command will use the 4th for analysis (phenotype D):
     Family ID
     Individual ID
     Phenotype A
     Phenotype B
     Phenotype C
     Phenotype D
     Phenotype E
Alternatively, your alternate phenotype file can have a header row, in which case you can use variable names to specify which phenotype to use. If you have a header row, the first two variables must be labelled FID and IID. All subsequent variable names cannot have any whitespace in them. For example,
     FID    IID      qt1   bmi    site  
     F1     1110     2.3   22.22  2     
     F2     2202     34.12 18.23  1     
     ...
then
plink --file mydata --pheno pheno2.txt --pheno-name bmi --assoc

will select the second phenotype labelled "bmi", for analysis
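The column selection and matching rules for alternate phenotype files can be sketched in Python as follows. read_pheno and apply_pheno are hypothetical helpers that imitate the behaviour described above for --mpheno and --pheno-name; they are not PLINK code.

```python
def read_pheno(lines, mpheno=1, pheno_name=None):
    """Return {(FID, IID): phenotype string} for the selected phenotype."""
    rows = [line.split() for line in lines if line.strip()]
    has_header = rows[0][:2] == ["FID", "IID"]
    if pheno_name is not None:
        if not has_header:
            raise ValueError("phenotype names need a header row")
        column = rows[0].index(pheno_name)
    else:
        column = mpheno + 1          # phenotype N follows the two ID columns
    if has_header:
        rows = rows[1:]
    return {(r[0], r[1]): r[column] for r in rows}


def apply_pheno(sample_ids, alternate, missing="-9"):
    """People absent from the alternate file get the missing code;
    entries for people not in the dataset are ignored."""
    return {key: alternate.get(key, missing) for key in sample_ids}


pheno_lines = [
    "FID    IID      qt1   bmi    site",
    "F1     1110     2.3   22.22  2",
    "F2     2202     34.12 18.23  1",
]
bmi = read_pheno(pheno_lines, pheno_name="bmi")
print(bmi)
# {('F1', '1110'): '22.22', ('F2', '2202'): '18.23'}
print(apply_pheno([("F1", "1110"), ("F3", "1")], bmi))
# {('F1', '1110'): '22.22', ('F3', '1'): '-9'}
```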

Finally, if there is more than one phenotype, then for basic association tests, it is possible to specify that all phenotypes be tested, sequentially, with the output sent to different files: e.g. if bigpheno.raw contains 10,000 phenotypes, then
plink --bfile mydata --assoc --pheno bigpheno.raw --all-pheno

will loop over all of these, one at a time testing for association with SNP, generating a lot of output. You might want to use the --pfilter command in this case, to only report results with a p-value less than a certain value, e.g. --pfilter 1e-3.

WARNING Currently, all phenotypes must be numerically coded, including missing values, in the alternate phenotype file. The default missing value is -9, change this with --missing-phenotype, but it must be a numeric value still (in contrast to the main phenotype in the PED/FAM file).
Creating a new binary phenotype automatically
To automatically form a one-versus-others binary phenotype (note: binary meaning dichotomous here, rather than a BED/binary-PED file) from a categorical covariate/phenotype file, use the command
plink --bfile mydata --make-pheno site.cov SITE3 --assoc

which assumes the file
     site.cov
contains exactly three fields
     Family ID
     Individual ID
     Code from which phenotype is created
For example, if it were
     A1  1  SITE1
     B1  1  SITE1
     C1  1  SITE2
     D1  1  SITE3
     E1  1  SITE3
     F1  1  SITE4
     G2  1  SITE4
then the above command would make individuals D1 and E1 as cases and everybody else as controls. However, if individuals present in mydata were not specified in site.cov, then these people would be set to have a missing phenotype.

An alternate specification is to use the * symbol instead of a value (in a Unix shell, quote it as '*' so that the shell does not expand it to a list of file names), e.g.
plink --bfile mydata --make-pheno p1.list * --assoc

which assumes the file
     p1.list
contains exactly two fields
     Family ID
     Individual ID
In this case, anybody in the file p1.list would be made a case; all other individuals in mydata but not in p1.list would be set as a control.
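Both forms of --make-pheno can be summarised in a short Python sketch. make_pheno below is a hypothetical function that applies the rules described above, using the default affection coding (1 unaffected, 2 affected, -9 missing).

```python
def make_pheno(sample_ids, lines, value):
    """Derive a case/control phenotype from a list or a categorical file."""
    listed = {}
    for line in lines:
        fields = line.split()
        listed[(fields[0], fields[1])] = fields[2] if len(fields) > 2 else None
    pheno = {}
    for key in sample_ids:
        if value == "*":
            pheno[key] = "2" if key in listed else "1"   # listed people are cases
        elif key not in listed:
            pheno[key] = "-9"                            # not in the file: missing
        else:
            pheno[key] = "2" if listed[key] == value else "1"
    return pheno


site_cov = ["A1 1 SITE1", "B1 1 SITE1", "C1 1 SITE2", "D1 1 SITE3",
            "E1 1 SITE3", "F1 1 SITE4", "G2 1 SITE4"]
samples = [("A1", "1"), ("D1", "1"), ("E1", "1"), ("Z9", "1")]
by_site = make_pheno(samples, site_cov, "SITE3")
print(by_site)
# {('A1', '1'): '1', ('D1', '1'): '2', ('E1', '1'): '2', ('Z9', '1'): '-9'}
by_list = make_pheno(samples, ["D1 1"], "*")
print(by_list)
# {('A1', '1'): '1', ('D1', '1'): '2', ('E1', '1'): '1', ('Z9', '1'): '1'}
```

Here Z9 is an invented individual who is present in the dataset but absent from both files.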
"Loop association": automatically testing each group versus all others
You may have a categorical factor that groups individuals (e.g. which plate they were genotyped on, or which sample they come from) and want to test whether there are allele frequency differences between each group and all others. This can be accomplished with the --loop-assoc command, e.g.
plink --bfile mydata --loop-assoc plate.lst --assoc

The file plate.lst should be in the same format as a cluster file, although it is only allowed to have a single variable (i.e. 3 columns, FID, IID and the cluster variable). If this were
   10001  1   P1
   10002  1   P1
   10003  1   P2
   10004  1   P2
   10005  1   P3   
   10006  1   P3
   ...
This command would test all P1 individuals against all others, then all P2 individuals against all others, etc. Any of the main single SNP association tests for diseases can be supplied instead of --assoc (e.g. --fisher, --test-missing, --logistic, etc). The output is written to different files for each group, e.g. in the format outputname.{label}.extension
     plink.P1.assoc
     plink.P2.assoc
     plink.P3.assoc
     ...
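As a rough sketch of what the loop does, the following Python fragment lists, for each label in the example file, the individuals in the group, the individuals in the comparison set, and the output filename that follows the outputname.{label}.extension pattern. The function loop_plan is hypothetical and only mirrors the description above; it does not run any association test.

```python
# Illustrative sketch only: enumerates the group-versus-others comparisons.
# loop_plan is a hypothetical helper, not part of PLINK.

def loop_plan(lines, root="plink", ext="assoc"):
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) == 3:  # FID, IID, cluster variable
            rows.append(((fields[0], fields[1]), fields[2]))

    # Keep labels in order of first appearance
    labels = []
    for _, label in rows:
        if label not in labels:
            labels.append(label)

    plan = {}
    for label in labels:
        plan[label] = {
            "file": f"{root}.{label}.{ext}",
            "group": [key for key, lab in rows if lab == label],
            "others": [key for key, lab in rows if lab != label],
        }
    return plan


plate_lst = """
10001  1   P1
10002  1   P1
10003  1   P2
10004  1   P2
10005  1   P3
10006  1   P3
""".splitlines()

plan = loop_plan(plate_lst)
for label, info in plan.items():
    print(label, info["file"], len(info["group"]), "vs", len(info["others"]))
```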

Covariate files

Certain PLINK commands support the inclusion of one or more covariates. Note that for stratified analyses, namely using the CMH (--mh) options, the strata are specified using the --within option to define clusters, rather than --covar.

To load a covariate use the option:
plink --file mydata --covar c.txt

The covariate file should be formatted in a similar manner to the phenotype file. If an individual is not present in the covariate file, or has a missing value (i.e. -9 by default) for the covariate, then that individual is set to missing (i.e. will be excluded from association analysis).

To select a particular subset of covariates, use one of the following commands, which select covariates either by number or by name (names require a header row in the file),
plink --file mydata --covar c.txt --covar-number 2,4-6,8

or
plink --file mydata --covar c.txt --covar-name AGE,BMI-SMOKE,ALC

Note that ranges can be used in both cases, using the hyphen (-) symbol. For example, if the first row were
     FID IID SITE AGE DOB BMI ETH SMOKE STATUS ALC 
then both the above commands would have the same effect, i.e. selecting AGE, BMI, ETH, SMOKE, ALC.
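The equivalence of the two selections can be checked with a small Python sketch. It assumes that covariates are numbered from 1 starting at the first column after FID and IID, which is what the example implies; the helper names are hypothetical and the range handling is a simplification (for instance, it would not cope with covariate names that themselves contain a hyphen).

```python
# Illustrative sketch only: expands number and name specifications with ranges.
# select_by_number / select_by_name are hypothetical helpers, not PLINK functions.

def select_by_number(header, spec):
    covars = header.split()[2:]  # drop FID and IID
    chosen = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            chosen.extend(covars[int(lo) - 1:int(hi)])  # inclusive 1-based range
        else:
            chosen.append(covars[int(part) - 1])
    return chosen


def select_by_name(header, spec):
    covars = header.split()[2:]  # drop FID and IID
    chosen = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            i, j = covars.index(first), covars.index(last)
            chosen.extend(covars[i:j + 1])  # inclusive range by position
        else:
            chosen.append(covars[covars.index(part)])  # raises if name is absent
    return chosen


header = "FID IID SITE AGE DOB BMI ETH SMOKE STATUS ALC"
by_number = select_by_number(header, "2,4-6,8")
by_name = select_by_name(header, "AGE,BMI-SMOKE,ALC")
print(by_number)
print(by_name)
```

Both calls return AGE, BMI, ETH, SMOKE and ALC, in that order.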

To output a new covariate file, possibly with categorical variables downcoded to binary dummy variables, use the --write-covar option as described here.

Exception The --gxe command uses only a single covariate. To choose which one, use the --mcovar command, which works similarly to --mpheno. The --covar-name and --covar-number options will not work with --gxe.

NOTE Not all commands accept covariates, and PLINK will not always give you an error or warning. The basic association tests (--assoc, --mh, --model, --tdt, --dfam, and --qfam) do not accept covariates, and neither do the basic haplotype association methods (--hap-assoc, --hap-tdt). Among the commands that do are --linear, --logistic, --chap and --proxy-glm. Also, --gxe accepts a single covariate only (the others listed here accept multiple covariates).

Cluster files

To load a cluster solution, or indeed any categorical grouping of the sample, use the --within option:
plink --file mydata --within f.txt

If this option is used, then permutation procedures will permute within-cluster only, effectively controlling for any effect of cluster membership. Similarly, for tests that perform stratified analyses, such as the Cochran-Mantel-Haenszel test, this option is used to define the strata.

This file should have a similar structure to the alternate phenotype file. The clusters can be coded either numerically or as strings:
     F1 I1  A
     F2 I1  B
     F3 I1  B
     F4 I1  C1
     F5 I1  A
     F6 I1  C2
     F7 I1  C2
     ...
Here, individuals would be grouped into four clusters:
     Cluster A:  F1/I1  F5/I1
     Cluster B:  F2/I1  F3/I1
     Cluster C1: F4/I1  
     Cluster C2: F6/I1  F7/I1
     ...
Each individual should be assigned to exactly one cluster in the cluster file.
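The grouping shown above can be reproduced with a short Python sketch. The function read_clusters is hypothetical; in particular, raising an error when an individual is listed under two different clusters is a choice made for this illustration, not a statement about how PLINK handles that case.

```python
# Illustrative sketch only: groups individuals by the third column of a cluster file.
# read_clusters is a hypothetical helper, not part of PLINK.

def read_clusters(lines):
    assignment = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        fid, iid, cluster = fields
        key = (fid, iid)
        # Assumption for this sketch: conflicting assignments are rejected
        if key in assignment and assignment[key] != cluster:
            raise ValueError(f"{fid}/{iid} is assigned to more than one cluster")
        assignment[key] = cluster

    clusters = {}
    for (fid, iid), cluster in assignment.items():
        clusters.setdefault(cluster, []).append(f"{fid}/{iid}")
    return clusters


cluster_file = """
F1 I1  A
F2 I1  B
F3 I1  B
F4 I1  C1
F5 I1  A
F6 I1  C2
F7 I1  C2
""".splitlines()

clusters = read_clusters(cluster_file)
for name, members in clusters.items():
    print(f"Cluster {name}:", "  ".join(members))
```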

Set files

Certain analyses (e.g. set-based tests) require sets of SNPs to be specified. This is done by including the --set option on the command line, followed by the name of a file that defines the sets. This file (e.g. mydata.set) should be in the following format:
SET_A 
rs10101
rs20234
rs29993
END

GENE-B
rs2344
rs888833
END
That is, each set must start with a set name (e.g. SET_A), which might be a gene name, for example. This name cannot contain any spaces. The name is followed by a list of SNPs in that set. The keyword END marks the end of that particular set. Do not give any SNP the name END!

Sets can be overlapping. Any SNPs specified in the set that do not appear in the actual data, or that have been excluded due to filters used, will be ignored.

The format is flexible in terms of whether each item appears on one line: the set file only needs to be whitespace delimited. For example, the file above could be specified as:
SET_A    rs10101 rs20234 rs29993 END
GENE-B   rs2344 rs888833 END
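Because only whitespace matters, a set file can be read as a flat stream of tokens. The following Python sketch shows one way to do this and confirms that the multi-line and single-line layouts give the same sets. The function parse_sets is hypothetical, and treating a missing final END as an error is an assumption of the sketch rather than documented PLINK behaviour.

```python
# Illustrative sketch only: reads a whitespace-delimited set file.
# parse_sets is a hypothetical helper, not part of PLINK.

def parse_sets(text):
    sets = {}
    name = None
    for token in text.split():  # any whitespace, including newlines, separates items
        if name is None:
            name = token        # first token of a block is the set name
            sets.setdefault(name, [])
        elif token == "END":
            name = None         # END closes the current set
        else:
            sets[name].append(token)
    if name is not None:
        # Assumption for this sketch: an unterminated set is treated as an error
        raise ValueError(f"set {name!r} is missing its END keyword")
    return sets


multi_line = """
SET_A
rs10101
rs20234
rs29993
END

GENE-B
rs2344
rs888833
END
"""

one_per_line = """
SET_A    rs10101 rs20234 rs29993 END
GENE-B   rs2344 rs888833 END
"""

sets_a = parse_sets(multi_line)
sets_b = parse_sets(one_per_line)
print(sets_a)
```

The token-based approach also makes clear why a SNP cannot be called END: it would be read as the end of the set.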

HINT It is possible to automatically create a set-file, given a list of genomic co-ordinates, using the --make-set command, described here.

To extract a subset of sets from a set file, use the --subset command in addition to --set. For example,
--set mydata.set --subset extract.txt

where extract.txt is a text file with the set names you wish to extract, e.g. SET_A or GENE-B in this example.
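The effect of --subset, together with the rule that SNPs absent from the data are ignored, can be sketched as follows. The helper subset_sets, the contents of extract.txt, and the list of SNPs assumed to be in the data are all hypothetical; the sketch also assumes that extract.txt simply lists set names separated by whitespace.

```python
# Illustrative sketch only: keeps the requested sets and drops SNPs not in the data.
# subset_sets and the example inputs are hypothetical.

def subset_sets(sets, wanted_names, snps_in_data):
    kept = {}
    for name, snps in sets.items():
        if name in wanted_names:
            # SNPs that are not present in the data are silently ignored
            kept[name] = [snp for snp in snps if snp in snps_in_data]
    return kept


sets = {
    "SET_A": ["rs10101", "rs20234", "rs29993"],
    "GENE-B": ["rs2344", "rs888833"],
}

extract_txt = "SET_A\n"                 # hypothetical contents of extract.txt
wanted = set(extract_txt.split())

snps_in_data = {"rs10101", "rs29993", "rs2344"}  # hypothetical SNPs in the dataset

subset = subset_sets(sets, wanted, snps_in_data)
print(subset)
```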
 

This document last modified Friday, 12-Mar-2010 12:50:48 EST