PLINK: Whole genome data analysis toolset
Latest PLINK release is v1.03 (10-Jun-2008)

Whole genome association analysis toolset

Data management tools

PLINK provides a simple interface for recoding, reordering, merging, flipping DNA-strand and extracting subsets of data.

Recode and reorder a sample

A basic, but often useful feature, is to output a dataset:
  1. with the PED file markers reordered for physical position,
  2. with excluded SNPs (negative values in the MAP file) excluded from the new PED file
  3. possibly excluding other SNPs based on filters such as genotyping rate
  4. possibly recoding the SNPs to a 1/2 coding
  5. possibly recoding the SNPs between letters and numbers (A,C,G,T / 1,2,3,4)
  6. possibly transposing the genotype file (SNPs as rows)
  7. possibly recoding the SNP to an additive and dominant pair of components
  8. possibly listing the data with each specific genotype as a distinct row
The basic option to generate a new dataset is the --recode option:
plink --file data --recode

which will output the allele labels as they appear in the original; also, the missing genotype code is preserved if this is different from 0.

The --make-bed option does the same as --recode but creates binary files; these can also be filtered, etc, as described below.

In contrast,

plink --file data --recode12

will recode the alleles as 1 and 2 (and the missing genotype will always be 0).

Both these commands will create two new files
     plink.recode.ped
     plink.recode.map
(where, as usual, "plink" would be replaced by any specified --out {filename} ).

Unless manually specified, for all these options, the usual filters for missingness and allele frequency will be set so as not to exclude any SNPs or individuals. By explicitly including an option, e.g. --maf 0.05 on the command line, this behaviour is overridden (see this page).

By default, any --recode option, and also --make-bed, will preserve all genotypes exactly as they are. To set to missing Mendel errors or heterozygous haploid calls, use the options --set-me-missing and --set-hh-missing respectively. For the former, you will also need to specify --me 1 1 (i.e. to invoke an evaluation of Mendel errors, which does not occur by default, but without excluding any individuals or SNPs based on the results, i.e. if you only want to zero-out certain genotypes).

To recode SNP alleles from A,C,G,T to 1,2,3,4 or vice versa, use --allele1234 (to go from letters to numbers) and --alleleACGT (to go from numbers to letters). These flags should be used in conjunction with a data generation command (e.g. --make-bed), or any other analysis or summary statistic option. Alleles other than A,C,G,T or 1,2,3,4 will be left unchanged.

It is sometimes useful to have a PED file that is tab-delimited, except that between alleles of the same genotype a space instead of a tab is used. A file formatted in this way can load into Excel, for example, as a tab-delimited file, but with one genotype per column instead of one allele per column. Use the option --tab as well as --recode or --recode12 to achieve this effect.

To make a new file in which non-founders without both parents also in the same fileset are recoded as founders (i.e. pat and mat codes set both to 0), add the --make-founders flag.
Transposed genotype files
When using either --recode or --recode12, you can obtain a transposed text genotype file by adding the --transpose option. This generates two files:
     plink.recode.tped
     plink.recode.fam
The first contains the genotype data, with SNPs as rows and individuals as columns, for example: if the original file was
     1 1 0 0 1  1  1 1  G G
     1 2 0 0 2  1  0 0  A G
     1 3 0 0 1  1  1 1  A G
     1 4 0 0 2  1  2 1  A A
then this would generate
     1 snp1 0 10001  1 1  0 0  1 1  2 1
     1 snp2 0 20001  G G  G A  G A  A A
The first four columns are from the MAP file (chromosome, SNP ID, genetic position, physical position), followed by the genotype data. The plink.recode.fam gives the ID, sex and phenotype information for each individual. The order of individuals in this file is the same as the order across the columns of the TPED file. The FAM file is just the first six columns of the PED file (or literally the same FAM file if the input were a binary fileset).
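The relationship between the two layouts can be sketched in Python. This is only an illustration of the row/column transposition, not PLINK's own code: the function name ped_to_tped is hypothetical, the MAP rows are inferred from the TPED output shown above, and the sketch keeps the two alleles of each genotype in input order, whereas PLINK may print the alleles of a heterozygote in a different order (as in the G A calls above).

```python
def ped_to_tped(ped_lines, map_lines):
    """Transpose PED/MAP text rows into TPED and FAM rows (illustrative only)."""
    ped = [line.split() for line in ped_lines]
    # The FAM file is just the first six columns of each PED row
    fam = [" ".join(row[:6]) for row in ped]
    tped = []
    for j, map_line in enumerate(map_lines):
        chrom, snp, cm, bp = map_line.split()
        # The two alleles for SNP j sit in columns 6+2j and 7+2j of every PED row
        calls = [f"{row[6 + 2 * j]} {row[7 + 2 * j]}" for row in ped]
        tped.append(" ".join([chrom, snp, cm, bp] + calls))
    return tped, fam


ped_lines = [
    "1 1 0 0 1  1  1 1  G G",
    "1 2 0 0 2  1  0 0  A G",
    "1 3 0 0 1  1  1 1  A G",
    "1 4 0 0 2  1  2 1  A A",
]
# MAP rows assumed from the first four columns of the TPED example
map_lines = ["1 snp1 0 10001", "1 snp2 0 20001"]

tped, fam = ped_to_tped(ped_lines, map_lines)
for row in tped:
    print(row)
```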
Additive and dominance components
The following format is often useful if one wants to use a standard, non-genetic statistical package to analyse the data, as here genotypes are coded as a single allele dosage number. To create a file with SNP genotypes recoded in terms of additive and dominant components, use the option:
plink --file data --recodeAD

which, assuming C is the minor allele, will recode genotypes as follows:
     SNP       SNP_A ,  SNP_HET
     ---       -----    -----
     A A   ->    0   ,   0
     A C   ->    1   ,   1
     C C   ->    2   ,   0
     0 0   ->   NA   ,  NA
In other words, the default for the additive recoding is to count the number of minor alleles per person. The --recodeAD option produces both an additive and dominance coding: use --recodeA instead to skip the SNP_HET coding.

The --recodeAD option saves the data to a single file
     plink.recode.raw
which has a header row indicating the SNP names (with _A and _HET appended to the SNP names to represent additive and dominant components, respectively).

For example, consider the following PED file, which has two SNPs:
     1 1 0 0 1  1  1 1  G G
     1 2 0 0 2  1  0 0  A G
     1 3 0 0 1  1  1 1  A G
     1 4 0 0 2  1  2 1  A A
Using the --recodeAD option generates the file plink.recode.raw:
     FID IID PAT MAT SEX PHENOTYPE snp1_2 snp1_HET snp2_G snp2_HET
     1 1 0 0 1 1  0  0   2 0
     1 2 0 0 2 1  NA NA  1 1
     1 3 0 0 1 1  0  0   1 1
     1 4 0 0 2 1  1  1   0 0
The column labels reflect the SNP name (e.g. snp1) with the name of the minor allele appended (i.e. snp1_2 in the first instance, as 2 is the minor allele) for the additive component. The dominant component (a dummy variable reflecting heterozygote state) is coded with the _HET suffix.

This file can be easily loaded into R: for example:
     d <- read.table("plink.recode.raw",header=T)
Returning to the example file: for the first SNP, the individuals are coded 1/1, 0/0, 1/1 and 2/1. The additive count of the number of minor (2) alleles is therefore 0, NA, 0 and 1, which is reflected in the field snp1_2. The field snp1_HET is coded 1 for the fourth individual, who is heterozygous; this field can be used to model a dominance effect of the allele.

The behavior of the --recodeA and --recodeAD commands can be changed with the --recode-allele command. This allows for the 0, 1, 2 count to reflect the number of a pre-specified allele type per SNP, rather than the number of the minor allele. This command takes as a single argument the name of a file that lists SNP name and allele to report, e.g. if the file recode.txt contained
     snp1   1
     snp2   A
then
plink --file data --recodeAD --recode-allele recode.txt

would now report in the LOG file
     Reading allele coding list from [ recode.txt ] 
     Read allele codes for 2 SNPs
and the plink.recode.raw file would read
     FID IID PAT MAT SEX PHENOTYPE snp1_1 snp1_HET snp2_A snp2_HET
     1 1 0 0 1 1   2  0   0 0
     1 2 0 0 2 1   NA NA  1 1
     1 3 0 0 1 1   2  0   1 1
     1 4 0 0 2 1   1  1   2 0
If the SNP is monomorphic, by default the allele code in the output will be 0 and all individuals will have a count of 0 (or NA). If an allele is specified in --recode-allele that is not seen in the data, similarly all individuals will receive a 0 count (i.e. rather than an error being given).

NOTE For alleles that have exactly 0.50 minor allele frequency, as for the second SNP in the example above, then which allele is labelled as minor will depend on which was first encountered in the PED file.
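The additive and dominance coding described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not PLINK's implementation: the function name recode_ad and its count_allele parameter (standing in for the per-SNP allele given via --recode-allele) are hypothetical, ties at exactly 0.50 are resolved in favour of the first allele encountered, and None stands in for NA.

```python
from collections import Counter


def recode_ad(genotypes, count_allele=None, missing="0"):
    """Return (counted_allele, additive, het) for one SNP.

    genotypes: list of (allele1, allele2) tuples, one per individual.
    count_allele: allele to count; if None, the minor allele is used.
    """
    counts = Counter()
    for a1, a2 in genotypes:
        for allele in (a1, a2):
            if allele != missing:
                counts[allele] += 1
    if count_allele is None:
        # Counter keeps insertion order, so min() returns the first-seen
        # allele on a tie; a monomorphic SNP gets the code 0 (assumption
        # taken from the text above)
        count_allele = min(counts, key=counts.get) if len(counts) > 1 else missing
    additive, het = [], []
    for a1, a2 in genotypes:
        if missing in (a1, a2):
            additive.append(None)  # written as NA in the .raw file
            het.append(None)
        else:
            additive.append((a1 == count_allele) + (a2 == count_allele))
            het.append(int(a1 != a2))
    return count_allele, additive, het


snp1 = [("1", "1"), ("0", "0"), ("1", "1"), ("2", "1")]
snp2 = [("G", "G"), ("A", "G"), ("A", "G"), ("A", "A")]
print(recode_ad(snp1))
print(recode_ad(snp2, count_allele="A"))
```

Applied to the two SNPs of the example PED file, the default call gives the snp1_2 and snp2_G columns, and passing count_allele gives the snp1_1 and snp2_A columns.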

Listing by genotype
Another format that might sometimes be useful is the --list option which generates a file
     plink.recode.list
that is ordered one genotype per row, listing all family and individual IDs of people with that genotype. For example, if we have a file with two SNPs rs1001 and rs2002 (both on chromosome 1):
     A 1 0 0 1  2  A A  1 1
     B 2 0 0 1  2  A C  0 0
     C 3 0 0 1  1  A C  1 2
     D 4 0 0 1  1  C C  1 2
then the option
plink --file mydata --list

will generate the file plink.recode.list
     1 rs1001 AA A 1
     1 rs1001 AC B 2 C 3
     1 rs1001 CC D 4
     1 rs1001 00
     1 rs2002 22
     1 rs2002 21 C 3 D 4
     1 rs2002 11 A 1
     1 rs2002 00 B 2
which has columns
     Chromosome
     SNP identifier
     Genotype
     Family ID, Individual ID for 1st person
     Family ID, Individual ID for 2nd person
     ...
     Family ID, Individual ID for final person
Obviously, different rows will have a different number of columns. Here, we see that individual A 1 has the A/A genotype for rs1001, etc. This option is often useful in conjunction with --snp, if you want an easy breakdown of which individuals have which genotypes.

Write SNP list files

To output just the list of SNPs that remain after all filtering, etc, use the --write-snplist command, e.g. to get a list of all high frequency, high genotyping-rate SNPs:
plink --bfile mydata --maf 0.05 --geno 0.05 --write-snplist

which generates a file
     plink.snplist
This file is simply a list of included SNP names, i.e. the same SNPs that a --recode or --make-bed statement would have produced in the corresponding MAP or BIM files.

Update SNP positions

To automatically update either the genetic or physical positions for some or all SNPs in a dataset, use the --update-map command, which takes a single parameter of a filename, e.g.
plink --bfile mydata --update-map build36.txt --make-bed --out mydata2

where, for example, the file build36.txt contains new physical positions for SNPs, based on dbSNP126/build 36, in the simple format of SNP/position per line, e.g.
     rs100001  1000202
     rs100002  6252678
     rs100003  7635353
     ...
To change genetic position (3rd column in the MAP file), add the flag --update-cm as well as --update-map. There is no way to change chromosome codes using this command. Normally, one would want to save the new file with the changed positions, as in the example above. One could instead combine --update-map with other commands (e.g. association testing), but the updated positions would then be lost, i.e. the changes are not automatically saved.

Not all SNPs need feature in the file supplied here; any SNP that is not listed will keep its old position. If a SNP is listed more than once in this file, an error will be reported. Importantly, if this command changes the implied ordering of SNPs, a message will be written to the command line. Note, the order of SNPs will not be changed in the existing dataset with this command, only the positions. If the order has changed, then any command which relies on relative SNP positions (e.g. --hap-window, --homozyg, etc) should not be used on that dataset. In this case, it is necessary to save the file; the SNPs will then be automatically re-ordered upon reloading. If the LOG file does not show a message that the order of SNPs has changed, one need not worry.
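What "changing the implied ordering" means can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than PLINK's own check: the function name apply_update_map is hypothetical, the MAP rows are assumed to start out sorted by position within each chromosome, and the physical positions in the example are made up.

```python
def apply_update_map(map_rows, new_positions):
    """Apply new physical positions and report whether SNP order is disturbed.

    map_rows: list of (chrom, snp, cm, bp) tuples in file order.
    new_positions: dict mapping SNP name to new bp; unlisted SNPs keep theirs.
    """
    updated = [(chrom, snp, cm, new_positions.get(snp, bp))
               for chrom, snp, cm, bp in map_rows]
    order_changed = False
    last_bp = {}  # last position seen on each chromosome, in file order
    for chrom, _snp, _cm, bp in updated:
        if chrom in last_bp and bp < last_bp[chrom]:
            order_changed = True  # a SNP now sits before its predecessor
        last_bp[chrom] = bp
    return updated, order_changed


# Hypothetical MAP rows
map_rows = [
    ("1", "rs100001", 0, 1000),
    ("1", "rs100002", 0, 2000),
    ("1", "rs100003", 0, 3000),
]
updated, order_changed = apply_update_map(map_rows, {"rs100001": 2500})
print(order_changed)
```

In this example rs100001 moves past rs100002 while the row order stays the same, which is the situation in which the fileset should be saved and reloaded before using position-dependent commands.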

Write covariate files

If a covariate file is specified along with any of the above --recode options or with --make-bed, then that covariate file will also be written, as plink.cov by default. This option is useful if the covariate file has a different number of individuals, or is ordered differently, to produce a set of covariate values that line up more easily with the newly-created genotype and phenotype files.
plink --file data --covar myfile.txt --recode

creates also plink.cov. If you want just to create a revised version of the covariate file, but without creating a new set of genotype files, then use the --write-covar option. This can be used in conjunction with filters, etc, to output, for example, only covariates for high-genotyping (99%) cases, as in this example:
plink --file data --covar myfile.txt --write-covar --filter-cases --mind 0.01

will output just the relevant lines of myfile.txt to plink.cov, sorted to match the order of data.ped.

To also include phenotype information in the plink.cov file add the flag --with-phenotype. This can be useful, for example, when used in conjunction with --recodeA to generate the files needed to replicate an analysis in R (e.g. extracting the appropriate genotype data, and applying filters, etc).

Write cluster files

Similar to --write-covar, the --write-cluster option will output the single selected cluster from the file specified by --within. Unlike covariate files, this allows string labels to be used.
plink --bfile mydata --within clst.dat --write-cluster --out mynewfile

Flip DNA strand for SNPs

This command will read the list of SNPs in the file list.txt and flip the strand for these SNPs, then save a new PED or BED fileset (i.e. by using either the --recode or --make-bed commands):
plink --file data --flip list.txt --recode

The list.txt should just be a simple list of SNP IDs, one SNP per line.

Flipping strand means changing alleles
   A -> T
   C -> G
   G -> C
   T -> A
so, for example, an A/C SNP will become a T/G SNP; alternatively, an A/T SNP will become a T/A SNP (i.e. in this case, the labels remain the same, but whether the minor allele is A or T will still depend on strand).

HINT When merging two datasets, it is clearly very important that the two sets of SNPs are concordant in terms of positive or negative strand. Whereas some mismatches will be easy to spot as more than two alleles will be observed in the merged dataset, other instances will not be so easy to spot, i.e. for A/T and C/G SNPs.
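The flipping rule, and the reason A/T and C/G SNPs are hard to check, can be sketched in a few lines of Python. The names COMPLEMENT, flip_genotype and is_strand_ambiguous are hypothetical, and leaving codes other than A, C, G and T (such as the missing code 0) unchanged is an assumption of this sketch.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip_genotype(allele1, allele2):
    """Flip strand for one genotype; unknown codes are left unchanged."""
    return COMPLEMENT.get(allele1, allele1), COMPLEMENT.get(allele2, allele2)


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose allele labels survive a strand flip."""
    return COMPLEMENT.get(allele1) == allele2


print(flip_genotype("A", "C"))
print(is_strand_ambiguous("A", "T"))
```

A SNP for which is_strand_ambiguous is true shows the same pair of labels on either strand, so a strand mismatch between two datasets will not reveal itself as a third allele after merging.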

Merge two filesets

To merge two PED/MAP files:
plink --file data1 --merge data2.ped data2.map --recode --out merge

The --merge option must be followed by 2 arguments: the name of the second PED file and the name of the second MAP file. A --recode (or --make-bed, etc) option is necessary to output the newly merged file; in this case, the --out option will create the files merge.recode.ped and merge.recode.map.

The --merge option can also be used with binary PED files, either as input or output, but not as the second file: i.e.
plink --bfile data1 --merge data2.ped data2.map --make-bed --out merge

will create merge.bed, merge.fam and merge.bim, as the --make-bed option was used instead of the --recode option. Likewise, the data1.* files point to a binary PED file set.

If the second fileset (data2.*) were in binary format, then you must use --bmerge instead of --merge:
plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed --out merge

which takes 3 parameters (the names of the BED, BIM and FAM files, in that order).

The two filesets can overlap completely, partially, or not at all, both in terms of markers and individuals. Genotypes that are not present in either fileset will be set to missing (i.e. if SNP_B is not measured in the first file, but it is in the second, then any individuals in the first file who are not also present in the second file will be set to missing for SNP_B).

By default, any existing genotype data (i.e. in data1.ped) will not be over-written by data in the second file (data2.ped). By specifying a --merge-mode this default behavior can be changed. The modes are:
     1    Consensus call (default)
     2    Only overwrite calls which are missing in original PED file
     3    Only overwrite calls which are not missing in new PED file
     4    Never overwrite
     5    Always overwrite
     6    Report all mismatching calls (diff mode -- do not merge)
     7    Report mismatching non-missing calls (diff mode -- do not merge)
The default (mode 1) behaviour is to call the merged genotype as missing if the original and new files contain different, non-missing calls; otherwise, the non-missing call (if any) is used, i.e.
                                Merge mode
    data1.ped ,  data2.ped  ->  1    2    3    4    5    
    ---------    ---------      -----------------------
     0/0      ,   0/0       ->  0/0  0/0  0/0  0/0  0/0
     0/0      ,   A/A       ->  A/A  A/A  A/A  0/0  A/A
     A/A      ,   0/0       ->  A/A  A/A  A/A  A/A  0/0
     A/A      ,   A/T       ->  0/0  A/A  A/T  A/A  A/T
Modes 6 and 7 effectively provide a means for comparing two PED files -- no merging is performed in these cases; rather, a list of mismatching SNPs is written to the file
     plink.diff
They should also report the concordance rate in the LOG file, based on all SNPs that feature in both sets.
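The table for modes 1 to 5 can be expressed as a small Python function. This is a sketch of the rules as tabulated above, not PLINK's code: the names merge_call and same_call are hypothetical, and treating A/T and T/A as the same call is an assumption of the sketch.

```python
MISSING = ("0", "0")


def same_call(a, b):
    """Two genotypes agree if they carry the same alleles in either order."""
    return sorted(a) == sorted(b)


def merge_call(old, new, mode=1):
    """Combine a genotype from data1 (old) and data2 (new) for modes 1-5."""
    if mode == 1:  # consensus call
        if old == MISSING:
            return new
        if new == MISSING:
            return old
        return old if same_call(old, new) else MISSING
    if mode == 2:  # only overwrite calls which are missing in the original
        return new if old == MISSING else old
    if mode == 3:  # only overwrite with calls which are not missing in the new file
        return old if new == MISSING else new
    if mode == 4:  # never overwrite
        return old
    if mode == 5:  # always overwrite
        return new
    raise ValueError("modes 6 and 7 are diff modes and do not merge")


AA, AT = ("A", "A"), ("A", "T")
for old, new in [(MISSING, MISSING), (MISSING, AA), (AA, MISSING), (AA, AT)]:
    print(old, new, [merge_call(old, new, m) for m in range(1, 6)])
```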

A warning will be given if the chromosome and/or physical position differ between the two MAP files.

NOTE Alleles must be exactly coded to match: that is, PLINK will not assume that a {1,2,3,4} SNP coding maps onto a {A,C,G,T} coding. You can use the --allele1234 and --alleleACGT commands prior to merging to convert datasets and then merge these consistently coded files (you cannot convert and merge on the fly, i.e. simply putting --allele1234 on the command line along with --merge will not work: you need to use --allele1234 and --make-bed first).

Merge multiple filesets

To merge more than two standard and/or binary filesets, it is often more convenient to specify a single file that contains a list of PED/MAP and/or BED/BIM/FAM files and use the --merge-list option. Consider, for an extreme example, the case where each fileset contains only a single SNP, and that there are thousands of these files -- this option would help build a single fileset, in this case.

For example, consider we had 4 PED/MAP filesets (labelled fA.* through fD.*) and 4 binary filesets (labelled fE.* through fH.*). Then using the command
plink --file fA --merge-list allfiles.txt --make-bed --out mynewdata

would create the binary fileset
     mynewdata.bed
     mynewdata.bim
     mynewdata.fam
(alternatively, the --recode option could have been used instead of --make-bed to generate a standard ASCII PED/MAP fileset). In this case, the file allfiles.txt was a list of the to-be-merged files, one set per row:
     fB.ped fB.map
     fC.ped fC.map
     fD.ped fD.map
     fE.bed fE.bim fE.fam
     fF.bed fF.bim fF.fam
     fG.bed fG.bim fG.fam
     fH.bed fH.bim fH.fam

Important Each fileset must be on a line by itself: lines with two files are interpreted as PED/MAP filesets; lines with three files are interpreted as binary BED/BIM/FAM filesets. The files on a line must always be in this order (PED then MAP; BED then BIM then FAM).

Note In this case the first of the 8 filesets must be the starting fileset, i.e. the one given with --file on the command line; the list file allfiles.txt therefore contains only the remaining 8-1 = 7 filesets. The final mynewdata.* files will contain information from all 8 filesets.
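The two-versus-three rule for lines of the list file can be sketched in Python, for instance to sanity-check a list before running a long merge. The function name parse_merge_list is hypothetical, and skipping blank lines is an assumption of this sketch rather than documented behaviour.

```python
def parse_merge_list(lines):
    """Classify each line of a merge list as a text or binary fileset."""
    filesets = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) == 2:
            kind = "PED/MAP"
        elif len(fields) == 3:
            kind = "BED/BIM/FAM"
        else:
            raise ValueError(f"line {number}: expected 2 or 3 files, got {len(fields)}")
        filesets.append((kind, tuple(fields)))
    return filesets


allfiles = [
    "fB.ped fB.map",
    "fC.ped fC.map",
    "fD.ped fD.map",
    "fE.bed fE.bim fE.fam",
    "fF.bed fF.bim fF.fam",
    "fG.bed fG.bim fG.fam",
    "fH.bed fH.bim fH.fam",
]
filesets = parse_merge_list(allfiles)
print(len(filesets))
```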

The --merge-mode option can also be used with the --merge-list option, as described above: however, it is not possible to specify the "diff" features (i.e. modes 6 and 7).

Extract a subset of SNPs: command line options

There are multiple ways to extract just specific SNPs for analysis; this section describes options that use the command-line directly; the next section describes other methods that read a file containing the information.
Based on a single chromosome (--chr)
To analyse only a specific chromosome use
plink --file data --chr 6

Based on a range of SNPs (--from and --to)
To select a specific range of markers (that must all fall on the same chromosome) use, for example:
plink --bfile mydata --from rs273744 --to rs89883

Based on single SNP (and window) (--snp and --window)
Alternatively, you can specify a single SNP and, optionally, also ask for all SNPs in the surrounding region, with the --window option:
plink --bfile mydata --snp rs652423 --window 20

which extracts only SNPs within +/- 20kb of rs652423.
Based on multiple SNPs and ranges (--snps)
Alternatively, the newer --snps command is more flexible but slower than the previously described --snp and --from/--to commands. The --snps command will accept a comma-delimited list of SNPs, including ranges based on physical position. For example,
plink --bfile mydata --snps rs273744-rs89883,rs12345-rs67890,rs999,rs222

selects the same range as above (rs273744 to rs89883) but also the separate range rs12345 to rs67890 as well as the two individual SNPs rs999 and rs222. Note that SNPs need not be on the same chromosome; also, a range can span multiple chromosomes (the range is defined based on chromosome code order in that case, as well as physical position, i.e. a range from a SNP on chromosome 4 to one on chromosome 6 includes all SNPs on chromosome 5). No spaces are allowed between SNP names or ranges, i.e. it is
     --snps rs1111-rs2222,rs3333,rs4444
and not
     --snps rs1111 - rs2222, rs3333 ,rs4444
Hint Unlike the other methods mentioned above, --snps will load in all the data before extracting what it needs, whereas --snp only loads in what it needs, and so is a much faster way to extract a region from a very large dataset. As a result, if you really do want only a single SNP or a single range, use --snp (with --window) or some variant of the --from/--to commands.
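The comma-delimited syntax accepted by --snps can be illustrated with a short Python sketch that splits an argument into single SNPs and ranges. The function name parse_snps_argument is hypothetical, and the sketch assumes that SNP names themselves never contain a hyphen or a comma.

```python
def parse_snps_argument(arg):
    """Split a --snps style string into single SNPs and (start, end) ranges."""
    if any(ch.isspace() for ch in arg):
        raise ValueError("no spaces are allowed between SNP names or ranges")
    singles, ranges = [], []
    for item in arg.split(","):
        start, dash, end = item.partition("-")
        if dash:
            ranges.append((start, end))  # e.g. rs1111-rs2222
        else:
            singles.append(item)
    return singles, ranges


singles, ranges = parse_snps_argument("rs273744-rs89883,rs12345-rs67890,rs999,rs222")
print(singles)
print(ranges)
```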
Based on physical position (--from-kb, etc)
One can also select regions based on a window defined in terms of physical distance rather than SNP ID, for example:
plink --bfile mydata --chr 2 --from-kb 5000 --to-kb 10000

to select all SNPs within this 5000kb region on chromosome 2 (when using --from-kb and --to-kb you always need to specify the chromosome with the --chr option).

HINT Two alternate forms of the --from-kb command are --from-bp and --from-mb that take a parameter in terms of base-pair position or megabase position, instead of kilobase (to be used with the corresponding --to-bp and --to-mb options).
Based on a set file (--gene)
Finally, if a SET file is also specified, you can use the --gene option to extract all SNPs in that gene/region. For example, if the SET file genes.set contains two genes:
     GENE1
     rs123456
     rs10912
     rs66222
     END

     GENE2
     rs929292
     rs288222
     rs110191
     END
then
plink --file mydata --set genes.set --gene GENE2 --recode

would, for example, create a new dataset with only the 3 SNPs in GENE2.

These options can be used either with standard pedigree files (i.e. using --ped or --file) or with binary format pedigree (BED) files (i.e. using --bfile). One must combine this option with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option.

Extract a subset of SNPs: file-list options

To extract only a subset of SNPs, it is possible to specify a list of required SNPs and make a new file, or perform an analysis on this subset, by using the command
plink --file data --extract mysnps.txt

where the file is just a list of SNPs, one per line, e.g.
     snp005
     snp008
     snp101
Alternatively, you can use the command --range to modify the behavior of --extract and --exclude. If the --range flag is added, then instead of a list of SNPs, PLINK will expect a list of chromosomal ranges to be given instead, one per line.
plink --file data --extract myrange.txt --range

All SNPs within that range will then be excluded or extracted. The format of myrange.txt should be, one range per line, whitespace-separated:
     CHR     Chromosome code (1-22, X, Y, XY, MT, 0)
     BP1     Start of range, physical position in base units
     BP2     End of range, as above
For example,
     2 30000000 35000000
     2 60000000 62000000
     X 10000000 20000000
would extract/exclude all SNPs in these three regions (5Mb and 2Mb on chromosome 2 and 10Mb on chromosome X).
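The effect of a range file on a set of markers can be sketched in Python. This is an illustration only: the function names and the marker rows are hypothetical, chromosome codes are compared as plain strings, and treating both range boundaries as inclusive is an assumption of the sketch.

```python
def parse_ranges(lines):
    """Read whitespace-separated CHR BP1 BP2 lines into tuples."""
    ranges = []
    for line in lines:
        chrom, bp1, bp2 = line.split()[:3]
        ranges.append((chrom, int(bp1), int(bp2)))
    return ranges


def select_by_range(map_rows, ranges, exclude=False):
    """Return SNP names inside the ranges (or outside them if exclude=True)."""
    selected = []
    for chrom, snp, _cm, bp in map_rows:
        inside = any(c == chrom and bp1 <= bp <= bp2 for c, bp1, bp2 in ranges)
        if inside != exclude:
            selected.append(snp)
    return selected


ranges = parse_ranges([
    "2 30000000 35000000",
    "2 60000000 62000000",
    "X 10000000 20000000",
])
# Hypothetical MAP rows: (chromosome, SNP, genetic position, physical position)
map_rows = [
    ("2", "snpA", 0, 31000000),
    ("2", "snpB", 0, 40000000),
    ("X", "snpC", 0, 15000000),
    ("1", "snpD", 0, 31000000),
]
print(select_by_range(map_rows, ranges))
print(select_by_range(map_rows, ranges, exclude=True))
```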

One must combine these options with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option.

Remove a subset of SNPs

To re-write the PED/MAP files, but with certain SNPs excluded, use the option
plink --file data --exclude mysnps.txt

where the file mysnps.txt is, as for the --extract command, just a list of SNPs, one per line. As described above, the --range command can modify the behaviour of --exclude in the same manner as for --extract.

One must combine this option with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option.

NOTE Another way of removing SNPs is to make the physical position negative in the MAP file (this cannot be done for binary filesets, e.g. in the *.bim file).

Make missing a specific set of genotypes

To blank out a specific set of genotypes, use the following commands, e.g.
	--zero-cluster test.zero  --within test.clst
in conjunction with other data analysis, file generation or summary statistic commands, where the file test.zero is a list of SNPs and clusters, and test.clst is a standard cluster file.

If the original PED file is
     1  1 0 0 1 1   A A  C C  A A 
     2  1 0 0 1 1   C C  A A  C C 
     3  1 0 0 1 1   A C  A A  A C 
     4  1 0 0 1 1   A A  C C  A A 
     5  1 0 0 1 1   C C  A A  C C 
     6  1 0 0 1 1   A C  A A  A C 
     1b 1 0 0 1 1   A A  C C  A A 
     2b 1 0 0 1 1   C C  A A  C C 
     3b 1 0 0 1 1   A C  A A  A C 
     4b 1 0 0 1 1   A A  C C  A A 
     5b 1 0 0 1 1   C C  A A  C C 
     6b 1 0 0 1 1   A C  A A  A C 
and the MAP file is
     1 snp1 0 1000
     1 snp2 0 2000
     1 snp3 0 3000
and the list of SNPs/clusters to zero out in test.zero is
     snp2   C1
     snp3   C1
     snp1   C2
and the cluster file test.clst is
     1b 1 C1
     2b 1 C1
     3b 1 C1
     4b 1 C1
     5b 1 C1
     6b 1 C1
     2  1 C2
     3  1 C2
then the command
plink --file test --zero-cluster test.zero --within test.clst --recode

results in a new PED file, plink.recode.ped,
     1  1 0 0 1  1  A A C C A A
     2  1 0 0 1  1  0 0 A A C C
     3  1 0 0 1  1  0 0 A A A C
     4  1 0 0 1  1  A A C C A A
     5  1 0 0 1  1  C C A A C C
     6  1 0 0 1  1  A C A A A C
     1b 1 0 0 1  1  A A 0 0 0 0
     2b 1 0 0 1  1  C C 0 0 0 0
     3b 1 0 0 1  1  A C 0 0 0 0
     4b 1 0 0 1  1  A A 0 0 0 0
     5b 1 0 0 1  1  C C 0 0 0 0
     6b 1 0 0 1  1  A C 0 0 0 0
i.e. with the appropriate genotypes zeroed out.
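The zeroing rule in this example can be sketched in Python: a genotype is blanked when the pair (SNP, cluster of the individual) appears in the zero list. The function name zero_cluster and the data structures are hypothetical, and only three of the twelve individuals are shown.

```python
def zero_cluster(ped_rows, snp_names, zero_pairs, clusters, missing="0"):
    """Blank out genotypes for listed (SNP, cluster) pairs.

    ped_rows: list of PED rows, each already split into fields.
    snp_names: SNP names in MAP order.
    zero_pairs: set of (snp, cluster) pairs, as in test.zero.
    clusters: dict mapping (FID, IID) to cluster name, as in test.clst.
    """
    result = []
    for row in ped_rows:
        row = list(row)  # work on a copy
        cluster = clusters.get((row[0], row[1]))
        for j, snp in enumerate(snp_names):
            if (snp, cluster) in zero_pairs:
                row[6 + 2 * j] = row[7 + 2 * j] = missing
        result.append(row)
    return result


ped_rows = [
    "1  1 0 0 1 1   A A  C C  A A".split(),
    "2  1 0 0 1 1   C C  A A  C C".split(),
    "1b 1 0 0 1 1   A A  C C  A A".split(),
]
zero_pairs = {("snp2", "C1"), ("snp3", "C1"), ("snp1", "C2")}
clusters = {("1b", "1"): "C1", ("2", "1"): "C2"}
for row in zero_cluster(ped_rows, ["snp1", "snp2", "snp3"], zero_pairs, clusters):
    print(" ".join(row))
```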

HINT See the section on handling obligatory missing genotype data, which can often be useful in this context.

Extract a subset of individuals

To keep only certain individuals in a file, use the option:
plink --file data --keep mylist.txt

where the file mylist.txt is, as for the --remove command, just a list of Family ID / Individual ID pairs, one set per line, i.e. one person per line. Fields can occur after the 2nd column but they will be ignored, i.e. you could use a FAM file as the parameter of the --keep command, or have comments in the file. For example
   F101   1 
   F1001  2_B
   F3033  1_A  Drop this individual because of consent issues   
   F4442  22
would be fine.

One must combine this option with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option.

Remove a subset of individuals

To remove certain individuals from a file
plink --file data --remove mylist.txt

where the file mylist.txt is, as for the --keep command, just a list of Family ID / Individual ID pairs, one set per line, i.e. one person per line (although, as for --keep, fields after the 2nd column are allowed but they will be ignored).

One must combine this option with the desired analytic (e.g. --assoc), summary statistic (e.g. --freq) or data-generation (e.g. --make-bed) option.

Filter out a subset of individuals

Whereas the options to keep or remove individuals are based on files containing lists, it is also possible to specify a filter to include only certain individuals based on phenotype, sex or some other variable.

The basic form of the command is --filter which takes two arguments, a filename and a value to filter on, for example:
plink --file data --filter myfile.raw 1 --freq

implies a file myfile.raw exists which has a similar format to phenotype and cluster files: that is, the first two columns are family and individual IDs; the third column is expected to be a numeric value (although the file can have more than 3 columns), and only individuals who have a value of 1 for this would be included in any subsequent analysis or file generation procedure. e.g. if myfile.raw were
     F1  I1   2
     F2  I1   7
     F3  I1   1
     F3  I2   1
     F3  I3   3
then only two individuals (F3 I1 and F3 I2) would be included based on this filter for the calculation of allele frequencies. The filter can be any integer numeric value.

As with --pheno and --within, you can specify an offset to read the filter from a column other than the first after the obligatory ID columns. Use the --mfilter option for this. For example, if you have a binary fileset, and so the FAM file contains phenotype as the sixth column, then you could specify
plink --bfile data --filter data.fam 2 --mfilter 4

to select cases only; i.e. cases have the value 2, and this is the 4th variable in the file (i.e. the first two columns are ignored, as these are the ID columns).
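The way --filter and --mfilter pick a column can be sketched in Python. The function name filter_individuals is hypothetical, values are compared as strings (an assumption of the sketch), and the FAM rows in the second example are made up.

```python
def filter_individuals(lines, value, mfilter=1):
    """Return (FID, IID) pairs whose selected variable equals value.

    mfilter counts variables after the two obligatory ID columns,
    so mfilter=1 is the third column of the file.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 + mfilter:
            continue  # assumption: rows that are too short are skipped
        if fields[1 + mfilter] == str(value):
            kept.append((fields[0], fields[1]))
    return kept


raw_lines = ["F1  I1   2", "F2  I1   7", "F3  I1   1", "F3  I2   1", "F3  I3   3"]
print(filter_individuals(raw_lines, 1))

# Hypothetical FAM rows: FID IID PAT MAT SEX PHENOTYPE
fam_lines = ["F1 I1 0 0 1 2", "F2 I1 0 0 2 1", "F3 I1 0 0 1 2"]
print(filter_individuals(fam_lines, 2, mfilter=4))
```

With mfilter=4 the phenotype column of a FAM file is selected, because it is the fourth variable after the two ID columns.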

Because filtering on cases or controls, or on sex, or on position within the family, will be common operations, there are some shortcut options that can be used instead of --filter. These are
     --filter-cases
     --filter-controls
     --filter-males
     --filter-females
     --filter-founders
     --filter-nonfounders
These flags can be used in any circumstances, e.g. to make a file of control founders,
plink --bfile data --filter-controls --filter-founders --make-bed --out newfile

or to analyse only males
plink --bfile data --assoc --filter-males

IMPORTANT Take care when using these with options to merge filesets: the merging occurs before these filters.

Create a SET file based on a list of ranges

Given a list of ranges in the following format (4 columns per row; no header row)
 
     Chromosome
     Start base-pair position 
     End base-pair position
     Set/range/gene name
then the command
plink --file mydata --make-set gene.list

will generate the file
     plink.set
in the standard set file format. The command --make-set-border takes a single integer argument, allowing for a certain kb window before and after the gene to be included, e.g. for 20kb upstream and downstream:
plink --file mydata --make-set gene.list --make-set-border 20
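How a list of ranges and a border turn into a SET file can be sketched in Python. The function name make_sets and the marker and gene rows are hypothetical; inclusive boundaries and 1 kb = 1000 base pairs are assumptions of the sketch, and the output layout (set name, SNPs, END, blank line) follows the SET file example shown earlier on this page.

```python
def make_sets(map_rows, gene_rows, border_kb=0):
    """Build SET-file lines from ranges, optionally widened by a kb border.

    map_rows: (chrom, snp, cm, bp) tuples.
    gene_rows: (chrom, start_bp, end_bp, name) tuples.
    """
    border = border_kb * 1000
    lines = []
    for chrom, start, end, name in gene_rows:
        lines.append(name)
        for snp_chrom, snp, _cm, bp in map_rows:
            if snp_chrom == chrom and start - border <= bp <= end + border:
                lines.append(snp)
        lines.append("END")
        lines.append("")  # blank line between sets
    return lines


# Hypothetical markers and a single hypothetical gene
map_rows = [
    ("1", "snpA", 0, 900),
    ("1", "snpB", 0, 1500),
    ("1", "snpC", 0, 30000),
    ("2", "snpD", 0, 1500),
]
gene_rows = [("1", 1000, 2000, "GENE1")]
print("\n".join(make_sets(map_rows, gene_rows)))
print("\n".join(make_sets(map_rows, gene_rows, border_kb=20)))
```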

 

This document last modified Wednesday, 11-Jun-2008 18:50:22 EDT