PLINK: Whole genome data analysis toolset plink...
Latest PLINK release is v1.03 (10-Jun-2008)

Whole genome association analysis toolset

Introduction | Basics | Download | Reference | Formats | Data management | Summary stats | Filters | Stratification | IBS/IBD | Association | Family-based | Permutation | Haplotypes | Conditional tests | Proxy association | Imputation | Clumping | Epistasis | Copy Number Variation | R-plugins | SNP annotation | Simulation | Profiles | Resources | Misc. | FAQ | gPLINK

1. Introduction

2. Basic information

3. Download and general notes

4. Command reference table

5. Basic usage/data formats 6. Data management

7. Summary stats 8. Inclusion thresholds 9. Population stratification 10. IBS/IBD estimation 11. Association 12. Family-based association 13. Permutation procedures 14. Multimarker tests 15. Conditional haplotype tests 16. Proxy association 17. Full imputation (beta) 18. LD-based results clumping 19. Epistasis 20. Copy Number Variation 21. R-plugins 22. SNP annotation lookup 23. Simulation tools 24. Profile scoring 25. Resources 26. Miscellaneous 27. FAQ & Hints

28. gPLINK
 

Analysis of Copy Number Variation data

This page describes some basic file formats, convenience functions and analysis options for both common and rare copy number variant (CNV) data. Two quite distinct representations of CNV data are available, that will typically, although not necessarily, map onto rare versus common variants.

The first way of representing copy number variants is as segments. These segments are essentially represented and analysed in a similar manner to how PLINK handles runs of homozygosity. These options are specified via the --cnv-list option. Allelic (i.e. basic SNP) information is not considered here.

In addition, there is another way to represent copy number variation in specific SNP genotypes, for example, allowing A, AAB or AABB genotypes (being copy number 1,3 and 4 respectively) as well as the canonical AA, AB and BB genotypes. These formats are specified via the "generic variant" (--gfile) option.

Here we assume that some other software package such as the Birdsuite package has previously been used to make calls for either specific copy-number variable genotypes or to identify particular genomic regions in individuals that are deletions or duplications, based on the raw data. That is, PLINK only offers functions for downstream analysis of CNV data, not for identifying CNVs in the first place, i.e. similar to the distinction between SNP genotype calling versus the subsequent analysis of those calls.

Note that in both of these modes, PLINK bypasses the usual procedure of reading in SNP genotype data. You are advised not to specify commands that normally apply to basic SNP data when using either of these two approaches (i.e. either --cnv-list or --gfile).

Basic support for segmental CNV data

The basic command for reading a list of segmental CNV variants is

plink --cnv-list mydata.cnv

where the file mydata.cnv has the format
     FID     Family ID
     IID     Individual ID
     CHR     Chromosome
     BP1     Start position (base-pair)
     BP2     End position (base-pair)
     TYPE    Type of variant, e.g. 0,1 or 3,4 copies
     SCORE   Confidence score associated with variant 
     SITES   Number of probes in the variant
Having a header row is optional; if the first line starts with FID it will be ignored.

Note The SCORE and SITES values are not used in any direct way, except potentially as variates to filter segments on, as described below. That is, the values of these do not fundamentally impact the way analysis is performed by PLINK itself (they might alter the meaning of the results of course, e.g. if including low-confidence calls into the analysis!). In other words, if whatever software was used to generate the CNV calls does not supply some conceptually similar values, it is okay to simply put dummy codes (e.g. all 0) in these two fields.

The first few lines of a small example file is shown here:
     FID    IID   CHR       BP1       BP2  TYPE   SCORE  SITE
     P1     P1      4  71338469  71459318     1      27     0
     P1     P1      5  31250352  32213542     1    34.2     0
     P1     P1      7  53205351  53481230     3    18.2     0
     P2     P2     11  86736484  87074601     1      22     0  
     P2     P2     14  47817280  47930190     4    55.1     0
     ...
Initial creaton of a MAP file
Prior to any analysis, a dummy MAP first needs to be created (this step only needs to be performed once per CNV file). This PLINK-generated MAP file has dummy entries that correspond to the start and stop sites of all segments. This facilitates subsequent parsing and analysis of CNV data by PLINK. The --cnv-make-map command is used as follows:
plink --cnv-list mydata.cnv --cnv-make-map

which creates a file
     plink.cnv.map
which will look just like a standard MAP file but with dummy markers:
     1       p1-51593   0    51593
     1       p1-51598   0    51598
     1       p1-51666   0    51666
     1       p1-52282   0    52282
     1       p1-69061   0    69061
     ...
where the marker names start with the p prefix and contain chromosome and base-position information.

As an (unrealistic) example to illustrate how the mapping works, consider the following, with 3 segments, spanning "positions" 1 to 8, 4 to 12 and 16 to 23. In this case, 6 unqiue map positions would be created, the three start positions and the three stop positions.
        Base                1111111111222222
        Position   1234567890123456789012345  

        Marker #   1..2...3...4...5......6..
                   |  |   |   |   |      |
                              |   |      |
        Segments   *------*       |      |
                      *-------*             
                                  *------*
The new MAP file would then be
     1 p1-1   0  1
     1 p1-4   0  4
     1 p1-8   0  8
     1 p1-12  0  12
     1 p1-16  0  16
     1 p1-23  0  23
Given such a MAP file, these three segments would then be perfectly mapped to the corresponding markers (p1-1 to p1-8, p1-4 to p1-12 and p1-16 to p1-23). The created MAP file is then specified in subsequent segmental CNV analyses (using --cnv-list) with the standard --map command (or --cfile command).
Loading and analysing a CNV segment list

Once a suitable MAP file has been created, i.e. with dummy markers that correspond to the position of every start and stop site of all segments, use the --cnv-list command again to load in the CNV segment data: in addition to the basic CNV file, a MAP (previously generated) and FAM file (continaing ID and phenotype information) also need to be specified. For example.
plink --map plink.cnv.map --fam mydata.fam --cnv-list mydata.cnv

Alternatively, if the MAP, FAM and CNV list files all have the same root, the command
plink --cfile study1

is equivalent, i.e. it implies the following files exist
     study1.cnv
     study1.cnv.map
     study1.fam
By default either command will simply load in the CNV data and produce a report in the LOG file, enumerating the number of CN states in the total dataset and any filtering processes applied. For example,
     Reading segment list (CNVs) from [ cnv1.list ]
     714 of 2203 mapped as valid segments
     1872 mapped to a person, of which 714 passed filters
        CopyN  Count
        0      46
        1      339
        3      200
        4      129
     Writing segment summary to [ plink.cnv.indiv ]
This indicates that of 2203 total segments (i.e. should correspond to number of lines in the cnv1.list file, allowing for any header) 1872 are mapped to a person in the dataset. In other words, some of the segments in cnv1.list are for individuals not in cnv1.fam. These are simply ignored; for example, these individuals might have been filtered out of the study for other reasons, e.g. QC based on standard SNP genotypes. Of these, 714 passed the further set of filters, as described below. Segments can be filtered based on genomic location, size, quality score/number of sites and type (duplication or deletion).

It will also be reported in the LOG file if some of the segments do not map to a marker in the MAP file: if this is because you've used --chr or similar commands to restrict the portion of the data examined, you can safely ignore this line; otherwise, it might mean that the appropriate MAP file wasn't created (e.g. using --cnv-make-map) for that CNV file.

By default, PLINK will create a file that summarises per individual events (after any filtering has been applied), in a file named
     plink.cnv.indiv
which has the fields, one row per person, in the same order as the original FAM file:
     FID     Family ID
     IID     Individual ID
     PHE     Phenotype 
     NSEG    Number of segments that individual has
     KB      Total kilobase distance spanned by segments 
     KBAVG   Average segment size
Note PLINK does not check to see whether segments are overlapping for the same person or not (e.g. if a deletion and a duplication event had been specified for the same person in the same region, or if the same event is listed twice).

Filtering options for segmental CNV data

The segments read in can be filtered in a number of ways. First, one can specify to read in only either deletions (TYPE is less than 2) or duplications (TYPE is greater than 2), with the options,
   --cnv-del
and
   --cnv-dup
Segments can also be filtered based on a minimum size (kb), score or number of sites contributing with the following commands:
   --cnv-kb 50 
   --cnv-score 3
   --cnv-sites 5
The default minimum segment size is 20kb; none of the other filters have a default setting that would exclude anything. Also, corresponding maximum thresholds can be set:
   --cnv-max-kb 2000
   --cnv-max-score 10
   --cnv-max-sites 10

The set of individuals for whom segment data are based on can be modified with the standard --keep and --remove options, to exclude people from the analysis.
Filtering the segment list by intersection with specific genomic co-ordinates
It is possible to extract a specific set of segments that overlap with one or more regions as specified in a file, e.g. that might contain the genomic co-ordinates for genes or segmental duplications, etc. Use the command
     --cnv-intersect regions.list
in addition with --cnv-list to do so. The file regions.list should be in the following format: one range per line, whitespace-separated:
     CHR     Chromosome code (1-22, X, Y, XY, MT, 0)
     BP1     Start of range, physical position in base units
     BP2     End of range, as above
     MISC    Any other fields after 3rd ignored
For example,
     2 30000000 35000000  REGION1
     2 60000000 62000000
     X 10000000 20000000  Linkage hotspot
would extract/exclude all segments that at least partially span these three regions (5Mb and 2Mb on chromosome 2 and 10Mb on chromosome X), ignoring the comments.

Alternatively, you can use
     --cnv-exclude regions.list
to filter out a specific set of segments, i.e. any that overlap with one or more regions specified in the file regions.list.

The basic intersection or exclusion commands will select all segments that are at least partially in the specified region. Alternatively, one can select only segments that have at least X percent of them in the specified region, for example
     --cnv-overlap 0.50
would only include (--cnv-intersect), or exclude (--cnv-exclude), events that have at least 50% of their length spanned by the region.

There are two other variant forms of the overlap command, which change the denominator in calculating the proportion overlap:
     --cnv-union-overlap 0.50 
which defines overlap as the ratio of the intersection and the union, also
     --cnv-region-overlap 0.50
which defines overlap as the ratio of the intersection and the length of the region (rather than the CNV). For example,

     ------|-----|---------------------   Region/gene
     ----------+++++++++++++++---------   CNV (duplication, +)

     ----------XXX---------------------   Intersection

     ------XXXXXXX---------------------   Denominator for basic overlap
     ------XXXXXXXXXXXXXXXXXXX---------   Denominator for union overlap
     ----------XXXXXXXXXXXXXXX---------   Denominator for region overlap
In this example, if we take each character to represent a standard length
     Default 3/7 = 
     Union overlap = 3 / 19
     Region overlap = 3 / 15
This next example illustrates how the overlap statistics can then subsequently be used to include or exclude specific CNVs,
     ------|----------|----------------     
         
     --------OOOOOOOOOOXXX-------------   Selected

     ------------OOOOOOXXXXXXXXXXXXXX--   Not selected
i.e. the default setting is equivalent to setting --cnv-overlap 0 (i.e. more than 0% must overlap).

Generating lists of intersected genes or regions
With the --cnv-intersect (or --cnv-exclude) command, you can add the flag
     --cnv-report-regions
which will create a file
     plink.reg
listing only the regions that intersect (or do not intersect) with any of the CNVs.
Frequency-based inclusion and exclusion criteria

It is also possible to exclude based on the frequency of CNVs at a particular position. This works by defining regions based on frequency and then applying the same routines as for the range-intersection command described above. There are various commands: to remove segments that map to regions with more than 10 segments
     --cnv-freq-exclude-above 10
to remove any segments that only have at most 4 copies
     --cnv-freq-exclude-below 5
to remove any segments not in regions with exactly 5 copies
     --cnv-freq-exclude-exact 5
and correspondingly
     --cnv-freq-include-exact 5
or to keep only segments that are unique to either cases or to controls
     --cnv-unique
This third option (--cnv-unique) can be used in conjunction with either of the first two. As with the earlier range intersection commands, the definition of intersection can be soft, specified with the --cnv-overlap option. In most cases here, one might want to specify --cnv-overlap 0.5 to avoid removing too many segments.

For example, given the following segments, and counts below
        Segments   *------*       
                      *------------*             
                                  *------*
          Counts 001112222211111112211111100000
  Common regions      XXXXX       XX
then --cnv-freq-exclude-above 1 would remove all three segments if --cnv-overlap 0 (the default). However, if --cnv-overlap were instead set to 0.5, for example, then only the top segment would be removed (as the other two segments have more than 50% of their length outside of a region with more than 1 segment).

NOTE Because multiple CNVs at the same region will not all exactly overlap, and may be spanned by distinct larger events, or contain smaller events, in other individuals, then requesting that you include only CNVs with exactly five copies for example (--cnv-freq-include-exact 5 does not mean that at all positions in the genome you will always see either 0 or 5 copies. Rather, the selection process works exactly as specified above.

Frequency-based inclusion and exclusion criteria

To drop individuals from the file who do not have at least one segment after filtering, add the flag
     --cnv-drop-no-segment
This can make the plink.cnv.indiv summary files easy to browse, for example.
Positional filtering of CNVs

In addition, the standard commands for filtering chromosomal positions are still applicable, for example
     --chr 5
or
     --chr 2 --from-mb 20 --to-mb 25
Writing new CNV lists

Finally, given a set of filters applied, you can output as a new CNV file the filtered subset, with the command
     --cnv-write
For example, to make a new file using only deletions over 200kb but not more than 1000kb, with a quality score of 10 or more, use the command
 plink --cfile cnv1 
       --cnv-del 
       --cnv-kb 200 
       --cnv-max-kb 1000 
       --cnv-score 10
       --cnv-write 
       --out hiqual-large-deletions 

which will generate two new files
     hiqual-large-deletions.cnv
     hiqual-large-deletions.fam
To obtan a corresponding MAP file (so that you can subsequently use the
     --cfile hiqual-large-deletions
option, use the command
plink --cnv-list hiqual-large-deletions.cnv --cnv-make-map --out hiqual-large-deletions

(although note that this will overwrite the LOG file generated by the --cnv-write command).

Association analysis of segmental CNV data

To perform a set of global test of CNV burden in cases versus controls, add the
     --cnv-indiv-perm
option as well as
     --mperm 10000
for example (i.e. permutation is required). By default, this reports on four tests, which use these metrics to calculate burden in both cases and controls
     RATE    Number of segments
     PROP    Proportion of sample with one or more segment
     TOTKB   Total kb length spanned
     AVGKB   Average segment size
Tests are based (1-sided) on comparing these metrics in cases versus controls, evaluated by permutation.

If a list of regions is supplied in a file, e.g. gene.list and the command
     --cnv-count gene.list
then an extra test is added
     GCNT   Number of regions/genes spanned by CNVs
These tests respect all the normal filtering commands, with the exception that --cnv-intersect and --cnv-exclude cannot be used if --cnv-count is also being used.

The mean metrics in cases and controls are reported in the file
     plink.cnv.grp.summary
when the --cnv-indiv-perm command is used. For example: this gives the number of events (N) in cases and controls, the rate per person, the proportion of cases/controls to have at least one event, the total distance spanned per person and the average event size per person.
        TEST      GRP      AFF    UNAFF
           N      ALL      528      362
        RATE      ALL   0.1557   0.1138
        PROP      ALL   0.1309   0.1041
       TOTKB      ALL    290.8    265.4
       AVGKB      ALL    249.8    243.3

As usual, if the --within command is added and a cluster file specified, then any permutations are performed within cluster. In this case, the statistics displayed in the plink.cnv.grp.summary file are also split out by the strata as well as presented in total (as indicated by the GRP field).

Association mapping with segmental CNV data

To perform a simple permutation-based test of association of segmental CNV data for case/control phenotypes, add the option
     --mperm 50000
as well as --cnv-list, for perform, for example, 50,000 null permutations to generate empirical p-values. The results are saved in two files
     plink.cnv.summary
     plink.cnv.summary.mperm
The first file has the fields
     CHR    Chromosome code
     SNP    SNP identifier (dummy SNP, see below)
     BP     Base-pair position
     AFF    Number of affected individuals with a segment at this position
     UNAFF  Number of unaffected individuals 
To instead perform a 2-sided test (i.e. allowing that events might be more common in controls) add the flag
     --cnv-2sided

To perform an analysis in which the total number of events within a sliding window is compared between cases and controls (rather than the number overlapping a single position) add the flag
     --cnv-test-window 50
where the parameter is the kb window either side of the test position. As before, the association results are reported per marker, but now the counts indicate the total number of segments that overlap any of the 100kb window surrounding the test position, rather than just the test position itself. Significance is evaluated by permutation as before.

Verbose output and grouping options for segmental CNV data

Finally, there are two option to group or report sets of segments that span a particular position. In the first case, use the option
     --segment-group
which takes all segments in a given region (whole genome unless otherwise specified) and forms "pools" of overlapping segments. Several pools of overlapping segments will be created; these will be listed in order of decreasing size (number of segments); note that the same segment can appear in multiple pools (e.g. if A overlaps with C, and B overlaps with C, but A and B do not overlap). The pools give information as described below.

The more restricted form of this command forms a single pool of all segments that overlap a particular position, which takes a single parameter of a marker name; typically these will be the dummy pos* markers created by the --cnv-make-map command.
     --segment-spanning pos119
In this case, for some made-up data, we see from the plink.cnv.summary file that there are 8 cases and 6 controls with a segment spanning a particular position, pos586
      CHR        SNP           BP      AFF    UNAFF
      ...        ...          ...      ...      ...
        1     pos586     16631570        8        6
      ...        ...          ...      ...      ...
In this case, there is unsurprisingly no association between segmental CNVs and disease: for example, the corresponding position in the plink.cnv.summary.mperm file shows an empirical p-value of 0.35, but of p=1 if adjusted for multiple testing (EMP2)
      CHR     SNP         STAT         EMP1         EMP2
      ...     ...          ...          ...          ...
        1  pos586     0.419408     0.351324            1
      ...     ...          ...          ...          ...
Naturally, one would usually be more interested in following up significantly associated regions of course... Nonetheless, if so desired we can see which segments (given any the filtering specified) are spanning this position, with --segment-spanning, which gives the following:
      POOL       FID       IID    PHE  CHR        BP1       BP2       KB  TYPE  SCORE
        S1   PT-2378   PT-2378      2   12   16631570  16751087  119.517   DEL  10.23
        S1   PT-268D   PT-268D      2   12   16631494  16732162  100.668   DEL    9.3
        S1   PT-2M8O   PT-2M8O      1   12   16631441  16751082  119.641   DEL  31.23
        S1   PT-2FZ9   PT-2FZ9      2   12   16631436  16751045  119.609   DEL   15.2
        S1   PT-287D   PT-287D      1   12   16616579  17183201  566.622   DUP  200.3
        S1   PT-2C91   PT-2C91      2   12   16616579  16751045  134.466   DEL   14.3
        S1   PT-28A8   PT-28A8      1   12   16616579  16751045  134.466   DEL    8.3
        S1   PT-2FPB   PT-2FPB      1   12   16616579  16714372   97.793   DEL   11.1
        S1   PT-28IG   PT-28IG      2   12   16616579  16708856   92.277   DEL   10.3
        S1   PT-2E5N   PT-2E5N      2   12   16614664  16715703  101.039   DEL   9.87
        S1   PT-2FVL   PT-2FVL      1   12   16614664  16751045  136.381   DEL  10.67
        S1   PT-2DYE   PT-2DYE      2   12   16614664  16715489  100.825   DEL  11.82
        S1   PT-264I   PT-264I      2   12   16614664  16751045  136.381   DEL   14.2
        S1   PT-25WZ   PT-25WZ      1   12   16591338  16715767  124.429   DEL   14.7
        S1       CON        14    8:6   12   16631570  16708856   77.286    NA     NA
        S1     UNION        14    8:6   12   16591338  17183201  591.863    NA     NA
For CNV data (in contrast to shared segments based on homozygosity or IBD sharing) the extra fields of TYPE (deletion or duplication) and SCORE (some metric of quality/confidence of CNV call) are also presented.

Here we see the 14 segments listed, 8 cases and 6 controls. The CON and UNION lines at the end of the pool give the consensus region (i.e. shared by all segments) and the total distance spanned by all. The PHE field gives the phenotype for each individual.

Note that the way in which the dummy markers are selected will effectively mean that every possibly unique position, in terms of counts of segments, is evaluated. The actual base pair regions of any dummy marker is itself probably not of interest: given a sigificant (set of) SNPs, the strategy would be to select any one and generate the corresponding pool to see what and where the association maps to.

Format for common CNVs (generic variant format)

For common CNVs, that might also have meaningful allelic/SNP variation, it can be desirable to represent and analyse these not as segments. The rest of the page considers non-segmental specification of CNVs: that is, copy-number variable specific genotype calls, such as A or AAB.

Such data are represented with the generic variant file format, and read into PLINK with the command:
plink --gfile mydata

where three files are assumed to exist
     mydata.fam     (describes individuals, as usual)
     mydata.map     (describes variants, as usual)
     mydaya.gvar    (new file format)     
The .gvar file is in long-format: always with 7 fields, one row per genotype (note that the reference to the first and second parents above does not imply that paternal or maternal origin should be known or is used)
  FID       Family ID
  IID       Individual ID  (i.e. person should appear in .fam file)
  NAME      Variant name (should appear in .map file)
  ALLELE1   Code for allele from first parent
  DOSAGE1   Copy number for first allele
  ALLELE2   Code for allele from second parent
  DOSAGE2   Copy number for second allele
Some example of using this format to represent different genotypes are shown here:
     1 1  var1  A 1  C 1    -> normal het
     1 1  var2  A 2  C 1    -> AAC genotype
     1 1  var3  0 1  0 1    -> missing individual
     1 1  var4  0 0  0 0    -> homozygous deletion

     1 1  var5  4 1  7 1    -> e.g. 4/7 genotype
     2 1  var5  4 1  8 1    -> e.g. 4/8 genotype

     1 1  var6  A 0.95 C 1.05  -> expected allele dosage (e.g. from imputation)
As currently implemented, all the codings below would be equivalent, i.e. specifying an AA homozygote:
     1 1  var7  A 1 A 1
     2 1  var7  A 0 A 2
     3 1  var7  A 2 A 0
     4 1  var7  X 0 A 2
     5 1  var7  0 0 A 2
That is, if there is 0 copy number, then it does not matter what is specified in the allele field; otherwise, a value of 0 in the allele field is taken to mean a missing genotype (rather than a null/deleted genotype).

When loading this kind of file, PLINK will parse allelic and copy number variation; currently by default it looks for integer dosage calls in this part of the process. There are currently no functions implemented yet for fractional counts, but the datatype exists.

Alleles and CNVs are then appropriately counted. PLINK assesses and records for each variant whether there is allelic and/or copy number variation, and this influences downstream analysis. Currently variation is defined as at least one individual varying, but in the future thresholds will be added (e.g. to treat a site of a CNV only if, say, 1% of all individuals have a non-canonical copy number).

The basic summary output is also in "long format": in the future this will be expanded and reformated, e.g. to include specific allelic/CNV frequncies or counts; stratification by phenotype, etc. This summary file is called
     plink.gvar.summary
and always contains three columns, as illustrated here
            NAME        FIELD        VALUE
            var1          CHR            1
            var1           BP            1
            var1          CNV          yes
            var1      ALLELIC          yes
            var1       GCOUNT         1000
            var1            B       0.6031
            var1            A       0.3969
            var1          [2]         0.56
            var1          [3]        0.378
            var1          [4]        0.062
            var1          B/B        30:38
            var1         BB/B        66:60
            var1        BB/BB        42:20
            var1          A/B      142:101
            var1         A/BB       161:91
            var1          A/A       162:87
The CN counts are always in [x] to distingush from allele codes, if they are also numeric. e.g. in this example, 37.5% of sample have the deletion for example. There can be more than 2 CN states for a given variant.

If the trait is binary, then the counts for copy-number specific genotypes (e.g. A/BB will be given separately for cases and controls, separated by a colon.

Association models for combined SNP and common CNV data

PLINK has implemented the following regression models (logistic or linear) currently applicable to biallelic SNPs residing within CNPs:
  Y ~ b0 + b1.(A+B) + b2.(A-B)
When an association test is performed, extra lines will be appended to the plink.gvar.summary file
            var1       B(SNP)     -0.05955
            var1       P(SNP)      0.09085
            var1       B(CNP)      0.09314
            var1       P(CNP)       0.3809
            var1   B(CNP|SNP)       0.5638
            var1   P(CNP|SNP)    0.0006768
            var1   B(SNP|CNP)      -0.2042
            var1   P(SNP|CNP)    0.0002242
            var1   P(SNP&CNP)    0.0007413
This section is not finished -- more details will be added online presently.
 
This document last modified Thursday, 24-Jul-2008 17:11:20 EDT OA