|
1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Family-based association
13. Permutation procedures
14. Multimarker tests
15. Conditional haplotype tests
16. Proxy association
17. Full imputation (beta)
18. LD-based results clumping
19. Epistasis
20. Copy Number Variation
21. R-plugins
22. SNP annotation lookup
23. Simulation tools
24. Profile scoring
25. Resources
26. Miscellaneous
27. FAQ & Hints
28. gPLINK
|
|
Analysis of Copy Number Variation data
This page describes some basic file formats, convenience functions and
analysis options for both common and rare copy number variant (CNV)
data. Two quite distinct representations of CNV data are available,
that will typically, although not necessarily, map onto rare versus
common variants.
The first way of representing copy number variants is
as segments. These segments are essentially represented and
analysed in a similar manner to how PLINK handles runs of
homozygosity. These options are specified via the --cnv-list
option. Allelic (i.e. basic SNP) information is not considered here.
In addition, there is another way to represent copy number variation
in specific SNP genotypes, for example,
allowing A, AAB or AABB genotypes (being
copy number 1,3 and 4 respectively) as well as the
canonical AA, AB and BB genotypes. These
formats are specified via the "generic variant" (--gfile)
option.
Here we assume that some other software package such as
the Birdsuite
package has previously been used to make calls for either specific
copy-number variable genotypes or to identify particular genomic
regions in individuals that are deletions or duplications, based on
the raw data. That is, PLINK only offers functions for downstream
analysis of CNV data, not for identifying CNVs in the first place,
i.e. similar to the distinction between SNP genotype calling versus
the subsequent analysis of those calls.
Note that in both of these modes, PLINK bypasses the usual
procedure of reading in SNP genotype data. You are advised not to
specify commands that normally apply to basic SNP data when using
either of these two approaches (i.e. either --cnv-list
or --gfile).
Basic support for segmental CNV data
The basic command for reading a list of segmental CNV variants is
plink --cnv-list mydata.cnv
where the file mydata.cnv has the format
FID Family ID
IID Individual ID
CHR Chromosome
BP1 Start position (base-pair)
BP2 End position (base-pair)
TYPE Type of variant, e.g. 0,1 or 3,4 copies
SCORE Confidence score associated with variant
SITES Number of probes in the variant
Having a header row is optional; if the first line starts with FID
it will be ignored.
Note The SCORE and SITES values
are not used in any direct way, except potentially as variates to filter
segments on, as described below. That is, the values of these do not
fundamentally impact the way analysis is performed by PLINK
itself (they might alter the meaning of the results of course, e.g. if
including low-confidence calls into the analysis!). In other words, if
whatever software was used to generate the CNV calls does not supply
some conceptually similar values, it is okay to simply put dummy codes
(e.g. all 0) in these two fields.
The first few lines of a small example file is shown here:
FID IID CHR BP1 BP2 TYPE SCORE SITE
P1 P1 4 71338469 71459318 1 27 0
P1 P1 5 31250352 32213542 1 34.2 0
P1 P1 7 53205351 53481230 3 18.2 0
P2 P2 11 86736484 87074601 1 22 0
P2 P2 14 47817280 47930190 4 55.1 0
...
Initial creaton of a MAP file
Prior to any analysis, a dummy MAP first needs to be created (this
step only needs to be performed once per CNV file). This
PLINK-generated MAP file has dummy entries that correspond to the
start and stop sites of all segments. This facilitates subsequent
parsing and analysis of CNV data by PLINK.
The --cnv-make-map command is used as follows:
plink --cnv-list mydata.cnv --cnv-make-map
which creates a file
plink.cnv.map
which will look just like a standard MAP file but with
dummy markers:
1 p1-51593 0 51593
1 p1-51598 0 51598
1 p1-51666 0 51666
1 p1-52282 0 52282
1 p1-69061 0 69061
...
where the marker names start with the p prefix and contain
chromosome and base-position information.
As an (unrealistic) example to illustrate how the mapping works,
consider the following, with 3 segments, spanning "positions" 1 to 8,
4 to 12 and 16 to 23. In this case, 6 unqiue map positions would be
created, the three start positions and the three stop positions.
Base 1111111111222222
Position 1234567890123456789012345
Marker # 1..2...3...4...5......6..
| | | | | |
| | |
Segments *------* | |
*-------*
*------*
The new MAP file would then be
1 p1-1 0 1
1 p1-4 0 4
1 p1-8 0 8
1 p1-12 0 12
1 p1-16 0 16
1 p1-23 0 23
Given such a MAP file, these three segments would then be perfectly
mapped to the corresponding markers (p1-1 to p1-8,
p1-4 to p1-12 and p1-16
to p1-23). The created MAP file is then specified in
subsequent segmental CNV analyses (using --cnv-list) with the
standard --map command (or --cfile command).
Loading and analysing a CNV segment list
Once a suitable MAP file has been created, i.e. with dummy markers
that correspond to the position of every start and stop site of all
segments, use the --cnv-list command again to load in the CNV
segment data: in addition to the basic CNV file, a MAP (previously
generated) and FAM file (continaing ID and phenotype information) also
need to be specified. For example.
plink --map plink.cnv.map --fam mydata.fam --cnv-list mydata.cnv
Alternatively, if the MAP, FAM and CNV list files all have the same
root, the command
plink --cfile study1
is equivalent, i.e. it implies the following files exist
study1.cnv
study1.cnv.map
study1.fam
By default either command will simply load in the CNV data and produce
a report in the LOG file, enumerating the number of CN states in the
total dataset and any filtering processes applied. For example,
Reading segment list (CNVs) from [ cnv1.list ]
714 of 2203 mapped as valid segments
1872 mapped to a person, of which 714 passed filters
CopyN Count
0 46
1 339
3 200
4 129
Writing segment summary to [ plink.cnv.indiv ]
This indicates that of 2203 total segments (i.e. should correspond to
number of lines in the cnv1.list file, allowing for any
header) 1872 are mapped to a person in the dataset. In other words,
some of the segments in cnv1.list are for individuals not in
cnv1.fam. These are simply ignored; for example, these
individuals might have been filtered out of the study for other
reasons, e.g. QC based on standard SNP genotypes. Of these, 714 passed
the further set of filters, as described below. Segments can be
filtered based on genomic location, size, quality score/number of
sites and type (duplication or deletion).
It will also be reported in the LOG file if some of the segments do
not map to a marker in the MAP file: if this is because you've
used --chr or similar commands to restrict the portion of the
data examined, you can safely ignore this line; otherwise, it might
mean that the appropriate MAP file wasn't created
(e.g. using --cnv-make-map) for that CNV file.
By default, PLINK will create a file that summarises per
individual events (after any filtering has been applied), in a file
named
plink.cnv.indiv
which has the fields, one row per person, in the same order as the original FAM file:
FID Family ID
IID Individual ID
PHE Phenotype
NSEG Number of segments that individual has
KB Total kilobase distance spanned by segments
KBAVG Average segment size
Note PLINK does not check to see whether segments are
overlapping for the same person or not (e.g. if a deletion and a
duplication event had been specified for the same person in the same
region, or if the same event is listed twice).
Filtering options for segmental CNV data
The segments read in can be filtered in a number of ways. First, one
can specify to read in only either deletions (TYPE is less
than 2) or duplications (TYPE is greater than 2), with the
options,
--cnv-del
and
--cnv-dup
Segments can also be filtered based on a minimum size (kb), score or
number of sites contributing with the following commands:
--cnv-kb 50
--cnv-score 3
--cnv-sites 5
The default minimum segment size is 20kb; none of the other filters
have a default setting that would exclude anything. Also,
corresponding maximum thresholds can be set:
--cnv-max-kb 2000
--cnv-max-score 10
--cnv-max-sites 10
The set of individuals for whom segment data are based on can be
modified with the standard --keep and --remove
options, to exclude people from the analysis.
Filtering the segment list by intersection
with specific genomic co-ordinates
It is possible to extract a specific set of segments that overlap with
one or more regions as specified in a file, e.g. that might contain
the genomic co-ordinates for genes or segmental duplications, etc. Use the command
--cnv-intersect regions.list
in addition with --cnv-list to do so. The file regions.list
should be in the following format: one range per line, whitespace-separated:
CHR Chromosome code (1-22, X, Y, XY, MT, 0)
BP1 Start of range, physical position in base units
BP2 End of range, as above
MISC Any other fields after 3rd ignored
For example,
2 30000000 35000000 REGION1
2 60000000 62000000
X 10000000 20000000 Linkage hotspot
would extract/exclude all segments that at least partially span these
three regions (5Mb and 2Mb on chromosome 2 and 10Mb on chromosome X), ignoring
the comments.
Alternatively, you can use
--cnv-exclude regions.list
to filter out a specific set of segments, i.e. any that overlap with
one or more regions specified in the file regions.list.
The basic intersection or exclusion commands will select all segments
that are at least partially in the specified region. Alternatively,
one can select only segments that have at least X percent
of them in the specified region, for example
--cnv-overlap 0.50
would only include (--cnv-intersect), or exclude (--cnv-exclude),
events that have at least 50% of their length spanned by the region.
There are two other variant forms of the overlap command, which change
the denominator in calculating the proportion overlap:
--cnv-union-overlap 0.50
which defines overlap as the ratio of the intersection and the union,
also
--cnv-region-overlap 0.50
which defines overlap as the ratio of the intersection and the length
of the region (rather than the CNV). For example,
------|-----|--------------------- Region/gene
----------+++++++++++++++--------- CNV (duplication, +)
----------XXX--------------------- Intersection
------XXXXXXX--------------------- Denominator for basic overlap
------XXXXXXXXXXXXXXXXXXX--------- Denominator for union overlap
----------XXXXXXXXXXXXXXX--------- Denominator for region overlap
In this example, if we take each character to represent a standard length
Default 3/7 =
Union overlap = 3 / 19
Region overlap = 3 / 15
This next example illustrates how the overlap statistics can then
subsequently be used to include or exclude specific CNVs,
------|----------|----------------
--------OOOOOOOOOOXXX------------- Selected
------------OOOOOOXXXXXXXXXXXXXX-- Not selected
i.e. the default setting is equivalent to setting
--cnv-overlap 0 (i.e. more than 0% must
overlap).
Generating lists of intersected genes or regions
With the --cnv-intersect (or --cnv-exclude) command,
you can add the flag
--cnv-report-regions
which will create a file
plink.reg
listing only the regions that intersect (or do not intersect) with any of the
CNVs.
Frequency-based inclusion and exclusion criteria
It is also possible to exclude based on the frequency of CNVs at a
particular position. This works by defining regions based on
frequency and then applying the same routines as for the
range-intersection command described above. There are various
commands: to remove segments that map to regions with more than 10
segments
--cnv-freq-exclude-above 10
to remove any segments that only have at most 4 copies
--cnv-freq-exclude-below 5
to remove any segments not in regions with exactly 5 copies
--cnv-freq-exclude-exact 5
and correspondingly
--cnv-freq-include-exact 5
or to keep only segments that are unique to either cases or to controls
--cnv-unique
This third option (--cnv-unique) can be used in conjunction
with either of the first two. As with the earlier range intersection
commands, the definition of intersection can be soft, specified
with the --cnv-overlap option. In most cases here, one might want
to specify --cnv-overlap 0.5 to avoid removing too many segments.
For example, given the following segments, and counts below
Segments *------*
*------------*
*------*
Counts 001112222211111112211111100000
Common regions XXXXX XX
then --cnv-freq-exclude-above 1 would remove
all three segments if --cnv-overlap 0 (the default).
However, if --cnv-overlap were instead set to 0.5, for
example, then only the top segment would be removed (as the other
two segments have more than 50% of their length outside of a region
with more than 1 segment).
NOTE Because multiple CNVs at the same region will
not all exactly overlap, and may be spanned by distinct larger events,
or contain smaller events, in other individuals, then requesting that
you include only CNVs with exactly five copies for example
(--cnv-freq-include-exact 5 does not mean that at all
positions in the genome you will always see either 0 or 5
copies. Rather, the selection process works exactly as specified
above.
Frequency-based inclusion and exclusion criteria
To drop individuals from the file who do not have at least one segment after
filtering, add the flag
--cnv-drop-no-segment
This can make the plink.cnv.indiv summary files easy to browse, for example.
Positional filtering of CNVs
In addition, the standard commands for filtering chromosomal positions
are still applicable, for example
--chr 5
or
--chr 2 --from-mb 20 --to-mb 25
Writing new CNV lists
Finally, given a set of filters applied, you can output as a new CNV file
the filtered subset, with the command
--cnv-write
For example, to make a new file using only deletions over 200kb but not more
than 1000kb, with a quality score of 10 or more, use the command
plink --cfile cnv1
--cnv-del
--cnv-kb 200
--cnv-max-kb 1000
--cnv-score 10
--cnv-write
--out hiqual-large-deletions
which will generate two new files
hiqual-large-deletions.cnv
hiqual-large-deletions.fam
To obtan a corresponding MAP file (so that you can subsequently use the
--cfile hiqual-large-deletions
option, use the command
plink --cnv-list hiqual-large-deletions.cnv --cnv-make-map --out hiqual-large-deletions
(although note that this will overwrite the LOG file generated by the --cnv-write
command).
Association analysis of segmental CNV data
To perform a set of global test of CNV burden in cases versus
controls, add the
--cnv-indiv-perm
option as well as
--mperm 10000
for example (i.e. permutation is required). By default, this reports on four tests,
which use these metrics to calculate burden in both cases and controls
RATE Number of segments
PROP Proportion of sample with one or more segment
TOTKB Total kb length spanned
AVGKB Average segment size
Tests are based (1-sided) on comparing these metrics in cases versus controls,
evaluated by permutation.
If a list of regions is supplied in a file, e.g. gene.list and the command
--cnv-count gene.list
then an extra test is added
GCNT Number of regions/genes spanned by CNVs
These tests respect all the normal filtering commands, with the exception
that --cnv-intersect and --cnv-exclude cannot be used if
--cnv-count is also being used.
The mean metrics in cases and controls are reported in the file
plink.cnv.grp.summary
when the --cnv-indiv-perm command is used. For example: this gives
the number of events (N) in cases and controls, the rate per person,
the proportion of cases/controls to have at least one event, the total distance
spanned per person and the average event size per person.
TEST GRP AFF UNAFF
N ALL 528 362
RATE ALL 0.1557 0.1138
PROP ALL 0.1309 0.1041
TOTKB ALL 290.8 265.4
AVGKB ALL 249.8 243.3
As usual, if the --within command is added and a cluster file
specified, then any permutations are performed within cluster. In this
case, the statistics displayed in the plink.cnv.grp.summary
file are also split out by the strata as well as presented in total
(as indicated by the GRP field).
Association mapping with segmental CNV data
To perform a simple permutation-based test of association of segmental
CNV data for case/control phenotypes, add the option
--mperm 50000
as well as --cnv-list, for perform, for example, 50,000 null permutations
to generate empirical p-values. The results are saved in two files
plink.cnv.summary
plink.cnv.summary.mperm
The first file has the fields
CHR Chromosome code
SNP SNP identifier (dummy SNP, see below)
BP Base-pair position
AFF Number of affected individuals with a segment at this position
UNAFF Number of unaffected individuals
To instead perform a 2-sided test (i.e. allowing that events might be more common in controls) add
the flag
--cnv-2sided
To perform an analysis in which the total number of events within a sliding window is compared between cases and
controls (rather than the number overlapping a single position) add the flag
--cnv-test-window 50
where the parameter is the kb window either side of the test position. As before, the
association results are reported per marker, but now the counts indicate the total number
of segments that overlap any of the 100kb window surrounding the test position, rather than
just the test position itself. Significance is evaluated by permutation as before.
Verbose output and grouping options for segmental CNV data
Finally, there are two option to group or report sets of segments that
span a particular position. In the first case, use the option
--segment-group
which takes all segments in a given region (whole genome unless
otherwise specified) and forms "pools" of overlapping
segments. Several pools of overlapping segments will be created; these
will be listed in order of decreasing size (number of segments); note
that the same segment can appear in multiple pools (e.g. if A overlaps
with C, and B overlaps with C, but A and B do not overlap). The pools
give information as described below.
The more restricted form of this command forms a single pool of all
segments that overlap a particular position, which takes a single
parameter of a marker name; typically these will be the dummy pos*
markers created by the --cnv-make-map command.
--segment-spanning pos119
In this case, for some made-up data, we see from
the plink.cnv.summary file that there are 8 cases and 6
controls with a segment spanning a particular
position, pos586
CHR SNP BP AFF UNAFF
... ... ... ... ...
1 pos586 16631570 8 6
... ... ... ... ...
In this case, there is unsurprisingly no association between segmental
CNVs and disease: for example, the corresponding position in
the plink.cnv.summary.mperm file shows an empirical p-value
of 0.35, but of p=1 if adjusted for multiple testing (EMP2)
CHR SNP STAT EMP1 EMP2
... ... ... ... ...
1 pos586 0.419408 0.351324 1
... ... ... ... ...
Naturally, one would usually be more interested in following up significantly
associated regions of course... Nonetheless, if so desired we can see which
segments (given any the filtering specified) are spanning this position, with
--segment-spanning, which gives the following:
POOL FID IID PHE CHR BP1 BP2 KB TYPE SCORE
S1 PT-2378 PT-2378 2 12 16631570 16751087 119.517 DEL 10.23
S1 PT-268D PT-268D 2 12 16631494 16732162 100.668 DEL 9.3
S1 PT-2M8O PT-2M8O 1 12 16631441 16751082 119.641 DEL 31.23
S1 PT-2FZ9 PT-2FZ9 2 12 16631436 16751045 119.609 DEL 15.2
S1 PT-287D PT-287D 1 12 16616579 17183201 566.622 DUP 200.3
S1 PT-2C91 PT-2C91 2 12 16616579 16751045 134.466 DEL 14.3
S1 PT-28A8 PT-28A8 1 12 16616579 16751045 134.466 DEL 8.3
S1 PT-2FPB PT-2FPB 1 12 16616579 16714372 97.793 DEL 11.1
S1 PT-28IG PT-28IG 2 12 16616579 16708856 92.277 DEL 10.3
S1 PT-2E5N PT-2E5N 2 12 16614664 16715703 101.039 DEL 9.87
S1 PT-2FVL PT-2FVL 1 12 16614664 16751045 136.381 DEL 10.67
S1 PT-2DYE PT-2DYE 2 12 16614664 16715489 100.825 DEL 11.82
S1 PT-264I PT-264I 2 12 16614664 16751045 136.381 DEL 14.2
S1 PT-25WZ PT-25WZ 1 12 16591338 16715767 124.429 DEL 14.7
S1 CON 14 8:6 12 16631570 16708856 77.286 NA NA
S1 UNION 14 8:6 12 16591338 17183201 591.863 NA NA
For CNV data (in contrast to shared segments based on homozygosity or IBD sharing) the
extra fields of TYPE (deletion or duplication) and SCORE (some metric
of quality/confidence of CNV call) are also presented.
Here we see the 14 segments listed, 8 cases and 6
controls. The CON and UNION lines at the end of the
pool give the consensus region (i.e. shared by all segments) and the
total distance spanned by all. The PHE field gives the
phenotype for each individual.
Note that the way in which the dummy markers are selected will effectively mean that every
possibly unique position, in terms of counts of segments, is evaluated. The actual base
pair regions of any dummy marker is itself probably not of interest: given a sigificant (set of)
SNPs, the strategy would be to select any one and generate the corresponding pool to see
what and where the association maps to.
Format for common CNVs (generic variant format)
For common CNVs, that might also have meaningful allelic/SNP
variation, it can be desirable to represent and analyse these not as
segments. The rest of the page considers non-segmental specification
of CNVs: that is, copy-number variable specific genotype calls, such
as A or AAB.
Such data are represented with the generic variant file format,
and read into PLINK with the command:
plink --gfile mydata
where three files are assumed to exist
mydata.fam (describes individuals, as usual)
mydata.map (describes variants, as usual)
mydaya.gvar (new file format)
The .gvar file is in long-format: always with 7 fields, one
row per genotype (note that the reference to the first and second
parents above does not imply that paternal or maternal origin should
be known or is used)
FID Family ID
IID Individual ID (i.e. person should appear in .fam file)
NAME Variant name (should appear in .map file)
ALLELE1 Code for allele from first parent
DOSAGE1 Copy number for first allele
ALLELE2 Code for allele from second parent
DOSAGE2 Copy number for second allele
Some example of using this format to represent different genotypes are shown here:
1 1 var1 A 1 C 1 -> normal het
1 1 var2 A 2 C 1 -> AAC genotype
1 1 var3 0 1 0 1 -> missing individual
1 1 var4 0 0 0 0 -> homozygous deletion
1 1 var5 4 1 7 1 -> e.g. 4/7 genotype
2 1 var5 4 1 8 1 -> e.g. 4/8 genotype
1 1 var6 A 0.95 C 1.05 -> expected allele dosage (e.g. from imputation)
As currently implemented, all the codings below would be equivalent,
i.e. specifying an AA homozygote:
1 1 var7 A 1 A 1
2 1 var7 A 0 A 2
3 1 var7 A 2 A 0
4 1 var7 X 0 A 2
5 1 var7 0 0 A 2
That is, if there is 0 copy number, then it does not matter
what is specified in the allele field; otherwise, a value
of 0 in the allele field is taken to mean a missing
genotype (rather than a null/deleted genotype).
When loading this kind of file, PLINK will parse allelic and
copy number variation; currently by default it looks for integer
dosage calls in this part of the process. There are currently no
functions implemented yet for fractional counts, but the datatype
exists.
Alleles and CNVs are then appropriately counted. PLINK
assesses and records for each variant whether there is allelic and/or
copy number variation, and this influences downstream analysis.
Currently variation is defined as at least one individual varying, but
in the future thresholds will be added (e.g. to treat a site of a CNV
only if, say, 1% of all individuals have a non-canonical copy number).
The basic summary output is also in "long format": in the future this
will be expanded and reformated, e.g. to include specific allelic/CNV
frequncies or counts; stratification by phenotype, etc. This summary file
is called
plink.gvar.summary
and always contains three columns, as illustrated here
NAME FIELD VALUE
var1 CHR 1
var1 BP 1
var1 CNV yes
var1 ALLELIC yes
var1 GCOUNT 1000
var1 B 0.6031
var1 A 0.3969
var1 [2] 0.56
var1 [3] 0.378
var1 [4] 0.062
var1 B/B 30:38
var1 BB/B 66:60
var1 BB/BB 42:20
var1 A/B 142:101
var1 A/BB 161:91
var1 A/A 162:87
The CN counts are always in [x] to distingush from allele codes, if
they are also numeric. e.g. in this example, 37.5% of sample have the
deletion for example. There can be more than 2 CN states for a given
variant.
If the trait is binary, then the counts for copy-number specific
genotypes (e.g. A/BB will be given separately for cases and
controls, separated by a colon.
Association models for combined SNP and common CNV data
PLINK has implemented the following regression models
(logistic or linear) currently applicable to biallelic SNPs residing
within CNPs:
Y ~ b0 + b1.(A+B) + b2.(A-B)
When an association test is performed, extra lines will be appended to the plink.gvar.summary file
var1 B(SNP) -0.05955
var1 P(SNP) 0.09085
var1 B(CNP) 0.09314
var1 P(CNP) 0.3809
var1 B(CNP|SNP) 0.5638
var1 P(CNP|SNP) 0.0006768
var1 B(SNP|CNP) -0.2042
var1 P(SNP|CNP) 0.0002242
var1 P(SNP&CNP) 0.0007413
This section is not finished -- more details will be added online presently.
|
|