|
1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Family-based association
13. Permutation procedures
14. Multimarker tests
15. Conditional haplotype tests
16. Proxy association
17. Full imputation (beta)
18. LD-based results clumping
19. Epistasis
20. Copy Number Variation
21. R-plugins
22. SNP annotation lookup
23. Simulation tools
24. Profile scoring
25. Resources
26. Miscellaneous
27. FAQ & Hints
28. gPLINK
|
|
R plugin functions
This page describes PLINK's limited support for R-based 'plug-in'
functions. In this manner, users can extend the basic functionality
of PLINK to better meet their own needs.
R is a powerful, freely-available
package for statistical computing. PLINK uses the Rserve package to communicate
with R. There are some notes on installing and running the Rserve
package below.
The idea is that some analyses, such as survival analysis for example, are
already implemented in R but not currently available in PLINK. Having a
simple interface for accessing such R functionality, allows one to benefit
from both the data-handling features of PLINK (i.e. it
being designed specifically to handle large SNP datasets efficiently, in a way
that the basic R package is not) as well as the ever-increasing library of
statistical tools in R. Also, this should provide an easy way to
prototype new methods, etc.
Currently there is only support for SNP-based analyses (i.e. there is a
strict model that assumes we return one test statistic value for each and
every SNP in the data set). Potentially (if there is interest/specific
suggestions) this will be expanded such that other entities can be the
unit of analysis, i.e. individuals, or groups of SNPs or individuals, or
even the results of PLINK analyses, for graphing or post-processing.
That is, in this first release of R plug-in functionality only the basic,
limited set of features is currently available, and there will likely
still be some rough edges...
Note Currently, there is only support for R-plugins
for Linux-based PLINK distributions.
Basic usage for R plug-ins
Assuming Rserve has been installed and is running locally
(see below) and that the file
myscript.R contains the R code conforming to the standard for a
PLINK plug-in (see here), then the command is simply
plink --file mydata --R myscript.R
which generates a file
plink.auto.R
Currently this file just contains the raw output for each SNP, giving
the value returned from R. Currently only a single value can be passed
back: that is, there is no support for extracting more information per
SNP. One approach it to explicitly write the desired information to a
file from the the R function you define (taking care not to overwrite
it, as the function is recalled for every batch of SNPs sent to R).
The assumed use is more to screen a large number of SNPs, then extract
ones of interest for more detailed follow-up analysis in R itself.
The value you return need not be a p-value however -- so you could
always set up functions to return other quantities (e.g. odds
ratios). However, they must be single, floating point values (i.e.
not strings or R objects).
Note Future developments will allow the plug-in
function to be embedded within permutations, gene-based tests, etc, in
the same way the basic -assoc command is.
Defining the R plug-in function
PLINK expects a function in the exact form
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
to be defined in the supplied file. This function is expected to return
a numeric vector, with as many elements are there are SNPs. Internally,
PLINK will call the Rplink function -- it must be written exactly
as shown here. The objects refer to:
PHENO vector of phenotypes (n)
GENO matrix of genotypes (n x l)
CLUSTER vector of cluster membership codes (n)
COVAR matrix of covariates (n x c)
where n is the number of individuals (after pruing, filtering,
etc) and c is the number of covariates (if any). PLINK
generates these objects internally, so the user can assume these exist
for when the Rplink() function is called. (In practice, the
number of SNPs, l will probably be smaller than the total number
of SNPs in the file, as PLINK passes the genotype data into R in batches
rather than all in one go).
Genotypes are coded 0, 1, 2 and NA, as per the --recodeA
option.
An example R plug-in is shown here -- this is probably the most
straightforward template for an R-plugin, in which the apply()
function is used to iteratively call the nested function (f1()),
once per SNP, in this case. For example, the file myscript.R
might contain the following plug-in:
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <-function(x) { 1 - mean(x, na.rm=T) / 2 }
as.numeric( apply(GENO, 2 , f1) )
}
If you are not familiar with the R language, there are a number of
excellent resources available from the main R webpage.
The first line defines a function, f1() that calculates the
allele frequency for each SNP (as the genotypes are coded 0,1,2 with 0 as
the minor allele). The second line applies this function to each column of
the genotype data, using the apply( data , row/col , function
) command.
Another, perhaps more useful, example is implementing survival analysis within PLINK: here
we define a function, f1() to return the p-value for the first coefficient; we assume
here that a censoring variable was loaded into PLINK as the first covariate (i.e. the R Surv
function takes two parameters, the survival time and censoring status). (This is probably not the optimal
way to implement this analysis, but is intended purely as an example of what can be done.)
library(survival)
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <- function(s) {
m <- coxph( Surv( PHENO , COVAR[,1] ) ~ s )
summary(m)$coef[5]
}
apply( GENO , 2 , f1 )
}
Example of debugging an R plug-in
To generate a text file that contains the R commands PLINK would have
run (rather than actually trying to run them -- this is useful for
debugging purposes), add the following flag
plink --file mydata --R myscript.R --R-debug
To illustrate how to use the debug function to figure out why things
might not be working as expected, consider this example, in which we
are trying to use this approach to implement a basic logistic
regression. The file
mylog.R
which contains the function
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <- function(s)
{
m <- glm( PHENO ~ s , family="binomial" )
summary(m)$coef[8]
}
apply( GENO , 2 , f1 )
}
and we have a dataset with three SNPs; the internal PLINK logistic regression command
plink --file mydata --logistic
yields
CHR SNP BP A1 TEST NMISS ODDS STAT P
1 snp0 10000 A ADD 200 1.256 1.15 0.2501
1 snp1 10001 B ADD 200 0.9028 -0.5112 0.6092
1 snp2 10002 B ADD 200 0.6085 -2.242 0.02499
Trying to run the R implementation:
plink --file mydata --R mylog.R
we obtain a set of invalid p-values in plink.auto.R
snp0 -1
snp1 -1
snp2 -1
To find out what is happening, we will run the same command with the debug option
plink --file mydata --R mylog.R --R-debug
This writes to the file plink.auto.R the actual commands that would be passed to
R, including the data and the function:
n <- 200
PHENO <- c( 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2,
1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2,
2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1,
2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1,
2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1,
1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1,
1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1,
1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1,
2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1 )
COVAR <- matrix( NA , nrow = n , ncol = 0 , byrow = T)
CLUSTER <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
l <- 3
g <- c( 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 0, 0, 1, 1, 0, 1, 2, 1, 1, 1,
2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 2, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 2, 2, 0, 1, 1, 2, 0, 1,
1, 1, 2, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 2, 1, 1, 1, 2, 1, 1, 0, 1, 1,
1, 1, 2, 2, 2, 1, 0, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
0, 2, 2, 1, 0, 0, 0, 1, 0, 1, 2, 2, 2, 1, 0, 0, 0, 2, 1, 2, 2, 1, 1,
1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 0, 1, 2, 2, 1, 2, 2, 1, 2, 0, 1,
1, 1, 1, 2, 1, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1,
2, 2, 1, 2, 1, 1, 2, 2, 0, 0, 1, 2, 1, 0, 0, 1, 1, 2, 1, 2, 2, 2, 0,
1, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0,
0, 1, 1, 2, 1, 0, 1, 2, 0, 2, 1, 1, 1, 0, 0, 2, 1, 1, 1, 2, 0, 1, 1,
1, 1, 1, 2, 1, 2, 0, 1, 1, 0, 1, 0, 2, 1, 0, 2, 1, 2, 2, 0, 0, 0, 1,
1, 2, 1, 1, 1, 0, 2, 1, 0, 2, 2, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1,
2, 2, 2, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 0, 0, 2, 0, 2, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 0, 1, 1, 0, 2, 1, 1, 1, 2,
2, 2, 1, 1, 0, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 0,
0, 0, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 0, 1, 2, 0, 0, 1, 1, 1,
0, 1, 1, 1, 0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2, 1, 2, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2, 2,
2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, 1, 1, 0,
1, 0, 1, 1, 0, 2, 0, 1, 2, 1, 1, 2, 2, 1, 2, 0, 2, 0, 2, 0, 0, 1, 1,
1, 1, 2, 1, 0, 2, 0, 1, 1, 0, 1, 2, 2, 2, 1, 0, 1, 2, 1, 2, 1, 2, 0,
0, 1, 0, 1, 1, 2, 0, 1, 1, 2, 1, 0, 1, 2, 1, 0, 2, 2, 2, 2, 2, 2, 1,
0, 2, 1, 2, 1, 1, 1, 1, 2, 0, 1, 1, 1, 2, 2, 1, 0, 1, 1, 2, 1, 1, 0,
1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 1, 2, 1,
0, 2, 0, 0, 1, 0, 2, 1, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2,
0, 1, 2, 1 )
GENO <- matrix( g , nrow = n ,byrow=T)
GENO[GENO == -1 ] <- NA
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
f1 <- function(s) {
m <- glm( PHENO-1 ~ s , family="binomial" )
summary(m)$coef[8]
}
apply( GENO , 2 , f1 )
}
In R, load this function
source("plink.auto.R")
and then try to run the Rplink function
Rplink(PHENO,GENO,CLUSTER,COVAR)
and you will see the error message
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
which indicates that R is expecting a 0/1 coding for this
particular function, not the default 1/2 coding used by PLINK for the
phenotype/dependent variable. You might therefore want to change the
relevant line of the function from
m <- glm( PHENO ~ s , family="binomial" )
to
m <- glm( PHENO==2 ~ s , family="binomial" )
for example. Then, repeating the above debug procedure, you would see in R
Rplink(PHENO,GENO,CLUSTER,COVAR)
gives
[1] 0.25013412 0.60921037 0.02499268
which are the correct p-values. So, now the function is fixed running
plink --file mydata --R mylog.R
would generate the same set of p-values as the PLINK
logistic command, in plink.auto.R
snp0 0.250134
snp1 0.60921
snp2 0.0249927
Setting up the Rserve package
First, you must ensure that you have Rserve installed on your
system. Normally, this will involve just typing, at the R command
prompt (not the system shell prompt)
install.packages("Rserve")
HINT For this to work, R must have been configured with
--enable-R-shlib.
When using any R-based PLINK plug-in, Rserve must be running in the
background before invoking the PLINK command. To start Rserve, just type
at the shell prompt
R CMD Rserve
(note, you may need to change Rserve to the full path of where
Rserve was installed), or, within R, type at the R prompt
library(Rserve)
Rserve()
Please see the Rserve
documentation for further support.
|
|