PLINK: Whole genome data analysis toolset plink...
Latest PLINK release is v1.03 (10-Jun-2008)

Whole genome association analysis toolset

Introduction | Basics | Download | Reference | Formats | Data management | Summary stats | Filters | Stratification | IBS/IBD | Association | Family-based | Permutation | Haplotypes | Conditional tests | Proxy association | Imputation | Clumping | Epistasis | Copy Number Variation | R-plugins | SNP annotation | Simulation | Profiles | Resources | Misc. | FAQ | gPLINK

1. Introduction

2. Basic information

3. Download and general notes

4. Command reference table

5. Basic usage/data formats 6. Data management

7. Summary stats 8. Inclusion thresholds 9. Population stratification 10. IBS/IBD estimation 11. Association 12. Family-based association 13. Permutation procedures 14. Multimarker tests 15. Conditional haplotype tests 16. Proxy association 17. Full imputation (beta) 18. LD-based results clumping 19. Epistasis 20. Copy Number Variation 21. R-plugins 22. SNP annotation lookup 23. Simulation tools 24. Profile scoring 25. Resources 26. Miscellaneous 27. FAQ & Hints

28. gPLINK
 

R plugin functions

This page describes PLINK's limited support for R-based 'plug-in' functions. In this manner, users can extend the basic functionality of PLINK to better meet their own needs.

R is a powerful, freely-available package for statistical computing. PLINK uses the Rserve package to communicate with R. There are some notes on installing and running the Rserve package below.

The idea is that some analyses, such as survival analysis for example, are already implemented in R but not currently available in PLINK. Having a simple interface for accessing such R functionality, allows one to benefit from both the data-handling features of PLINK (i.e. it being designed specifically to handle large SNP datasets efficiently, in a way that the basic R package is not) as well as the ever-increasing library of statistical tools in R. Also, this should provide an easy way to prototype new methods, etc.

Currently there is only support for SNP-based analyses (i.e. there is a strict model that assumes we return one test statistic value for each and every SNP in the data set). Potentially (if there is interest/specific suggestions) this will be expanded such that other entities can be the unit of analysis, i.e. individuals, or groups of SNPs or individuals, or even the results of PLINK analyses, for graphing or post-processing.

That is, in this first release of R plug-in functionality only the basic, limited set of features is currently available, and there will likely still be some rough edges...

Note Currently, there is only support for R-plugins for Linux-based PLINK distributions.

Basic usage for R plug-ins

Assuming Rserve has been installed and is running locally (see below) and that the file myscript.R contains the R code conforming to the standard for a PLINK plug-in (see here), then the command is simply

plink --file mydata --R myscript.R

which generates a file
     plink.auto.R
Currently this file just contains the raw output for each SNP, giving the value returned from R. Currently only a single value can be passed back: that is, there is no support for extracting more information per SNP. One approach it to explicitly write the desired information to a file from the the R function you define (taking care not to overwrite it, as the function is recalled for every batch of SNPs sent to R).

The assumed use is more to screen a large number of SNPs, then extract ones of interest for more detailed follow-up analysis in R itself. The value you return need not be a p-value however -- so you could always set up functions to return other quantities (e.g. odds ratios). However, they must be single, floating point values (i.e. not strings or R objects).

Note Future developments will allow the plug-in function to be embedded within permutations, gene-based tests, etc, in the same way the basic -assoc command is.

Defining the R plug-in function

PLINK expects a function in the exact form
     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
to be defined in the supplied file. This function is expected to return a numeric vector, with as many elements are there are SNPs. Internally, PLINK will call the Rplink function -- it must be written exactly as shown here. The objects refer to:
     PHENO     vector of phenotypes (n)
     GENO      matrix of genotypes (n x l)
     CLUSTER   vector of cluster membership codes (n)
     COVAR     matrix of covariates (n x c)
where n is the number of individuals (after pruing, filtering, etc) and c is the number of covariates (if any). PLINK generates these objects internally, so the user can assume these exist for when the Rplink() function is called. (In practice, the number of SNPs, l will probably be smaller than the total number of SNPs in the file, as PLINK passes the genotype data into R in batches rather than all in one go).

Genotypes are coded 0, 1, 2 and NA, as per the --recodeA option.

An example R plug-in is shown here -- this is probably the most straightforward template for an R-plugin, in which the apply() function is used to iteratively call the nested function (f1()), once per SNP, in this case. For example, the file myscript.R might contain the following plug-in:
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{
 f1 <-function(x) { 1 - mean(x, na.rm=T) / 2 }
 as.numeric( apply(GENO, 2 , f1) )
}
If you are not familiar with the R language, there are a number of excellent resources available from the main R webpage.

The first line defines a function, f1() that calculates the allele frequency for each SNP (as the genotypes are coded 0,1,2 with 0 as the minor allele). The second line applies this function to each column of the genotype data, using the apply( data , row/col , function ) command.

Another, perhaps more useful, example is implementing survival analysis within PLINK: here we define a function, f1() to return the p-value for the first coefficient; we assume here that a censoring variable was loaded into PLINK as the first covariate (i.e. the R Surv function takes two parameters, the survival time and censoring status). (This is probably not the optimal way to implement this analysis, but is intended purely as an example of what can be done.)
library(survival)
Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
{

f1 <- function(s) {
   m <- coxph( Surv( PHENO , COVAR[,1] ) ~ s )
   summary(m)$coef[5]
 }

apply( GENO , 2 , f1 )
}

Example of debugging an R plug-in

To generate a text file that contains the R commands PLINK would have run (rather than actually trying to run them -- this is useful for debugging purposes), add the following flag

plink --file mydata --R myscript.R --R-debug

To illustrate how to use the debug function to figure out why things might not be working as expected, consider this example, in which we are trying to use this approach to implement a basic logistic regression. The file
     mylog.R
which contains the function
     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
     {
       f1 <- function(s) 
       {
        m <- glm( PHENO ~ s , family="binomial" )
        summary(m)$coef[8]
       }
      apply( GENO , 2 , f1 )
     }
and we have a dataset with three SNPs; the internal PLINK logistic regression command
plink --file mydata --logistic

yields
     CHR  SNP         BP   A1       TEST    NMISS       ODDS         STAT            P
       1 snp0      10000    A        ADD      200      1.256         1.15       0.2501
       1 snp1      10001    B        ADD      200     0.9028      -0.5112       0.6092
       1 snp2      10002    B        ADD      200     0.6085       -2.242      0.02499
Trying to run the R implementation:
plink --file mydata --R mylog.R

we obtain a set of invalid p-values in plink.auto.R
     snp0 -1
     snp1 -1
     snp2 -1
To find out what is happening, we will run the same command with the debug option
plink --file mydata --R mylog.R --R-debug

This writes to the file plink.auto.R the actual commands that would be passed to R, including the data and the function:
     n <- 200
     
     PHENO <- c( 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2,
     1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2,
     2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1,
     2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1,
     2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1,
     1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1,
     1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1,
     1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1,
     2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1 )
     
     COVAR <- matrix( NA , nrow = n , ncol = 0 , byrow = T)
     
     CLUSTER <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ) 
     
     l <- 3 
     
     g <- c( 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 0, 0, 1, 1, 0, 1, 2, 1, 1, 1,
     2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 2, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
     1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 2, 2, 0, 1, 1, 2, 0, 1,
     1, 1, 2, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 2, 1, 1, 1, 2, 1, 1, 0, 1, 1,
     1, 1, 2, 2, 2, 1, 0, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
     0, 2, 2, 1, 0, 0, 0, 1, 0, 1, 2, 2, 2, 1, 0, 0, 0, 2, 1, 2, 2, 1, 1,
     1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 0, 1, 2, 2, 1, 2, 2, 1, 2, 0, 1,
     1, 1, 1, 2, 1, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1,
     2, 2, 1, 2, 1, 1, 2, 2, 0, 0, 1, 2, 1, 0, 0, 1, 1, 2, 1, 2, 2, 2, 0,
     1, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0,
     0, 1, 1, 2, 1, 0, 1, 2, 0, 2, 1, 1, 1, 0, 0, 2, 1, 1, 1, 2, 0, 1, 1,
     1, 1, 1, 2, 1, 2, 0, 1, 1, 0, 1, 0, 2, 1, 0, 2, 1, 2, 2, 0, 0, 0, 1,
     1, 2, 1, 1, 1, 0, 2, 1, 0, 2, 2, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1,
     2, 2, 2, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 0, 0, 2, 0, 2, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 0, 1, 1, 0, 2, 1, 1, 1, 2,
     2, 2, 1, 1, 0, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 0,
     0, 0, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 0, 1, 2, 0, 0, 1, 1, 1,
     0, 1, 1, 1, 0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2, 1, 2, 1, 1, 1, 0, 1, 1,
     1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2, 2,
     2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, 1, 1, 0,
     1, 0, 1, 1, 0, 2, 0, 1, 2, 1, 1, 2, 2, 1, 2, 0, 2, 0, 2, 0, 0, 1, 1,
     1, 1, 2, 1, 0, 2, 0, 1, 1, 0, 1, 2, 2, 2, 1, 0, 1, 2, 1, 2, 1, 2, 0,
     0, 1, 0, 1, 1, 2, 0, 1, 1, 2, 1, 0, 1, 2, 1, 0, 2, 2, 2, 2, 2, 2, 1,
     0, 2, 1, 2, 1, 1, 1, 1, 2, 0, 1, 1, 1, 2, 2, 1, 0, 1, 1, 2, 1, 1, 0,
     1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 1, 2, 1,
     0, 2, 0, 0, 1, 0, 2, 1, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2,
     0, 1, 2, 1 ) 
     
     GENO <- matrix( g , nrow = n ,byrow=T) 
     GENO[GENO == -1 ] <- NA
          
     Rplink <- function(PHENO,GENO,CLUSTER,COVAR)
     {
     f1 <- function(s) {
        m <- glm( PHENO-1 ~ s , family="binomial" )
        summary(m)$coef[8]
      }
     apply( GENO , 2 , f1 )
     }
In R, load this function
source("plink.auto.R")

and then try to run the Rplink function
Rplink(PHENO,GENO,CLUSTER,COVAR)

and you will see the error message
     Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
which indicates that R is expecting a 0/1 coding for this particular function, not the default 1/2 coding used by PLINK for the phenotype/dependent variable. You might therefore want to change the relevant line of the function from
     m <- glm( PHENO ~ s , family="binomial" )
to
     m <- glm( PHENO==2 ~ s , family="binomial" )
for example. Then, repeating the above debug procedure, you would see in R
Rplink(PHENO,GENO,CLUSTER,COVAR)

gives
     [1] 0.25013412 0.60921037 0.02499268
which are the correct p-values. So, now the function is fixed running
plink --file mydata --R mylog.R

would generate the same set of p-values as the PLINK logistic command, in plink.auto.R
     snp0 0.250134
     snp1 0.60921
     snp2 0.0249927

Setting up the Rserve package

First, you must ensure that you have Rserve installed on your system. Normally, this will involve just typing, at the R command prompt (not the system shell prompt)
install.packages("Rserve")

HINT For this to work, R must have been configured with --enable-R-shlib.

When using any R-based PLINK plug-in, Rserve must be running in the background before invoking the PLINK command. To start Rserve, just type at the shell prompt
R CMD Rserve

(note, you may need to change Rserve to the full path of where Rserve was installed), or, within R, type at the R prompt
library(Rserve)
Rserve()

Please see the Rserve documentation for further support.
 
This document last modified Wednesday, 11-Jun-2008 18:25:10 EDT