Denis Puthier and Jacques van Helden
This tutorial is just a brief tour of the language capabilities and is intented to give some clues to begin with the R programming language. For a more detailled overview see R for beginners (E. Paradis)
From Wikipedia:
Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.
Most Bioconductor components are distributed as R packages, which are add-on modules for R. Initially most of the Bioconductor software packages focused on the analysis of single channel Affymetrix and two or more channel cDNA/Oligo microarrays. As the project has matured, the functional scope of the software packages broadened to include the analysis of all types of genomic data, such as SAGE, X-seq data (RNA-Seq, ChIP-Seq, ...), or SNP data.
The broad goals of the projects are to:
- Provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
- Facilitate the inclusion of biological metadata in the analysis of genomic data, e.g. annotation data from UCSC or GO database.
- Provide a common software platform that enables the rapid development and deployment of plug-able, scalable, and interoperable software.
- Further scientific understanding by producing high-quality documentation and reproducible research.
Some area covered by the Bioconductor project with some representative packages:
To install bioconductor you need to retrieve the biocLite function from BioC web site. We will also check that some annotation packages for affymetrix geneChips are available on your computer. To use the command below you need to start R (type R within a terminal)
## Load the biocLite script from the Web source("http://bioconductor.org/biocLite.R") ## Check if the affy library has already been installed in R. ## If not, install the basic Bioconductor packages + affy if(!require(affy)) { biocLite(); biocLite("affy") } ## Check if the geneplotter library has already been installed. ## If not, install it. If yes, the require() function will simlpy load it. if(!require(geneplotter)) { biocLite("geneplotter") }[back to contents]
We have seen several classes of objects so far (vector, factor, matrix, data.frame...). In R, one can also create custom classes of objects in order to store and interrogate more complex objects.
Let say one need to store an experiment related to two-color microarrays. Then we have to store values from red and green channel for both foreground and background signal. We could also be interested in storing the symbols of the genes measured, the kind of microarray platform used, a description of the experiment (...). The interesting point is that only one instance from such a class would store all informations related to the experiment making it easier to manipulate and share.
Let us design such a class, we will call it microarrayBatch. We will use the setClass function that will allow to store the class definition.
#microarrayBatch #"representation" corresponds to all attributes of the object #"prototype" corresponds to default values setClass("microarrayBatch", representation( R="matrix", G="matrix", Rb="matrix", Gb="matrix", phenotype="matrix", genes="character", description="character"), prototype=list( R=matrix(nr=0,nc=0), G=matrix(nr=0,nc=0), Rb=matrix(nr=0,nc=0), Gb=matrix(nr=0,nc=0), phenotype=matrix(nr=0,nc=0), description=new("character") ) )
Now that the classes is defined, we can create an instance of this class (an object). Inside R, this object is viewed as an S4 object of class "microarrayBatch". As any classical S4 object it contains a set of slots whose names can be accessed with the slotNames function.
myMA <- new("microarrayBatch") isS4(myMA) is(myMA) slotNames(myMA)
The type of object stored in each slot can be accessed using the getClassDef function.
getClassDef("microarrayBatch")
Let us store an artificial data set with two microarrays, each containing 10 genes.
myMA@R<-matrix(rnorm(20),nc=2) myMA@G<-matrix(rnorm(20),nc=2) myMA@Rb<-matrix(rnorm(20),nc=2) myMA@Gb<-matrix(rnorm(20),nc=2) myMA@description <- "an artificial experiment" myMA
As every S4 objects, each slot can be accessed using the @ operator:
myMA@R myMA@G
We can link functions (called methods) to this object. For instance we can define a method getGreen() for the class microarrayBatch. This will retrieve the data stored in slot G (red channel of the two-color microarray).
## Define a function getGreen(), that retrieves the data stored in ## slot G (green channel). if(!isGeneric("getGreen")) setGeneric("getGreen", function(object) standardGeneric("getGreen") ) setMethod("getGreen",signature("microarrayBatch"), function(object){ return(object@G) })Now let's call this function
getGreen(myMA)
We can check that the function returns, as expected, the content of the slot G of our microarrayBatch object.
getGreen(myMA) == myMA@G
As shown in this example we can easily define new object and methods within R. This S4 formalism is used throughout bioconductor project.
[back to contents]Here we will use a subset of the GSE13425 experiment which which can be retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemia (ALL) characterized by an abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.
Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Informations related to this platform are available on GEO website under identifier GPL96.
Note: we won't perform pre-processing of the full dataset due to memory and time issues.
library(affy)
By default the ReadAffy function does not load phenotypic data. They can be load using the read.AnnotatedDataFrame function that will return an object of class AnnotatedDataFrame.
The indexing operator '[' (which in fact is a function...), is also re-defined in the source code of the affy library. The code stipulates that the indexing function will always return an AffyBatch object. In the following example when selecting two microarrays, we also select both the expression values and the corresponding phenotypic data
affy.s2<-affy.s13[,1:2] pData(affy.s2)[back to contents]
The probeSet names can be accessed through the geneNames function.
gn <- geneNames(affy.s13) is(gn) gn[1:10]
Note: the method geneNames() returns probeset identifiers rather than actual "gene names".
Given one or several probeSet IDs, the probeset method allows one to extract the corresponding probe expression values.
probeset(affy.s13,"221798_x_at") # warning, this returns a list !
The box plots and histograms generated above indicate the global distribution of intensity values for all probes. A well-known pittfall of Affymetrix technology is the degradation effect: for a given gene, the intensity tend to decrease from the distalmost (3') to the less distal (5') probes. The affy library implements a specific quality control criteria, enabling to plot the changes in mean intensities from 5' to 3' probes (AffyRNAdeg function).
degradation <- AffyRNAdeg(affy.s13) ?AffyRNAdeg plotAffyRNAdeg(degradation, panel.first=grid())[back to contents]
It is most generally important to select a set of genes that are above the background in at least a given number of samples. The affymetrix reference method allows one to compute for each probeSet a Absent/Marginal/Present call (A/M/P). However, this method is based on the comparison of signals emitted by PM and MM (that tend to follow the PM signal). This function is implemented in mas5calls function (as it was originally part of the MAS5 normalization algorithm).
## Compute the A/M/P calls ap <- mas5calls(affy.s13) is(ap) ## Retrieve the A/M/P values from ap (we don't need the other attributes). ## These values are accessible via the same function exprs() used above ## to get expression values from the affy object. ap <- exprs(ap) ## Compute a cross-table indicating the number of A, M and P calls for ## each sample. ap.table <- apply(ap,2,table) print(ap.table) ## Compute the relative frequencies of A, M and P calls per sample ap.freq.table <- ap.table / apply(ap.table, 2, sum) print(ap.freq.table) ## Draw a barplot indicating the relative frequencies of A, M and P calls per sample. barplot(ap.table/nrow(ap),las=2, cex.names=0.55, main="% of A/M/P calls") ## Let us add a legend besides the barplots barplot(ap.table/nrow(ap),las=2, cex.names=0.55, main="% of A/M/P calls", legend=rownames(ap.table), xlim=c(1,18))[back to contents]
Numerous methods have been proposed for affymetrix data normalization (mas5, PLIER, Li-Wong, rma, gcrma,...). These methods rely on elaborate treatment, including inter-sample normalization. A detailed description and comparison of these methods is out of scope for his course. For this practical, we will use the (rma()) function. Note that RMA normalization includes a log2 transformation of the raw data.
eset <- rma(affy.s13)
The ExpressionSet class is central to BioC as lots of packages converge to produce ExpressionSet instances. This simple object is intended to store normalized data from various technologies.
[back to contents]One can use classical diagram to visualize the normalization results. Another solution to check the normalization of an expression matrix is to use the Relative Log Expression (RLE) plot.
## Compare the expression values before and after normalization x11(width=8, height=10) ## Open a graphical window with specific dimensions par(mfrow=c(2,1)) ## Share this window between two plots boxplot(log2(exprs(affy.s13)), pch=".", las=2, cex.axis=0.5, main="Before normalization (log2-transformed) - Probe level") grid() boxplot(exprs(eset), pch=".", las=2, cex.axis=0.5, main="RMA-normalized - Probeset level") grid() ## Plot the density distributions before and after normalization plotDensity(log2(exprs(affy.s13)), main="Before normalization (log2-transformed) - Probe level", xlim=c(0,16)); grid() plotDensity(exprs(eset), main="RMA-normalized - Probeset level", xlim=c(0,16)); grid() par(mfrow=c(1,1)) ## Restore single plot per page ## Compute a median for each row (probe) m <- rowMedians(exprs(eset)) ## Centring: substract the median value of each probe rle <- sweep(exprs(eset),1, m, "-") ## Plot a box of the probe-wise centred values x11(width=8, height=8) boxplot(rle, pch=".", las=2, cex.axis=0.5)
One popular diagram in dna chip analysis is the M versus A plot (MA plot). In this diagram:
M is the log intensity ratio calculated for any gene.
A is the average log intensity which corresponds to an estimate of the gene expression level.
Would data be perfectly normalized, M value should not depend on A values. To represent the MA plot we will first compute values for a pseudo-microarray that will be the reference. This pseudo-microarray will be highly representative of the series as it will contain the median expression values for each gene.
ref<-rowMedians(exprs(eset))
As you have probably noticed, the gene names are neither available in the affyBatch object nor in the eset object. Each affymetrix microarray has its own annotation library that can be used to link probesets to genes Symbol and retrieve additional information about genes. Here we need to load the hgu133a.db library. If it is not previously install, use the biocLite function.
source("http://www.bioconductor.org/biocLite.R") if(!require(hgu133a.db)) biocLite("hgu133a.db")
This library give access to a set of annotation sources that can be listed using the hgu133a function.
library(hgu133a.db) hgu133a()
The following commands can be used to retrieve gene Symbols for the hgu133a geneChip.
## Beware: the funciton geneNames returns probe IDs, and NOT gene names gn<-geneNames(affy.s13) head(gn) gn[1] ## What biologists call "gene names" is called "gene symbols" in Bioconductor. mget(gn[1:4],env=hgu133aSYMBOL) ## Collect a vector with the correspondences between probeset IDs ## and gene symbols (names). probesets.to.genes <- unlist(mget(gn,env=hgu133aSYMBOL)) head(probesets.to.genes) ## Let us try to identify the probesets associated to a given gene of interest, for example STAT1 probesets.to.genes.selected <- grep("STAT1", probesets.to.genes, val=TRUE) print(probesets.to.genes.selected) ## Interpretation: the grep returned a vector where values are the ## matching gene symbols (names), and entry names are the probeset IDs. probesets.selected <- names(probesets.to.genes.selected) probesets.selected ## Print the result ## We notice that STAT1 is represented by several probesets on the ## hgu133a gene chip. Some other genes (the majority) are represented ## by a single probeset. names(grep("PAX6", probesets.to.genes, val=TRUE)) ## Count the number of probesets for each gene symbol (name) ## on the hgu133a mircoarray. probesets.per.gene <- table(probesets.to.genes) head(probesets.per.gene, n=30) ## Show the most represented gens head(sort(probesets.per.gene, decreasing=TRUE), n=30) ## Plot an histogram with the number of probesets per gene hist(probesets.per.gene, breaks=seq(from=0.5, to=13.5, by=1))
R objects can be saved using the save function (then subsequently load using the load function). For a tab-delimited file output one may use the write.table function.
## Save the R object "eset" as a whole save(eset,file="eset.Rdata") ## Save the expression matrix in a tab-delimited file, that can ## further be opened with a spreadsheet (Excel, OpenOffice calc, ...). write.table(expr.matrix,"GSE3254_norm.txt",sep="\t",col.names=NA, quote=F)[back to contents]
Using boxplot and densities, compare the effect on raw pm data of quantile normalization vs median centering, median-centering and scaling, and median-centering and scaling with mad.