Handling and normalizing affymetrix data with bioconductor

Affymetrix microarray data normalization and quality assessment

Denis Puthier and Jacques van Helden

This tutorial is just a brief tour of the language capabilities and is intented to give some clues to begin with the R programming language. For a more detailled overview see R for beginners (E. Paradis)

Contents

  1. Bioconductor
  2. Installing bioconductor
  3. S4 objects
  4. The dataset from Den Boer (2009)
  5. Reading Affymetrix data
  6. Loading phenotypic data
  7. Affy library: graphics
  8. Quality control of raw data
  9. Present/absent calls
  10. Data normalization
  11. The ExpressionSet object
  12. Checking the normalization results
  13. Probe annotations
  14. Writing data onto disk
  15. References

Bioconductor

From Wikipedia:

Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.

Most Bioconductor components are distributed as R packages, which are add-on modules for R. Initially most of the Bioconductor software packages focused on the analysis of single channel Affymetrix and two or more channel cDNA/Oligo microarrays. As the project has matured, the functional scope of the software packages broadened to include the analysis of all types of genomic data, such as SAGE, X-seq data (RNA-Seq, ChIP-Seq, ...), or SNP data.

The broad goals of the projects are to:

What is reproducible research ? How can R contributes to reproductibility ?

Some area covered by the Bioconductor project with some representative packages:

[back to contents]

Installing Bioconductor

To install bioconductor you need to retrieve the biocLite function from BioC web site. We will also check that some annotation packages for affymetrix geneChips are available on your computer. To use the command below you need to start R (type R within a terminal)

## Load the biocLite script from the Web
source("http://bioconductor.org/biocLite.R")

## Check if the affy library has already been installed in R. 
## If not, install the basic Bioconductor packages + affy
if(!require(affy)) {
   biocLite();
   biocLite("affy")
}

## Check if the geneplotter library has already been installed. 
## If not, install it. If yes, the require() function will simlpy load it.
if(!require(geneplotter)) { biocLite("geneplotter") }
[back to contents]

S4 objects

We have seen several classes of objects so far (vector, factor, matrix, data.frame...). In R, one can also create custom classes of objects in order to store and interrogate more complex objects.

Let say one need to store an experiment related to two-color microarrays. Then we have to store values from red and green channel for both foreground and background signal. We could also be interested in storing the symbols of the genes measured, the kind of microarray platform used, a description of the experiment (...). The interesting point is that only one instance from such a class would store all informations related to the experiment making it easier to manipulate and share.

Let us design such a class, we will call it microarrayBatch. We will use the setClass function that will allow to store the class definition.

#microarrayBatch
#"representation" corresponds to all attributes of the object
#"prototype" corresponds to default values

setClass("microarrayBatch",
    representation(
        R="matrix",
        G="matrix",
        Rb="matrix",
        Gb="matrix",
        phenotype="matrix",
        genes="character",
        description="character"),
        prototype=list(
        R=matrix(nr=0,nc=0),
        G=matrix(nr=0,nc=0),
        Rb=matrix(nr=0,nc=0),
        Gb=matrix(nr=0,nc=0),
        phenotype=matrix(nr=0,nc=0),
        description=new("character")
    )
)

Now that the classes is defined, we can create an instance of this class (an object). Inside R, this object is viewed as an S4 object of class "microarrayBatch". As any classical S4 object it contains a set of slots whose names can be accessed with the slotNames function.

myMA <- new("microarrayBatch")
isS4(myMA)
is(myMA)
slotNames(myMA)

The type of object stored in each slot can be accessed using the getClassDef function.

getClassDef("microarrayBatch")

Let us store an artificial data set with two microarrays, each containing 10 genes.

myMA@R<-matrix(rnorm(20),nc=2)
myMA@G<-matrix(rnorm(20),nc=2)
myMA@Rb<-matrix(rnorm(20),nc=2)
myMA@Gb<-matrix(rnorm(20),nc=2)
myMA@description <- "an artificial experiment"
myMA

As every S4 objects, each slot can be accessed using the @ operator:

myMA@R
myMA@G

We can link functions (called methods) to this object. For instance we can define a method getGreen() for the class microarrayBatch. This will retrieve the data stored in slot G (red channel of the two-color microarray).

## Define a function getGreen(), that retrieves the data stored in 
## slot G (green channel).
if(!isGeneric("getGreen"))
        setGeneric("getGreen", function(object)
                standardGeneric("getGreen")
        )

setMethod("getGreen",signature("microarrayBatch"),
	function(object){
	return(object@G)
})
Now let's call this function
getGreen(myMA)

We can check that the function returns, as expected, the content of the slot G of our microarrayBatch object.

getGreen(myMA) == myMA@G

As shown in this example we can easily define new object and methods within R. This S4 formalism is used throughout bioconductor project.

[back to contents]

The dataset from Den Boer (2009)

Here we will use a subset of the GSE13425 experiment which which can be retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemia (ALL) characterized by an abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.

Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Informations related to this platform are available on GEO website under identifier GPL96.

[back to contents]

Reading Affymetrix data

Retrieving data

View solution| Hide solution

Note: we won't perform pre-processing of the full dataset due to memory and time issues.

Loading data into R

View solution| Hide solution

View solution| Hide solution

[back to contents]

Loading phenotypic data

By default the ReadAffy function does not load phenotypic data. They can be load using the read.AnnotatedDataFrame function that will return an object of class AnnotatedDataFrame.

Given that the phenoData slot of affy.s13 (our instance of class AffyBatch) is also an AnnotatedDataFrame assign the result of read.AnnotatedDataFrame to our affy.s13 object.
View solution| Hide solution [back to contents]

Indexing an affyBatch object

The indexing operator '[' (which in fact is a function...), is also re-defined in the source code of the affy library. The code stipulates that the indexing function will always return an AffyBatch object. In the following example when selecting two microarrays, we also select both the expression values and the corresponding phenotypic data

affy.s2<-affy.s13[,1:2]
pData(affy.s2)
[back to contents]

Affy library: graphics

The image function

View solution| Hide solution

The barplot.ProbeSet() function

The probeSet names can be accessed through the geneNames function.

gn <- geneNames(affy.s13)
is(gn)
gn[1:10]

Note: the method geneNames() returns probeset identifiers rather than actual "gene names".

Given one or several probeSet IDs, the probeset method allows one to extract the corresponding probe expression values.

probeset(affy.s13,"221798_x_at") # warning, this returns a list !
View solution| Hide solution

[back to contents]

Quality control of raw data

Descriptive statistics

View solution| Hide solution

AffyRNAdeg

The box plots and histograms generated above indicate the global distribution of intensity values for all probes. A well-known pittfall of Affymetrix technology is the degradation effect: for a given gene, the intensity tend to decrease from the distalmost (3') to the less distal (5') probes. The affy library implements a specific quality control criteria, enabling to plot the changes in mean intensities from 5' to 3' probes (AffyRNAdeg function).

degradation <-  AffyRNAdeg(affy.s13)
?AffyRNAdeg
plotAffyRNAdeg(degradation, panel.first=grid())
[back to contents]

Present/absent calls

It is most generally important to select a set of genes that are above the background in at least a given number of samples. The affymetrix reference method allows one to compute for each probeSet a Absent/Marginal/Present call (A/M/P). However, this method is based on the comparison of signals emitted by PM and MM (that tend to follow the PM signal). This function is implemented in mas5calls function (as it was originally part of the MAS5 normalization algorithm).

## Compute the A/M/P calls
ap <- mas5calls(affy.s13)
is(ap)

## Retrieve the A/M/P values from ap (we don't need the other attributes).
## These values are accessible via the same function exprs() used above
## to get expression values from the affy object.
ap <- exprs(ap)

## Compute a cross-table indicating the number of A, M and P calls for
## each sample.
ap.table <- apply(ap,2,table)
print(ap.table)

## Compute the relative frequencies of A, M and P calls per sample
ap.freq.table <- ap.table / apply(ap.table, 2, sum)
print(ap.freq.table)

## Draw a barplot indicating the relative frequencies of A, M and P calls per sample.
barplot(ap.table/nrow(ap),las=2, cex.names=0.55, main="% of A/M/P calls")

## Let us add a legend besides the barplots
barplot(ap.table/nrow(ap),las=2, cex.names=0.55, main="% of A/M/P calls", legend=rownames(ap.table), xlim=c(1,18))
[back to contents]

Data normalization

Numerous methods have been proposed for affymetrix data normalization (mas5, PLIER, Li-Wong, rma, gcrma,...). These methods rely on elaborate treatment, including inter-sample normalization. A detailed description and comparison of these methods is out of scope for his course. For this practical, we will use the (rma()) function. Note that RMA normalization includes a log2 transformation of the raw data.

eset <- rma(affy.s13)
View solution| Hide solution

[back to contents]

The ExpressionSet object

The ExpressionSet class is central to BioC as lots of packages converge to produce ExpressionSet instances. This simple object is intended to store normalized data from various technologies.

[back to contents]

Checking the normalization results

Relative Log Expression (RLE)

One can use classical diagram to visualize the normalization results. Another solution to check the normalization of an expression matrix is to use the Relative Log Expression (RLE) plot.

## Compare the expression values before and after normalization
x11(width=8, height=10) ## Open a graphical window with specific dimensions
par(mfrow=c(2,1)) ## Share this window between two plots
boxplot(log2(exprs(affy.s13)), pch=".", las=2, cex.axis=0.5, main="Before normalization (log2-transformed) - Probe level")
grid()
boxplot(exprs(eset), pch=".", las=2, cex.axis=0.5, main="RMA-normalized - Probeset level")
grid()

## Plot the density distributions before and after normalization
plotDensity(log2(exprs(affy.s13)), main="Before normalization (log2-transformed) - Probe level", xlim=c(0,16)); grid()
plotDensity(exprs(eset), main="RMA-normalized - Probeset level", xlim=c(0,16)); grid()

par(mfrow=c(1,1)) ## Restore single plot per page

## Compute a median for each row (probe)
m <- rowMedians(exprs(eset))

## Centring: substract the median value of each probe
rle <- sweep(exprs(eset),1, m, "-")

## Plot a box of the probe-wise centred values
x11(width=8, height=8)
boxplot(rle, pch=".", las=2, cex.axis=0.5)
View solution| Hide solution

MA plot diagram

One popular diagram in dna chip analysis is the M versus A plot (MA plot). In this diagram:

Would data be perfectly normalized, M value should not depend on A values. To represent the MA plot we will first compute values for a pseudo-microarray that will be the reference. This pseudo-microarray will be highly representative of the series as it will contain the median expression values for each gene.

ref<-rowMedians(exprs(eset))

  • Calculate A1..n and M1..n for sample 1 versus ref given that for each gene g with intensities Ig,1 et Ig,ref M and A can be computed as follows:
  • A=(Ig,1+Ig,ref)/2
    M=Ig,1−Ig,ref
  • Using the abline function (h argument) add the line M=1, M=-1 and M=0.
  • Using the text function display the names of probesets for which the absolute value of the ratio is above 4.
  • View solution| Hide solution

    [back to contents]

    Probe annotations

    As you have probably noticed, the gene names are neither available in the affyBatch object nor in the eset object. Each affymetrix microarray has its own annotation library that can be used to link probesets to genes Symbol and retrieve additional information about genes. Here we need to load the hgu133a.db library. If it is not previously install, use the biocLite function.

    source("http://www.bioconductor.org/biocLite.R")
    if(!require(hgu133a.db))      biocLite("hgu133a.db")
    

    This library give access to a set of annotation sources that can be listed using the hgu133a function.

    library(hgu133a.db)
    hgu133a()
    

    The following commands can be used to retrieve gene Symbols for the hgu133a geneChip.

    ## Beware: the funciton geneNames returns probe IDs, and NOT gene names
    gn<-geneNames(affy.s13)
    head(gn)
    gn[1]
    
    ##  What biologists call "gene names" is called  "gene symbols" in Bioconductor.
    mget(gn[1:4],env=hgu133aSYMBOL)
    
    
    ## Collect a vector with the correspondences between probeset IDs 
    ## and gene symbols (names).
    probesets.to.genes <- unlist(mget(gn,env=hgu133aSYMBOL))
    head(probesets.to.genes)
    
    ## Let us try to identify the probesets associated to a given gene of interest, for example STAT1
    probesets.to.genes.selected <- grep("STAT1", probesets.to.genes, val=TRUE)
    print(probesets.to.genes.selected)
    
    ## Interpretation: the grep returned a vector where values are the 
    ## matching gene symbols (names), and entry names are the probeset IDs. 
    probesets.selected <- names(probesets.to.genes.selected)
    probesets.selected ## Print the result
    
    ## We notice that STAT1 is represented by several probesets on the 
    ## hgu133a gene chip. Some other genes (the majority) are represented 
    ## by a single probeset.
    names(grep("PAX6", probesets.to.genes, val=TRUE))
    
    ## Count the number of probesets for each gene symbol (name) 
    ## on the hgu133a mircoarray.
    probesets.per.gene <- table(probesets.to.genes)
    head(probesets.per.gene, n=30)
    
    ## Show the most represented gens
    head(sort(probesets.per.gene, decreasing=TRUE), n=30)
    
    ## Plot an histogram with the number of probesets per gene
    hist(probesets.per.gene, breaks=seq(from=0.5, to=13.5, by=1))
    

  • Create a new object, m, that will contain the normalized expression matrix.
  • Change the row names (rownames function) of m so that they will contain both the probe names and gene symbols (use the paste function with "|" as separator).
  • View solution| Hide solution

    [back to contents]

    Writing data onto disk

    R objects can be saved using the save function (then subsequently load using the load function). For a tab-delimited file output one may use the write.table function.

    ## Save the R object "eset" as a whole
    save(eset,file="eset.Rdata")
    
    ## Save the expression matrix in a tab-delimited file, that can
    ## further be opened with a spreadsheet (Excel, OpenOffice calc, ...).
    write.table(expr.matrix,"GSE3254_norm.txt",sep="\t",col.names=NA, quote=F)
    
    [back to contents]

    Additional exercices

    Using boxplot and densities, compare the effect on raw pm data of quantile normalization vs median centering, median-centering and scaling, and median-centering and scaling with mad.

    References

      Den Boer et al. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol (2009) vol. 10 (2) pp. 125-34.