Here we will use data from the microarray series GSE13425, which which was retrieved from the Gene Expression Omnibus (GEO) database. In this experiment, the authors applied a supervised classification method to define a transcriptomic signature, in order to classify samples from acute lymphoblastic leukemia (ALL). Lymphoblastic leukemia is characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.
Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Information related to this platform are available on GEO website under identifier GPL96.
We will now load three data tables into R using the read.table function. The function allows us to directly read the tables from the web server. We will successively load 3 files providing complementary information.
## Get some help about the read.table fonction
#?read.table
## Define the URL of the example data sets
url.course <- "http://pedagogix-tagc.univ-mrs.fr/courses/ASG1"
url.base <- file.path(url.course, "data/marrays/")
## Load expression values
expr.file <- file.path(url.base, "GSE13425_Norm_Whole.txt")
expr.matrix <- read.table(expr.file,sep="\t", head=T, row=1)
## Load phenotypic data
pheno <- read.table(file.path(url.base, 'phenoData_GSE13425.tab'),
sep='\t', head=TRUE, row=1)
## Load Absent/Marginal/Present (AMP) calls
amp <- read.table(file.path(url.base, "GSE13425_AMP_Whole.txt"),
sep="\t", head=T, row=1)
We will now define a directory to store the results on our computer.
## Define the output directory. You can adapt this to your local configuration.
dir.output <- "~/ASG1_practicals/GSE13425"
## Create the output directory (if it does not exist yet)
dir.create(dir.output, showWarnings=FALSE, recurs=TRUE)
## Change directory to dir.output
setwd(dir.output)
Welch’s test is a variant of the classical Student test, whose goal is to test the equality between two means.
\[H_0: m_{g,1} = m_{g,2}\]
where \(m_{g,1}\) and \(m_{g,2}\) represent the respective mean expression values for a given gene \(g\) in two populations (for example, all existing patients suffering from T-ALL versus all patients suffering from pre-B ALL). Of course, we do not dispose of measurements for all the patients suffering from these two types of ALL in the world (the population). We only dispose of two sets of samples, covering 36 (T-ALL) and 44 (pre-B ALL) patients, respectively. On the basis of these samples, we will estimate how likely it is that genes \(g\) is generally expressed at similar levels in the populations from which the samples were drawn.
The essential difference between Student and Welch is that the proper Student test relies on the assumption that the two sampled populations have the same variance, whereas Welch’s test is designed to treat populations with unequal variances.
When detecting differentially expressed genes, we cannot assume equal variance. Indeed, a typical case would be that a gene of interest is expressed at very low level in the first group, and high level in the second group. The inter-individual fluctuations in expression values are expected to be larger when the gene is expressed at a high level than when it is poorly expressed. It is thus generally recommended to use Welch rather than Student test when analyzing microarray expression profiles.
BEWARE: Student and Welch tests assume data normality. Affymetrix microarray intensities are far from the normal distribution, even after log transformation. However, t-test is robust to non-normality if there is a sufficient number of samples per group. In the subsequent exercise, we will apply Welch test to detect genes differentially expressed between cancer types represented by ~40 samples each. We are thus in reasonably good conditions to run a Welch test. Nevertehless, in a next section we will also apply a non-parametric test (Wilcoxon), which does not rely on an assumption of normality.
Welch’s t-test defines the t statistic by the following formula:
\[ t=\frac{\bar{x_1} - \bar{x_2}}{\sqrt{\frac {s^2_1}{n_1} + \frac{s^2_2}{n_2}}}\]
Where:
The t.test() function can be used to calculate this score (and additional informations such as p.value). This function returns an S3 object whose slots can be listed using the names() function and accessed using the $ operator (such as with lists in R).
In order to get an intuition of the \(t\) statistics, let us create artificial datasets and compute the associated \(t\) value. In the following example \(x\) and \(y\) can be viewed as the expression values for gene \(g\) in two different classes of cancer.
Assuming that each group contains 4 patients, we will generate 4 random numbers following a normal distribution, to simulate the groups 1 and 2. We deliberately set the means to the same values (to fall under the null hypothesis), but we generate them with different standard deviations.
x <- rnorm(n=4, mean=6, s=1)
y <- rnorm(n=4, mean=6, s=2)
We would like to define genes that discriminate between “hyperdiploid” tumors and tumors of all the other subtypes represented by at least 10 samples in Den Boer dataset.
One possibility would be to iterate over all probesets, and to successively run the R method t.test() on each one. This would however be quite inefficient, and the results would not be very easy to handle, since it would be a list of objects of the class t.test.
Instead, we will use a custom function that runs Student or Welch test in parallel on all the elements of a data table.
First we need to check if the qvalue library is installed (we will give more information about q-values in the next sessions).
### Running t-tests on each row of a data matrix
## We must first check if the q-value library from Bioconductor has
## been installed (if not, will be installed here)
if (!require("qvalue")) {
source("http://bioconductor.org/biocLite.R")
biocLite("qvalue")
}
## Loading required package: qvalue
The we will load a custom script written by J. van Helden (Note: the utilities for this course will soon be converted to an R package, in order to facilitate their installation and use).
## Load a custom the library for multiple t tests
url.stats4bioinfo <- "http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics"
source(file.path(url.stats4bioinfo, 'R-files/config.R'))
## [1] "Data repository http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/data"
## [1] "R scripts source http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/R-files"
## [1] "Results will be saved to /Users/jvanheld/course_stats_bioinfo/results"
## [1] "Figures will be saved to /Users/jvanheld/course_stats_bioinfo/figures"
source(file.path(url.stats4bioinfo, 'R-files/util/util_student_test_multi.R'))
For the sake of curiosity, you can also have a look at the R code.
We will select genes differentially expressed between one subtype of interest (for example hyperdiploid) and all the other types of ALL represented by at least 10 samples. For the rest of the tutorial, we will refer to these subtypes as “Other”.
## Classes to keep
print("Selecting cancer subtypes with >= 10 samples")
## [1] "Selecting cancer subtypes with >= 10 samples"
class.freq <- table(pheno$Sample.title)
classes.to.keep <- names(class.freq[class.freq>10])
subtype.of.interest <- "hyperdiploid"
classes.other <- setdiff(classes.to.keep, subtype.of.interest)
print(classes.to.keep)
## [1] "hyperdiploid" "pre-B ALL" "T-ALL" "TEL-AML1"
## Define a Boolean vector indicating which samples belong
## to the two selected subtypes.
samples.to.keep <- pheno$Sample.title %in% classes.to.keep
sum(samples.to.keep)
## [1] 167
## Extact a subset of expression matrix with only the two selected sets
expr.matrix.kept <- expr.matrix[,samples.to.keep]
## Export the table with the selected samples, in order to open it with TMEV
setwd(dir.output)
file <- paste(sep="", "GSE13425_Norm_",subtype.of.interest,"_vs_Other_ge10samples.txt")
write.table(expr.matrix.kept,
file,
col.names=NA,quote=F,sep="\t")
## Define a vector with the sample types for the two selected cancer subtype
sample.group <- as.vector(pheno[samples.to.keep, "Sample.title"])
names(sample.group) <- names(expr.matrix[samples.to.keep])
sample.group[sample.group != subtype.of.interest] = "Other"
## Count the number of samples per group (ALL subtype)
table(sample.group)
## sample.group
## hyperdiploid Other
## 44 123
## Export sample groups, which will be used in other practicals
## (e.g. supervised classification)
setwd(dir.output)
file <- paste(sep="",
"GSE13425_Norm_",subtype.of.interest,"_vs_Other_sample_groups.txt")
write.table(as.data.frame(sample.group),
file,
col.names=FALSE,
row.names=TRUE,
quote=F,sep="\t")
We will now apply the Welch test on each gene of the Den Boer dataset, in order to select genes differentially expressed between the subtype of interest (“hyperdiploid”) and the other subtypes represented by at least 10 genes.
## Run the Welch test on each probesets of the DenBoer expression matrix.
## We will store the result in a table called "DEG" for "Differentially expressed genes", which will later be completed by other tests (e.g. Wilcoxon).
denboer.deg <- t.test.multi(expr.matrix.kept, sample.group, volcano.plot=FALSE)
## [1] "Mon Mar 9 14:39:43 2015 - Multiple t-test started"
## [1] "Mon Mar 9 14:39:45 2015 - Multiple t-test done"
## Inspect the result table
dim(denboer.deg)
## [1] 22283 17
names(denboer.deg)
## [1] "mean.Other" "mean.hyperdiploid" "means.diff"
## [4] "var.est.Other" "var.est.hyperdiploid" "sd.est.Other"
## [7] "sd.est.hyperdiploid" "st.err.diff" "t.obs"
## [10] "df.welch" "P.value" "E.value"
## [13] "sig" "fwer" "q.value"
## [16] "fdr" "rank"
## Select genes with a stringent threshold on E-value
eval.threshold <- 0.05
significant.probesets <- denboer.deg$E.value <= eval.threshold
table(significant.probesets) ## Count the number of significant probesets
## significant.probesets
## FALSE TRUE
## 20921 1362
We will compare the mean expression value between hyperdiploids and the other selected subtypes, and highlight the significant genes.
## Plot the gene-wise means
plot(denboer.deg[, c("mean.Other", "mean.hyperdiploid")], col="darkgray")
grid()
abline(a=0,b=1, col="black") # Draw the diagonal line
## Highlight significant genes
lines(denboer.deg[significant.probesets,
c("mean.Other", "mean.hyperdiploid")],
type='p', col="red")
legend("topleft",col=c("red", "darkgray"),
legend=c("Welch significant","non-significant"),
pch=1)
The apply function can be used to apply a given function to a matrix or data.frame. This function has tree required arguments:
args(apply)
## function (X, MARGIN, FUN, ...)
## NULL
In the line below, we define a function called return.t(), to run the Welch test on a single probeset of the microarray table.
## Define a function to return the p-value of a Welch test
return.t <- function(x,y){ t.test(x[y==subtype.of.interest], x[y=="Other"], alternative="two.sided", var.equal=FALSE, paired=FALSE)$p.value}
The volcano plot is a classical representation of differential expression analysis. In this diagram, the x axis represents the log ratio and the y axis the result of a statistic expressed as \(-log10(p-value)\).
Probably the most popular method for differential expression analysis of microarray data is “Significance Analysis of Microarrays” (SAM). SAM will compute for each gene a score \(d\) which is close to the \(t\) statistics of the welch’s test. However, it won’t require any assumption about the data distribution. In order to compute the expected distribution of \(d\) under the null hypothesis SAM will performe a set of permutations on the class labels, and compute each time a simulated sets of results for the \(d\) statistics. The observed and simulated results will be used to compute FDR values. Sam is implemented in several R libraries (e.g: siggenes). Here we will use a more interactive program called “MultiExperiment Viewer (MeV).
If you are working on a Linux system, the commands below can be used to download and compule MeV on the Linux console.
mkdir -p ~/bin cd ~/bin wget "http://freefr.dl.sourceforge.net/project/mev-tm4/mev-tm4/MeV%204.8.1/MeV_4_8_1_r2727_linux.tar.gz" tar xvfz MeV_4_8_1_r2727_linux.tar.gz cd - echo -e "\nalias tmev=\"cd ~/bin/MeV_4_8_1/; sh -c 'nohup ./tmev.sh &'; cd -\"" >> ~/.bashrc source ~/.bashrc tmev
The classical processing pipeline defined by Affymetrix associates a qualitative tag to each probeset, with three possible values:
An absent call means that no significant signal was detected with the associated probeset. However, this “absence” might either indicate that the gene is not expressed in this particular sample, or that the gene is undetectable (irrespective of its expression level) due to some technical problem with this specific probeset.
It became a classical practice to filter out the genes called “absent” on an important fraction of the samples in a given series, by implicitly assuming that their recurrent absence reveals a technical problem rather than a biologically relevant effect (repression of the gene).
In the following exercise, we will apply an A/M/P filter to discard the genes declared absent in at least 30% of the genes.
## Select a subset of the A/M/P matrix
amp.sub <- amp[,samples.to.keep]
dim(amp)
## [1] 22283 190
dim(amp.sub)
## [1] 22283 167
## Count the number of "Present" calls per probeset
isPresent <- amp.sub == "P"
present.per.probeset <- rowSums(isPresent)
hist(present.per.probeset, breaks=0:ncol(amp.sub))
## "Present filter": Select probeset declared present in at least 25% of the samples
retained <- rowSums(isPresent) >= 0.25*ncol(expr.matrix.kept)
table(retained) ## Count number of retained and rejected probesets
## retained
## FALSE TRUE
## 14013 8270
## Select a subset of the matrix with the retained probesets only
expr.matrix.kept <- expr.matrix.kept[retained, ]
## Check the number of probes (rows) and samples (columns) of the
## selected expression table
print(dim(expr.matrix.kept))
## [1] 8270 167
## Export the table with the selected samples, in order to open it with TMEV
setwd(dir.output)
file <- paste(sep="", "GSE13425_Norm_",subtype.of.interest,"_vs_Other_present30pc.txt")
write.table(expr.matrix.kept,
file,
col.names=NA,quote=F,sep="\t")