Abbreviation | Description |
---|---|
ALL | Acute Lymphoblastic Leukaemia |
A/M/P | Absent / Marginal / Present calls |
The goal of this practical is to understand the principles of Student t-test by exploring step-by-step the different components of the \(t_{obs}\) statistics. To this purpose, we will apply a test to detect differentially expressed genes (DEG) between two cancer types in microarray transcriptomic profiles.
As study case, we will analyse microarray transcriptomic profiles from (M. L. Den Boer et al. 2009).
In this experiment, the authors characterized the transcriptome of 190 patients suffering from Acute Lymphoblastic Leukaemia (ALL). Lymphoblastic leukaemia is characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation. The abnormal proliferation can result from various mutations, some of which are recurring in various patients. The DenBoer dataset includes information about the genetic problem associated with each leukaemia.
In their original article, Den Boer and co-workers defined a transcriptomic signature, i.e. a selection of genes whose transcription profile is specifically affected in one or more subtypes of leukaemia. The expression profiles of these genes can then be used to classify new samples from acute lymphoblastic leukaemia (ALL), and predict the type of associated genetic perturbation.
An advantage of this dataset is that we dispose of relatively large samples (>30 individuals per group), relative to many other publications of transcriptome analysis.
We will need to use some Bioconductor libraries in the subsequent steps (e.g the qvalue library). Bioconductor is a project offering open source libraries for R environnement dedicated to analysis of high-throughput genomic data. If Bioconductor is not already installed on your computer, copy and paste the following instructions in R:
## Install specifically the qvalue library
if (!require("qvalue")) {
## Install base libraries for Bioconductor
## If you are asked to create a personnal library answer "YES".
source("https://bioconductor.org/biocLite.R")
biocLite()
biocLite("qvalue")
}
The full dataset is available in the the Gene Expression Omnibus (GEO) database, as a series with ID GSE13425. However we do not need to download the raw dataset. Instead, we will provide on this web site the normalized matrix expression profiles in the form of tab-delimited files.
Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Information related to this platform are available on GEO website under identifier GPL96. More information related to Affymetrix microarrays normalization can be obtained in the section Normalization of Affymetrix DNA chip (slides) and Handling and normalizing affymetrix data with bioconductor (practical)).
Start R.
Have a look at the description of the read.table() function: in the R console, type help(read.table) and read the help page.
In the next sections, will load three data tables into R using the R function read.table(). This function allows us to directly read tab-delimited text files located either on our own computer or on a remote server (using an http URL). We will successively load 3 files providing complementary information.
We will define and create a directory to store the results on our computer.
## Define the output directory. You can adapt this to your local configuration.
dir.output <- "~/STAT2_practicals/GSE13425"
## Create the output directory (if it does not exist yet)
dir.create(dir.output, showWarnings=FALSE, recurs=TRUE)
## Change directory to dir.output
setwd(dir.output)
## Print a message with output directory
message("Output directory for this practical: ", dir.output)
## Check it this directory already contains some files.
list.files(dir.output)
[1] "data"
We will load the data from this course supporting Web site. For this we need to define two variables.
## Define the URL of the example data sets
url.course <- "http://pedagogix-tagc.univ-mrs.fr/courses/ASG1"
microarray.data.url <- file.path(url.course, "data", "marrays")
## Check the results
print(paste("Course URL: ", url.course))
[1] "Course URL: http://pedagogix-tagc.univ-mrs.fr/courses/ASG1"
print(paste("URL for the microarray data: ", microarray.data.url))
[1] "URL for the microarray data: http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays"
We will now load the expression matrix, which is stored in a file named GSE13425_Norm_Whole.txt in the microarray data folder of the server. This file contains a table with one row per probeset (~gene) and one column per sample. Each sample was taken from a patient suffering from a particular type of Acute Lymphoblastic Leukemia (ALL).
We will once use the following functions:
## Get some help about the read.table fonction
# help(read.table)
## URL of the expression table on the course server
expr.url <- file.path(microarray.data.url, "GSE13425_Norm_Whole.txt") ## Build the full URL of the matrix
## Load expression values
dir.data <- file.path(dir.output, "data")
dir.create(dir.data, recursive = TRUE, showWarnings = FALSE)
expr.file <- file.path(dir.data, "DenBoer_2009_GSE13425_Norm_Whole.txt")
if (file.exists(expr.file)) {
message("Expression file already downloaded: ", expr.file)
} else {
message("Downloading matrix table from ", expr.url)
download.file(url = expr.url, destfile = expr.file)
message("Local expression file: ", expr.file)
}
## Load the expression table in memory
message("Loading expression table from ", expr.file)
expr.matrix <- read.table(expr.file,sep="\t", head=T, row=1)
After having loaded the expression matrix with the R function read.table(), we should have a variable of type data.frame containing one row per gene and one column per sample. Before going further, we will perform some basic tests to check the content of this data frame.
## Check the type of the variable "expr.matrix" (should return data.frame)
class(expr.matrix)
[1] "data.frame"
## Get the dimensions of the expresion table
dim(expr.matrix)
[1] 22283 190
## Number of probesets (rows)
nrow(expr.matrix)
[1] 22283
## Number of samples (columns)
ncol(expr.matrix)
[1] 190
## Display the first 10 rows and 5 columns of the matrix
print(expr.matrix[1:10,1:5])
GSM338666 GSM338667 GSM338668 GSM338669 GSM338670
DDR1|1007_s_at 4.38 4.38 4.23 3.98 5.12
RFC2|1053_at 3.44 3.33 3.73 3.58 3.40
HSPA6|117_at 2.51 2.62 2.37 2.43 2.49
PAX8|121_at 6.24 6.21 6.10 6.09 6.53
GUCA1A|1255_g_at 2.29 2.23 2.17 2.26 2.27
UBA7|1294_at 5.57 5.83 5.46 5.20 6.83
THRA|1316_at 3.09 3.34 3.10 3.20 3.16
PTPN21|1320_at 2.35 2.46 2.30 2.34 2.21
CCL5|1405_i_at 2.45 2.52 2.48 2.44 2.22
CYP2E1|1431_at 2.33 2.42 2.18 2.15 2.27
Since we are generating this report with an R markdown document with RStudio, we can also use the kable() function of the R package knitr to generate a nicely formatted table.
## Generate a nicely formatted table.
library(knitr) ## Load the R library knitr
kable(expr.matrix[1:10,1:5]) ## Generate a nicely formatted table
GSM338666 | GSM338667 | GSM338668 | GSM338669 | GSM338670 | |
---|---|---|---|---|---|
DDR1|1007_s_at | 4.38 | 4.38 | 4.23 | 3.98 | 5.12 |
RFC2|1053_at | 3.44 | 3.33 | 3.73 | 3.58 | 3.40 |
HSPA6|117_at | 2.51 | 2.62 | 2.37 | 2.43 | 2.49 |
PAX8|121_at | 6.24 | 6.21 | 6.10 | 6.09 | 6.53 |
GUCA1A|1255_g_at | 2.29 | 2.23 | 2.17 | 2.26 | 2.27 |
UBA7|1294_at | 5.57 | 5.83 | 5.46 | 5.20 | 6.83 |
THRA|1316_at | 3.09 | 3.34 | 3.10 | 3.20 | 3.16 |
PTPN21|1320_at | 2.35 | 2.46 | 2.30 | 2.34 | 2.21 |
CCL5|1405_i_at | 2.45 | 2.52 | 2.48 | 2.44 | 2.22 |
CYP2E1|1431_at | 2.33 | 2.42 | 2.18 | 2.15 | 2.27 |
Before going any further, we will draw an histogram with the distribution of all the expression values for all genes and all samples.
The R function hist() enables to draw an histogram, give a vector of values. To get more information about this function, you can type help(hist) in the R console.
Since the expression matrix is stored in and R object of class data.frame, we will first need to convert it to a vector. For this, we can use the R function unlist().
## Histogram of all expression values
hist(unlist(expr.matrix), breaks=100,
main="Den Boer expression values",
xlab="RMA-normalized expression value (log2-transformed)",
ylab="Number of measures",
)
The global distribution of RMA-normalized expression values across for all samples and all genes ranges from 1.35 to 13.3. We should be aware that RMA normalization includes a log2 transformation.
This has important consequences
The actual intensities of the measured expression are obtained with \(I = 2^x\). For this expression matrix, intensities thus range from \(2^{min(x)} = 2.55\) to \(2^{max(x)} = 10086\).
Since the gene-wise group means were computed on log2-transformed values, they correspond to the geometric mean rather than the arithmetic mean.
\[\bar{x}_{i,g} = \frac{1}{n_g} \sum_{j \in g} (x_{i,j}) = \frac{1}{n_g} \sum_{j \in g} (log_2(I_{i,j})) = \frac{1}{n_g} log_2(\prod_{j \in g} (I_{i,j})) = log_2(\sqrt[n_g]{\prod_{j \in g} (I_{i,j})})\]
For a given gene \(i\), the difference between sample means of two groups (\(g_1\) and \(g_2\)) can be computed as follows.
\[d_i = \bar{x}_{i,g_2} - \bar{x}_{i,g_1}\]
The corresponding fold-change of expression can be computed as follows.
\[FC_i = 2^{d_i}\]
For a comparison between groups 1 and 2 up-regulated genes are those with a fold-change \(\gt 1\), which corresponds to a positive log2 fold change. Reciprocally, down-regulated genes are those with a fold-change \(lt 1\), and have a negative log2 fold change.
Let us take two illustrative examples of up- and down-regulation.
\[a / b = 2 \Rightarrow log2(a/b) = log2(2) \Rightarrow log2(a) - log2(b) = 1 \Rightarrow log2(a) = log(b) + 1 \]
We can consider our \(X\) values as log2-transformed intensities of the first group (\(x = log2(a)\)), and our \(Y\) values as log2-transformed intensities of the second group (\(y = log2(b)\)).
\[a / b = 1/2 \Rightarrow log2(a/b) = log2(1/2) \Rightarrow log2(a) - log2(b) = -1 \Rightarrow log2(a) = log(b) - 1 \]
In the rest of this tutorial, all the analyses will be led on the RMA-normalized values (this is the standard procedure), but we should always keep in mind that the difference between group means of RMA-normalized values is actually a log2-fold change.
We will now load the phenotypic data, which is crucial to any analysis since this data type contains the information on all the samples found in a microarray expression matrix.
## Load phenotypic data
pheno.url <- file.path(microarray.data.url, 'phenoData_GSE13425.tab')
pheno.file <- file.path(dir.data, 'phenoData_GSE13425.tab')
## Download pheno table if not yet present in data dir
if (file.exists(pheno.file)) {
message("Pheno file already downloaded: ", pheno.file)
} else {
message("Downloading pheno table from ", pheno.url)
download.file(url = pheno.url, destfile = pheno.file)
message("Local pheno file: ", expr.file)
}
## Load pheno table in memory
message("Loading pheno table from ", pheno.file)
pheno <- read.table(pheno.file, sep='\t', head=TRUE, row=1)
## Check the size of the pheno table
dim(pheno)
[1] 190 6
## Should give 190 4, i.e. one row per sample, and one column per information
## Get the information type (naes of the pheno table)
names(pheno)
[1] "Sample.title" "Sample.source.name.ch1" "Sample.characteristics.ch1" "Sample.description" "sample.labels" "sample.colors"
## Extract the cancer subtype from the pheno matrix
cancer.type <- as.vector(pheno$Sample.title)
## Check the length of the vector (should be the number of patients, i.e. the number of colums of the expression matrix)
length(cancer.type)
[1] 190
## Check the 10 first values of the cancer type vector
head(cancer.type, n=10)
[1] "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL"
## Check the 10 last values of the cancer type vector
tail(cancer.type, n=10)
[1] "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL"
## Check some other values of the cancer type vector
print(cancer.type[140:150])
[1] "BCR-ABL" "BCR-ABL" "BCR-ABL + hyperdiploidy" "MLL" "MLL" "MLL" "MLL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL"
Th R function table() permits to count the number of elements of different types within a given dataset. We can use it to count the number of samples of each type among the 190 samples. We will sort the result by decreasing size using the R function sort()
## Sort the resulting (one-column) table by decreasing number of samples.
cancer.type.counts <- sort(table(cancer.type), decreasing = TRUE)
## The funciton table() returns an object of a specific R class, named "table"
class(cancer.type.counts)
[1] "table"
## Sort the resulting (one-column) table by decreasing number of samples.
print(cancer.type.counts)
cancer.type
hyperdiploid pre-B ALL TEL-AML1 T-ALL E2A-rearranged (EP) BCR-ABL E2A-rearranged (E-sub) MLL BCR-ABL + hyperdiploidy E2A-rearranged (E) TEL-AML1 + hyperdiploidy
44 44 43 36 8 4 4 4 1 1 1
We can also nicely format the result table using the knitr kable() function. However, this require to convert the table object into a data frame. For this, we can use the R function as.data.frame().
## We can also the number of samples per subtype of leukemia in a
## well-formated table. For this we convert the table in a data.frame
## object, which can be treated by the kable() function.
kable(as.data.frame.table(cancer.type.counts),
caption = "Number of biological samples per subtype of acute lymphoblastic leukemia in Den Boer dataset. ")
cancer.type | Freq |
---|---|
hyperdiploid | 44 |
pre-B ALL | 44 |
TEL-AML1 | 43 |
T-ALL | 36 |
E2A-rearranged (EP) | 8 |
BCR-ABL | 4 |
E2A-rearranged (E-sub) | 4 |
MLL | 4 |
BCR-ABL + hyperdiploidy | 1 |
E2A-rearranged (E) | 1 |
TEL-AML1 + hyperdiploidy | 1 |
We can see that some cancer type are represented by a large number of biological samples (hyperdiploid, pre-B ALL, TEL-AML1 and T-AML) whereas other types are represented by a handful of samples, or even a single one.
Since differential analysis requires a sufficient number of observations per group, we will restrict the following steps to the four best represented subtypes in this data set.
In Affymetrix, each probeset is labelled with a tag which can take three values: Absent (A), Marginal (M), or Present (P), depending on the intensity detected in the different probes representing a given transcript.
## Load Absent/Marginal/Present (AMP) calls
amp.url <- file.path(microarray.data.url, "GSE13425_AMP_Whole.txt")
amp.file <- file.path(dir.data, "GSE13425_AMP_Whole.txt")
if (file.exists(amp.file)) {
message("Absent/Marginal/Presnt file already downloaded: ", amp.file)
} else {
message("Downloading AMP table from ", amp.url)
download.file(url = amp.url, destfile = amp.file)
message("Local AMP file: ", amp.file)
}
## Load AMP table in memory
amp <- read.table(amp.file, sep="\t", head=T, row=1)
## Count the total number of calls of each type (A/M/P)
table(unlist(amp))
A M P
2871373 140251 1222146
## Compute a matrix of Boolean values indicating, for each entry of the expression matrix, whether it is absent or not
absent.calls <- amp=="A"
## Count the number of "Absent" labels per gene as the marginal sum of absent calls matrix.
absent.per.gene <- apply(absent.calls, 1, sum)
## Count the number of genes detected in at least 30 samples
sum(absent.per.gene <= (ncol(expr.matrix) - 30))
[1] 10540
We can draw an histogram indicating the distribution of the number of absent calls per gene.
## Draw an histogram showing the number of absent calls per gene
hist(absent.per.gene, breaks=0:ncol(expr.matrix),
xlab="Number of absent calls", ylab="Number of probesets (genes)",
main="Number of absent calls per probeset")
At first sight, the “absent/marginal/present” calls may seem frightening: altogether, the expression matrix contains more probesets declared absent than present, and the histogram shows that several thousands of genes (4064) are declared absent in absolutely all the samples.
Note that the interpretation of “absent” labels is not obvious. Genes declared “absent” in almost all samples are often considered irrelevant because they are supposed to be undetectable with the microarray, possibly for some technical reason. However, another possibility would be that these genes are actually inactive in most samples (for example because they are generally not expressed in blood cells), and their detection in some samples might reveal a dysregulation of these genes. For this tutorial, we will thus deliberately avoid the classical “filtering” step, which consists in retaining only the genes declared present in a given number of samples.
Note that we will still be able to revise this decision at the very end of the analysis pipeline, by computing the number of genes declared differentially expressed yet absent from most samples.
We provide in the help page a table of symbols and formulas + a short reminder of the principle of the Student t test.
We showed above how to count the number of biological samples per subtype of ALL. In the questions below, we will refer to these subtypes (groups) as g1 and g2.
Each student should now choose two groups containing at least 30 samples, and develop a code to perform the following operations.
For example, for this tutorial I will compare the TEL-AML1 and T-ALL subtypes.
During this practical, you will be led to draw a succession of plots (scatter plots, histograms, …), which will progressively lead you to compute the formula of the \(t_{obs}\) statistics. The important is not so much the final result (you can find this formula in any textbook of statistics), but the way to it: at each step, you will have to summarize in 2 sentences what can be seen on the figure you just generated, and how it relates to the figure generated at the previous step (this is the actual exercise).
Count the number of samples per subtype of ALL, choose two groups (g1, g2) with at least 30 individuals, and build two vectors (g1.columns, g2.columns) indicating which columns of the matrix correspond to these respective groups.
print(data.frame(sort(table(cancer.type), decreasing=TRUE)))
cancer.type Freq
1 hyperdiploid 44
2 pre-B ALL 44
3 TEL-AML1 43
4 T-ALL 36
5 E2A-rearranged (EP) 8
6 BCR-ABL 4
7 E2A-rearranged (E-sub) 4
8 MLL 4
9 BCR-ABL + hyperdiploidy 1
10 E2A-rearranged (E) 1
11 TEL-AML1 + hyperdiploidy 1
## Choose the two groups of interest
## (each student should choose two groups of her/his own, with at least 30 elements)
g1 <- "T-ALL"
g2 <- "TEL-AML1"
## Select the columns of the expression table corresponding to these two groups
g1.columns <- which(cancer.type == g1)
g2.columns <- which(cancer.type == g2)
For this, we use the R function apply(), which applies any function to either the rows or the columns of a matrix.
mean.per.gene <- apply(expr.matrix, 1, mean, na.rm=TRUE)
We will now apply the command mean() to a subset of the columns restricted to the selected groups.
## Compute the mean for the whole dataset
all.means <- apply(expr.matrix, 1, mean, na.rm=TRUE)
## Compute the mean per group in two separate vectors
## (one vector per group, with one entry per gene)
g1.means <- apply(expr.matrix[,g1.columns], 1, mean, na.rm=TRUE)
g2.means <- apply(expr.matrix[,g2.columns], 1, mean, na.rm=TRUE)
Rather than working with a lot of separate vectors for each computed gene-wise statistics (group means, variances, standard deviations, …), we will create a data frame with one row per gene, and one column per statistics.
## Create a result table, which will contain the results of the successive computations
result.table <- data.frame(m = all.means,
m1 = g1.means,
m2 = g2.means)
## Check the first entries of the result table
kable(head(result.table),
caption="First rows of the result table, used to collect different statistics per gene.",
digits=2)
m | m1 | m2 | |
---|---|---|---|
DDR1|1007_s_at | 4.57 | 4.20 | 4.59 |
RFC2|1053_at | 3.17 | 3.37 | 3.06 |
HSPA6|117_at | 2.73 | 2.78 | 2.86 |
PAX8|121_at | 6.16 | 6.10 | 6.15 |
GUCA1A|1255_g_at | 2.18 | 2.18 | 2.18 |
UBA7|1294_at | 5.60 | 5.46 | 5.73 |
In the next steps we will progressively add columns to this table. At the end of the analysis, this will enable us to export the totality of the result, or selected columns if we prefer.
We first present a minimal solution.
## Generate a rough plot comparing the mean expression values between the two groups of interest.
plot(result.table$m1, result.table$m2)
We can improve the plot as follows
Add to the plot some lines to denote the milestones:
For this we use the R function abline(), which can be used in three modalities:
## Compute the axis range: we want to go from 0 (the minimal possible
## expression value) to the first integer that exceeds (ceiling) all
## expression values from group 1 and group 2.
axis.range <- c(0, ceiling(max(result.table[,c("m1","m2")])))
plot(result.table$m1,
result.table$m2,
main="Comparison between group means",
xlab=paste(g1, "mean (RMA-norm log2 values)"),
ylab=paste(g2, "mean (RMA-norm log2 values)"),
xlim=axis.range, ylim=axis.range,
pch=20,
col = densCols(x=result.table$m1,
y=result.table$m2)
)
grid(lty="solid", col="darkgray")
## Add some milestone lines
abline(h=0, col="black")
abline(v=0, col="black")
abline(a=0, b=1)
abline(a=1, b=1, col="red")
abline(a=-1, b=1, col="red")
Note: as discussed above, all values were log2-transformed during the preprocessing (RMA normalization). Consequently, the mean comparison plot,
In the plots above we saw that many genes seem to have a quite different mean expression values between the two selected groups of ALL. However, the difference between two means must be interpreted with caution, because it can reflect either an effect related to the populations themselves (the two cancer types), or stochastic variations of the measurements, which depend on the particular samples that were drawn (sampling fluctuations).
The classical way to solve this issue is to interpret the difference between means relative to the variance.
Beware, the R function var() does not compute the sample variance (\(s^2\)), but rather a sample-based estimate of the population variance (\(\hat{\sigma}^2\)). We thus need to apply the inverse correction as the one used to estimate population variance from sample variance.
\[s^2 = \hat{\sigma}^2 \cdot \frac{n-1}{n}\]
## Compute sample variance for all genes together
n <- ncol(expr.matrix)
result.table$var<- (n-1)/n * apply(expr.matrix, 1, var, na.rm=TRUE)
## Compute the sample variance for each group, knowing that the var()
## function returns the corrected estimate of population variance.
n1 <- length(g1.columns) ## Count the number of samples in group 1
result.table$var.g1 <- (n1-1)/n1*apply(expr.matrix[,g1.columns], 1, var, na.rm=TRUE)
n2 <- length(g2.columns) ## Count the number of samples in group 2
result.table$var.g2 <- (n2-1)/n2*apply(expr.matrix[,g2.columns], 1, var, na.rm=TRUE)
## Compute standard deviations
result.table$sd <- sqrt(result.table$var) ## Sample sd for the whole expression matrix
result.table$sd.g1 <- sqrt(result.table$var.g1) ## Sample sd for group 1
result.table$sd.g2 <- sqrt(result.table$var.g2) ## Sample sd for group 2
## Check the result
kable(head(result.table), digits=2,
caption="First rows of the result table, used to collect different statistics per gene.")
m | m1 | m2 | var | var.g1 | var.g2 | sd | sd.g1 | sd.g2 | |
---|---|---|---|---|---|---|---|---|---|
DDR1|1007_s_at | 4.57 | 4.20 | 4.59 | 0.19 | 0.08 | 0.13 | 0.43 | 0.28 | 0.37 |
RFC2|1053_at | 3.17 | 3.37 | 3.06 | 0.06 | 0.09 | 0.03 | 0.25 | 0.30 | 0.18 |
HSPA6|117_at | 2.73 | 2.78 | 2.86 | 0.45 | 0.61 | 0.77 | 0.67 | 0.78 | 0.88 |
PAX8|121_at | 6.16 | 6.10 | 6.15 | 0.03 | 0.03 | 0.03 | 0.17 | 0.18 | 0.16 |
GUCA1A|1255_g_at | 2.18 | 2.18 | 2.18 | 0.01 | 0.01 | 0.01 | 0.09 | 0.08 | 0.09 |
UBA7|1294_at | 5.60 | 5.46 | 5.73 | 0.19 | 0.19 | 0.15 | 0.44 | 0.43 | 0.38 |
par(mfrow=c(1,2))
## Compare gene-wise variances between the two groups.
max.var <- max(result.table[, c("var.g1","var.g2")])
var.range <- c(0, ceiling(max.var))
plot(result.table$var.g1,
result.table$var.g2,
main="Group-wise variances",
xlab=paste("Sample variance for", g1),
ylab=paste("Sample variance for", g2),
xlim=var.range, ylim=var.range,
col=densCols(result.table$var.g1, result.table$var.g2)
)
grid(lty="solid", col="grey")
abline(col="black", a=0, b=1) ## Plot the diagonal (no difference between variances)
## Compare gene-wise standard deviations between the two groups.
max.sd <- max(result.table[, c("sd.g1","sd.g2")])
sd.range <- c(0, ceiling(max.sd))
plot(result.table$sd.g1,
result.table$sd.g2,
main="Group-wise standard dev.",
xlab=paste(g1, "sample standard deviation"),
ylab=paste(g2, "sample standard deviation"),
xlim=sd.range, ylim=sd.range,
col=densCols(result.table$sd.g1, result.table$sd.g2)
)
grid(lty="solid", col="grey")
abline(col="black", a=0, b=1) ## Plot the diagonal (no difference between standard deviations)
par(mfrow=c(1,1))
## Draw a plot comparing the variance (Y) to the mean (X) for the whole expression table
plot(result.table$m, result.table$sd,
main=paste("All samples: mean vs var"),
xlab=paste("Sample mean"),
ylab=paste("Sample standard deviation"),
col=densCols(result.table$m, result.table$s2)
)
grid(lty="solid", col="gray")
We will set the R graphical parameter mfrow to draw the two graphs side-by-side for the sake of comparison.
## Draw a plot comparing the variance (Y) to the mean (X) for each group
par(mfrow=c(1,2))
plot(result.table$m1, result.table$var.g1,
main=paste(g1, "mean vs var"),
xlab=paste(g1, "sample mean"),
ylab=paste(g1, "sample variance"),
col=densCols(result.table$m1, result.table$var.g1)
)
grid(lty="solid", col="gray")
plot(result.table$m2, result.table$var.g2,
main=paste(g2, "mean vs var"),
xlab=paste(g2, "sample mean"),
ylab=paste(g2, "sample variance"),
col=densCols(result.table$m2, result.table$var.g2)
)
grid(lty="solid", col="gray")
par(mfrow=c(1,1))
On the mean comparison plot, most genes are aligned along the diagonal, with some fluctuations. Some genes are however clearly distant, suggesting that they are over-expressed in TEL-AML1 relative to T-ALL. Other points are clearly much lower than the diagonal, suggesting that they are over-expressed in T-ALL relative to TEL-AML1.
The variance and standard deviation plots show very large differences between group 1 and group 2. This is a general property, which does not seem to be restricted to genes with perturbed expression. Indeed, in contrast with the mean comparison plot, the points do not seem to be aligned along the diagonal. Instead, most points seem to spread all over the lower square. In such conditions, we could hardly support the assumption of homoscedasticity (equal variance between groups), and we should thus apply Welch’s test rather than Student. We will come back to this point later.
The mean versus variance plots show us that different genes can have very different variances, even though they have similar means. This means that the differences between the means will have to be evaluated relative to the variances. This effect is visible when the variance is computed for all the samples together, but also for each one of the two groups of interest, when the plots are drawn separately.
## The effect size is the difference between group means.
## Since the RMA-normalized data is log2-transformed, this effect size
## is equivalent to the "M" variable of the MA plots.
result.table$d <- result.table$m1 - result.table$m2
## Mean of the two group means.
## Since the RMA-normalized data is log2-transformed, this effect size
## is equivalent to the "A" variable of the MA plots.
result.table$A <- (result.table$m1 + result.table$m2)/2
plot(result.table$A,
result.table$d,
main="MA plot",
xlab= "Average expression: A = (m1 + m2)/2",
ylab = "Effect size: d = m2 - m1",
col=densCols(result.table$d, result.table$A)
)
grid(lty="solid", col="grey")
abline(col="black", h=0)
abline(col="red", h=c(1, -1))
Let us keep in mind that the RMA-normalized values are log2-transformed. Consequently:
The graph we just drawn thus corresponds to the classical MA plot, representing the log2-fold-change (\(M\)) as a function of the geometric mean of expression (\(A\)).
Note the topological correspondence between this plot and the mean comparison plot above.
We can simply derive the standard deviations from the above computed variances. The pooled standard deviation is computed according to the following formula (see the help page for more details]).
\[\hat{\sigma}_p = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1+n_2-2}}\]
\[s = \sqrt{s^2}\]
## Compute pooled standard deviation
result.table$var.pooled <- (n1* result.table$var.g1 + n2* result.table$var.g2) / (n1+n2-2)
result.table$sd.pooled <- sqrt(result.table$var.pooled)
## Check the result
kable(head(result.table), digits=2,
caption="First rows of the result table, used to collect different statistics per gene.")
m | m1 | m2 | var | var.g1 | var.g2 | sd | sd.g1 | sd.g2 | d | A | var.pooled | sd.pooled | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DDR1|1007_s_at | 4.57 | 4.20 | 4.59 | 0.19 | 0.08 | 0.13 | 0.43 | 0.28 | 0.37 | -0.40 | 4.40 | 0.11 | 0.33 |
RFC2|1053_at | 3.17 | 3.37 | 3.06 | 0.06 | 0.09 | 0.03 | 0.25 | 0.30 | 0.18 | 0.31 | 3.22 | 0.06 | 0.25 |
HSPA6|117_at | 2.73 | 2.78 | 2.86 | 0.45 | 0.61 | 0.77 | 0.67 | 0.78 | 0.88 | -0.07 | 2.82 | 0.72 | 0.85 |
PAX8|121_at | 6.16 | 6.10 | 6.15 | 0.03 | 0.03 | 0.03 | 0.17 | 0.18 | 0.16 | -0.05 | 6.13 | 0.03 | 0.17 |
GUCA1A|1255_g_at | 2.18 | 2.18 | 2.18 | 0.01 | 0.01 | 0.01 | 0.09 | 0.08 | 0.09 | -0.01 | 2.18 | 0.01 | 0.09 |
UBA7|1294_at | 5.60 | 5.46 | 5.73 | 0.19 | 0.19 | 0.15 | 0.44 | 0.43 | 0.38 | -0.27 | 5.59 | 0.17 | 0.41 |
## Plot the pooled standard deviations as a function of the effect size.
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
result.table$sd.pooled, ylab="pooled standard deviation",
main = "Standard deviation vs. effect size",
col=densCols(result.table$d, result.table$sd.pooled)
)
grid(col="grey", lty="solid")
## Line marking the absence of effect
abline(v=0, col="black")
## Line marking the inferior boundary of standard deviations (cannot be negative)
abline(h=0, col="black")
## Lines marking arbitrary effect sizes of 1 and -1 respectively,
## which correspond to fold-changes of 2 and 1/2 in raw microarray
## intensities.
abline(v=c(-1,1), col="red")
The plot above shows that the same effect size (e.g. a difference of \(d=2\) between the two groups) can be associated to very different standard deviations. We thus get the feeling that a given effect size should not be considered as significant when they are associated with a high rather than a low value for the pooled standard deviation. Indeed, when a gene is associated to a high sample variability, our estimation of the difference between groups can fluctuate, and we might have obtained another estimation of the difference with a different sampling.
This is precisely what the \(t\) statistics serves for: it provides an interpretation the difference between the means relative to the estimated sampling variations. In the next section we will compute it.
\(t_{obs} = \frac{\hat{\delta}}{\hat{\sigma}_\delta} = \frac{\bar{x}_2 - \bar{x}_1}{\sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1+n_2-2} \left(\frac{1}{n_1}+ \frac{1}{n_2}\right)}}\)
## Computation of Student t statistics
result.table$t.Student <- result.table$d / (result.table$sd.pooled * sqrt(1/n1+1/n2))
## Plot the pooled standard deviations as a function of the effect size.
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
result.table$t.Student, ylab="Student t statistics",
main = "Student t statistics vs. effect size",
col=densCols(result.table$d, result.table$t.Student)
)
grid(col="grey", lty="solid")
## Lines marking the absence of effect on both effect size and Student t statistics
abline(v=0, col="black")
abline(h=0, col="black")
## Lines marking arbitrary effect sizes of 1 and -1 respectively,
## which correspond to fold-changes of 2 and 1/2 in the raw microarray
## intensities.
abline(v=c(-1,1), col="red")
We can see here that the Student statistics (\(t_S\)) is correlated to the effect size (\(d\)), yet the two statistics show important differences: a given value of effect size (e.g. \(d=1\), denoted by a red vertical line on the plot) corresponds to a wide range of values for the \(t_S\) statistics.
As shown in the table of symbols and formulas, Student statistics has been obtained by dividing the effect size (the raw difference between the two sample means) by an estimation of the standard deviation of this difference (\(\hat{\sigma}_{delta}\)).
In summary, Student’s \(t\) statistics (\(t_S\)) is a way to measure the difference between group means relative to their dispersion.
For this, we need to define the degrees of freedom. In a two-sample t-test (as in our case), the degrees of freedom are computed as the number of elements in the two groups (\(n_1\), \(n_2\)) minus the number of means estimated from these elements (we estimated the means of group 1 and group 2, respectively). Thus: \(df = n_1 + n_2 - 2\).
We will perform a two-sided test, i.e. we want to estimate the significance for negative as well as positive differences between the two means. The simplest way to do this is to compute the p-value of the absolute difference, and to multiply it by two. This gives us the probability to observe by chance a \(t_S\) value at least as distant from the null value, on either tails of the distribution.
## Define the sign of the difference: up (m2 > m1) or down (m1 < m2)
result.table$sign <- "null"
result.table$sign[result.table$d > 0] <- "up"
result.table$sign[result.table$d < 0] <- "down"
## Count the number of up, down or null signs
print(table(result.table$sign))
down null up
11157 2 11124
## Computation of Student p-value.
## The attribute "lower.tail" will be TRUE if the difference is negative
result.table$p.value <- 2*pt(abs(result.table$t.Student),
df=n1 + n2 - 2, lower.tail = FALSE)
par(mfrow=c(1, 2))
################################################################
## Plot the pooled standard deviations as a function of the effect size.
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
result.table$p.value, ylab="Student p-value",
main = "Student p-value vs. effect size",
col=densCols(result.table$d, result.table$p.value)
)
grid(col="grey", lty="solid")
## Line marking the null effect
abline(v=0, col="black")
## Lines marking arbitrary effect sizes of 1 and -1 respectively,
## which correspond to fold-changes of 2 and 1/2 in the raw microarray
## intensities.
abline(v=c(-1,1), col="red")
## Line marking an arbitrary threshold of 5% on the p-value
abline(h=0.01, col="red")
################################################################
## Draw a volcano plot
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
-log10(result.table$p.value), ylab="-log10(p-value)",
main = "Volcano plot",
col=densCols(result.table$d, -log10(result.table$p.value))
)
grid(col="grey", lty="solid")
## Line marking the null effect
abline(v=0, col="black")
## Lines marking arbitrary effect sizes of 1 and -1 respectively,
## which correspond to fold-changes of 2 and 1/2 in the raw microarray
## intensities.
abline(v=c(-1,1), col="red")
## Line marking an arbitrary threshold of 5% on the p-value
abline(h=-log10(0.05), col="red")
par(mfrow=c(1,1))
## Histogram of the p-value distribution
hist(result.table$p.value, breaks=20, main="P-value histogram", xlab="Pvalue", ylab="Number of probesets", col="grey")
################################################################
## Check that the result for a particular gene is consistent with R t.test() function
test.gene <- 1
test.gene.student <- t.test(x=expr.matrix[test.gene, cancer.type==g1], y = expr.matrix[test.gene, cancer.type==g2], var.equal = TRUE)
print(test.gene.student)
Two Sample t-test
data: expr.matrix[test.gene, cancer.type == g1] and expr.matrix[test.gene, cancer.type == g2]
t = -5.2684, df = 77, p-value = 1.215e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5489154 -0.2477901
sample estimates:
mean of x mean of y
4.195833 4.594186
result.table$m1[test.gene]
[1] 4.195833
result.table$m2[test.gene]
[1] 4.594186
result.table$t.Student[test.gene]
[1] -5.268382
result.table$p.value[test.gene]
[1] 1.215063e-06
################################################################
## Expected number of positives below a given alpha threshold
## selected significance threshold (somewhat arbitrary)
alpha <- 0.05
## Count the number of genes declared significant with this alpha on the nominal p-value
n.positive <- sum(result.table$p.value < alpha)
n.tests <- nrow(result.table) ## Number of probesets (genes) tested
f.positive <- n.positive / n.tests ## Proportion of genes declared positive
exp.n.positive <- alpha * n.tests
The e-value of a gene is the number of false positives that would be expected if we set the threshold to the p-value that was obtained for this particular gene. In other terms, if we consider this gene (and all the genes having a smaller or equal p-value) as significant, how many false positives would we expect ?
The e-value is simply obtained by multiplying the nominal p-value (i.e. the p-value computed for each individual test) by the number of tests (in our case, the number of genes in the expression matrix).
################################################################
## Compute the e-value
n.tests <- nrow(expr.matrix)
result.table$e.value <- result.table$p.value * n.tests
result.table$sig <- -log10(result.table$e.value) ## Significance
################################################################
## Draw a volcano plot with the e-value rather than p-value
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
result.table$sig, ylab="sig = -log10(e-value)",
main = "E-value volcano plot",
col=densCols(result.table$d,result.table$sig),
ylim=c(min(result.table$sig), max(result.table$sig)*1.2)
)
grid(col="grey", lty="solid")
## Line marking the null effect
abline(v=0, col="black")
## Lines marking arbitrary effect sizes of 1 and -1 respectively,
## which correspond to fold-changes of 2 and 1/2 in the raw microarray
## intensities.
abline(v=c(-1,1), col="red")
## Line marking an arbitrary threshold of 5% (stringent) or 1% (lenient) on the e-value
abline(h=-log10(alpha), col="red") ## Stringent threshold: e-value <= 0.05
abline(h=-log10(1), col="red", lty="dashed") ## Lenient threshold : e-value <= 1
## Draw a legend with the number of genes declared positive
legend("topleft",
legend=c(paste("e-value <=", alpha, ":", sum(result.table$e.value <= alpha)),
paste("e-value <=", 1, ":", sum(result.table$e.value <= 1))),
lty=c("solid","dashed"),
col="red",
lwd=2)
################################################################
## Compute the FDR (False Discovery Rate)
result.table$fdr <- stats::p.adjust(result.table$p.value, method="fdr")
## Compare FDR to p-value
plot(result.table$p.value, result.table$fdr, log="xy"); abline(a=0, b=1); grid()
## This FDR seems strange, it does not correspond to my computations, and it seems to be a monotonous function of the p-value, which should not be the case
## Compute the q-value
# source("https://bioconductor.org/biocLite.R")
# biocLite("qvalue")
library(qvalue)
################################################################
## Draw a volcano plot with the e-value rather than p-value
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
-log10(result.table$fdr), ylab="sig = -log10(FDR)",
main = "FDR volcano plot",
col=densCols(result.table$d,-log10(result.table$fdr)),
ylim=c(min(-log10(result.table$fdr)), max(-log10(result.table$fdr))*1.2)
)
grid(col="grey", lty="solid")
## Line marking the null effect
abline(v=0, col="black")
## Lines marking arbitrary effect sizes of 1 and -1 respectively,
## which correspond to fold-changes of 2 and 1/2 in the raw microarray
## intensities.
abline(v=c(-1,1), col="red")
## Line marking an arbitrary threshold of 5% on the FDR
abline(h=-log10(alpha), col="red")
## Draw a legend with the number of genes declared positive
legend("topleft",
legend=c(paste("e-value <=", alpha, ":", sum(result.table$e.value <= alpha)),
paste("e-value <=", 1, ":", sum(result.table$e.value <= 1))),
lty=c("solid","dashed"),
col="red",
lwd=2)
## Negative control set
In order to get a feeling of the expected behaviour of the test under null hypothesis (in absence of differences between the two groups), we will generate a negative control set, i.e. an artificial dataset compliant with the assumptions of the Welch test (normal populatins with different variances), and measure the same statistics as with the expression matrix.
In addition, we will calibrate our random dataset in order to simulate the actual expression matrix in the following way.
The artificial data will be drawn in random normal populations of equal means (\(H_0= \mu_1 = \mu2\)) but where the variances (\(\sigma1\), \(\sigma2\)) may differ between groups (heteroscedasticity).
For each row of the random dataset, the population means will be set equal to the mean expression value of the corresponding row (gene) of the expression table.
We will then generate the same plots as above, but systematically compare the results obtained with the expression matrix and with the random data, respectively.
g <- nrow(expr.matrix)
## First group
n1 <- length(g1.columns)
mu1 <- apply(expr.matrix[, c(g1.columns, g2.columns)], 1, mean, na.rm=TRUE)
sigma1 <- apply(expr.matrix[, g1.columns], 1, sd, na.rm=TRUE)
## Second group
n2 <- length(g2.columns)
mu2 <- apply(expr.matrix[, c(g1.columns, g2.columns)], 1, mean, na.rm=TRUE)
sigma2 <- apply(expr.matrix[, g2.columns], 1, sd, na.rm=TRUE)
group1 <- matrix(nrow=g, rnorm(n = g * n1, mean = mu1, sd = sigma1))
# View(group1)
group2 <- matrix(nrow=g, rnorm(n = g * n2, mean = mu2, sd = sigma2))
dim(group2)
[1] 22283 43
## Merge the two datasets in a single matrix
rnorm.data <- cbind(group1, group2)
class(rnorm.data)
[1] "matrix"
dim(rnorm.data)
[1] 22283 79
rand.result.table <- data.frame(
m1 = apply(group1, 1, mean),
m2 = apply(group2, 1, mean))
par(mfrow=c(1,2))
## Compute the axis range: we want to go from 0 (the minimal possible
## expression value) to the first integer that exceeds (ceiling) all
## expression values from group 1 and group 2.
axis.range <- c(0, ceiling(max(result.table[,c("m1","m2")])))
plot(result.table$m1,
result.table$m2,
main="Comparison between group means\nDen Boer, 2009",
xlab=paste(g1, "mean (RMA-norm log2 values)"),
ylab=paste(g2, "mean (RMA-norm log2 values)"),
xlim=axis.range, ylim=axis.range,
pch=20,
col = densCols(x=result.table$m1,
y=result.table$m2)
)
grid(lty="solid", col="darkgray")
abline(a=0, b=1)
abline(a=1, b=1, col="red")
abline(a=-1, b=1, col="red")
plot(rand.result.table$m1,
rand.result.table$m2,
main="Random dataset under H0",
xlab=paste("Group 1 mean"),
ylab=paste(g2, "Group 2 mean"),
xlim=axis.range, ylim=axis.range,
pch=20,
col = densCols(x=rand.result.table$m1,
y=rand.result.table$m2)
)
grid(lty="solid", col="darkgray")
# plot(rand.mean1, rand.mean2,
# col="grey", main="Negative control\n(random normal under H0)", xlab="m1", ylab="m2"
# )
# grid()
abline(a=0, b=1)
abline(a=1, b=1, col="red")
abline(a=-1, b=1, col="red")
par(mfrow=c(1,1))
## The effect size is the difference between group means.
## Since the RMA-normalized data is log2-transformed, this effect size
## is equivalent to the "M" variable of the MA plots.
rand.result.table$d <- rand.result.table$m1 - rand.result.table$m2
## Mean of the two group means.
## Since the RMA-normalized data is log2-transformed, this effect size
## is equivalent to the "A" variable of the MA plots.
rand.result.table$A <- (rand.result.table$m1 + rand.result.table$m2)/2
par(mfrow=c(1,2))
xrange <- range(result.table$A)
yrange <- range(result.table$d)
plot(result.table$A,
result.table$d,
main="MA plot\nDen Boer, 2009",
xlab= "Average expression: A = (m1 + m2)/2",
ylab = "Effect size: d = m2 - m1",
xlim=xrange,
ylim=yrange,
col=densCols(result.table$d, result.table$A)
)
grid(lty="solid", col="grey")
abline(col="black", h=0)
abline(col="red", h=c(1, -1))
plot(rand.result.table$A,
rand.result.table$d,
main="MA plot\nRandom dataset under H0",
xlab= "Average expression: A = (m1 + m2)/2",
ylab = "Effect size: d = m2 - m1",
xlim=xrange,
ylim=yrange,
col=densCols(rand.result.table$d, rand.result.table$A)
)
grid(lty="solid", col="grey")
abline(col="black", h=0)
abline(col="red", h=c(1, -1))
par(mfrow=c(1,1))
Den Boer, Monique L, Marjon van Slegtenhorst, Renée X De Menezes, Meyling H Cheok, Jessica G C A M Buijs-Gladdines, Susan T C J M Peters, Laura J C M Van Zutven, et al. 2009. “A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study.” The Lancet Oncology 10 (2): 125–34.