Abbreviations

Abbreviation Description
ALL Acute Lymphoblastic Leukaemia
A/M/P Absent / Marginal / Present calls

Goal of this practical

The goal of this practical is to understand the principles of Student t-test by exploring step-by-step the different components of the \(t_{obs}\) statistics. To this purpose, we will apply a test to detect differentially expressed genes (DEG) between two cancer types in microarray transcriptomic profiles.


Study case

As study case, we will analyse microarray transcriptomic profiles from (M. L. Den Boer et al. 2009).

In this experiment, the authors characterized the transcriptome of 190 patients suffering from Acute Lymphoblastic Leukaemia (ALL). Lymphoblastic leukaemia is characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation. The abnormal proliferation can result from various mutations, some of which are recurring in various patients. The DenBoer dataset includes information about the genetic problem associated with each leukaemia.

In their original article, Den Boer and co-workers defined a transcriptomic signature, i.e. a selection of genes whose transcription profile is specifically affected in one or more subtypes of leukaemia. The expression profiles of these genes can then be used to classify new samples from acute lymphoblastic leukaemia (ALL), and predict the type of associated genetic perturbation.

An advantage of this dataset is that we dispose of relatively large samples (>30 individuals per group), relative to many other publications of transcriptome analysis.

Pre-requesite: installing Bioconductor

We will need to use some Bioconductor libraries in the subsequent steps (e.g the qvalue library). Bioconductor is a project offering open source libraries for R environnement dedicated to analysis of high-throughput genomic data. If Bioconductor is not already installed on your computer, copy and paste the following instructions in R:

## Install specifically the qvalue library
if (!require("qvalue")) {
  ## Install base libraries for Bioconductor
  ## If you are asked to create a personnal library answer "YES".
  source("https://bioconductor.org/biocLite.R")
  biocLite()
  biocLite("qvalue")
}

Pre-processing of the data

The full dataset is available in the the Gene Expression Omnibus (GEO) database, as a series with ID GSE13425. However we do not need to download the raw dataset. Instead, we will provide on this web site the normalized matrix expression profiles in the form of tab-delimited files.

Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Information related to this platform are available on GEO website under identifier GPL96. More information related to Affymetrix microarrays normalization can be obtained in the section Normalization of Affymetrix DNA chip (slides) and Handling and normalizing affymetrix data with bioconductor (practical)).


Tutorial: loading data tables into R

Summary

  • Start R.

  • Have a look at the description of the read.table() function: in the R console, type help(read.table) and read the help page.

  • In the next sections, will load three data tables into R using the R function read.table(). This function allows us to directly read tab-delimited text files located either on our own computer or on a remote server (using an http URL). We will successively load 3 files providing complementary information.

    • the expression matrix (GSE13425_Norm_Whole.txt)
      • Contains genes as rows and samples as columns.
      • Data were previously normalized using RMA algorithm (they are thus transformed in logarithm base 2).
    • the A/P/M matrix (GSE13425_AMP_Whole.txt)
      • Indicates whether a gene was called Absent, Present or Marginal.
    • Phenotypic data (GSE13425_phenoData.txt)
      • The GSE13425_phenoData.txt file contains phenotypic information about samples.

Creating an output directory on your own computer

We will define and create a directory to store the results on our computer.

## Define the output directory. You can adapt this to your local configuration.
dir.output <- "~/STAT2_practicals/GSE13425"

## Create the output directory (if it does not exist yet)
dir.create(dir.output, showWarnings=FALSE, recurs=TRUE)

## Change directory to dir.output
setwd(dir.output)

## Print a message with output directory
message("Output directory for this practical: ", dir.output)

## Check it this directory already contains some files.
list.files(dir.output)
[1] "data"

Definition of the course URL and data path

We will load the data from this course supporting Web site. For this we need to define two variables.

  1. The general URL of this course, which we store in a variable url.course.
  2. The path to the microarray datasets, which is defined by appending the folder (data/marrays) to the URL of the course. For this, we use the R function file.path(), which automatically concatenates the pieces of a file or folder path.
## Define the URL of the example data sets
url.course <- "http://pedagogix-tagc.univ-mrs.fr/courses/ASG1"
microarray.data.url <- file.path(url.course, "data", "marrays")

## Check the results
print(paste("Course URL: ", url.course))
[1] "Course URL:  http://pedagogix-tagc.univ-mrs.fr/courses/ASG1"
print(paste("URL for the microarray data: ", microarray.data.url))
[1] "URL for the microarray data:  http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays"

Loading the expression matrix

We will now load the expression matrix, which is stored in a file named GSE13425_Norm_Whole.txt in the microarray data folder of the server. This file contains a table with one row per probeset (~gene) and one column per sample. Each sample was taken from a patient suffering from a particular type of Acute Lymphoblastic Leukemia (ALL).

We will once use the following functions:

  • file.path() to build the full URL to the expression table
  • download.file() to download the file from the URL and obtain a local copy. This will avoid us to download the whole file each time we restart the analysis.
  • read.table() to read the table from this URL. Note that this functions allows you to load a table either from a local folder (on your computer), or from any Web site.
## Get some help about the read.table fonction
# help(read.table)

## URL of the expression table on the course server
expr.url <- file.path(microarray.data.url, "GSE13425_Norm_Whole.txt") ## Build the full URL of the matrix

## Load expression values
dir.data <- file.path(dir.output, "data")
dir.create(dir.data, recursive = TRUE, showWarnings = FALSE)
expr.file <- file.path(dir.data, "DenBoer_2009_GSE13425_Norm_Whole.txt")

if (file.exists(expr.file)) {
  message("Expression file already downloaded: ", expr.file)
} else {
  message("Downloading matrix table from ", expr.url)
  download.file(url = expr.url, destfile = expr.file)
  message("Local expression file: ", expr.file)
}

## Load the expression table in memory
message("Loading expression table from ", expr.file)
expr.matrix <-  read.table(expr.file,sep="\t", head=T, row=1)

Questions

After having loaded the expression matrix with the R function read.table(), we should have a variable of type data.frame containing one row per gene and one column per sample. Before going further, we will perform some basic tests to check the content of this data frame.

  1. Count the number of genes and samples in the expression table.
  2. Display the first 10 rows and 5 columns of the matrix
  3. Draw an histogram with the values of the \(26^{th}\) column (sample) of the expression matrix.

Solution

## Check the type of the variable "expr.matrix" (should return data.frame)
class(expr.matrix)
[1] "data.frame"
## Get the dimensions of the expresion table
dim(expr.matrix)
[1] 22283   190
## Number of probesets (rows)
nrow(expr.matrix)
[1] 22283
## Number of samples (columns)
ncol(expr.matrix)
[1] 190
## Display the first 10 rows and 5 columns of the matrix
print(expr.matrix[1:10,1:5])
                 GSM338666 GSM338667 GSM338668 GSM338669 GSM338670
DDR1|1007_s_at        4.38      4.38      4.23      3.98      5.12
RFC2|1053_at          3.44      3.33      3.73      3.58      3.40
HSPA6|117_at          2.51      2.62      2.37      2.43      2.49
PAX8|121_at           6.24      6.21      6.10      6.09      6.53
GUCA1A|1255_g_at      2.29      2.23      2.17      2.26      2.27
UBA7|1294_at          5.57      5.83      5.46      5.20      6.83
THRA|1316_at          3.09      3.34      3.10      3.20      3.16
PTPN21|1320_at        2.35      2.46      2.30      2.34      2.21
CCL5|1405_i_at        2.45      2.52      2.48      2.44      2.22
CYP2E1|1431_at        2.33      2.42      2.18      2.15      2.27

Since we are generating this report with an R markdown document with RStudio, we can also use the kable() function of the R package knitr to generate a nicely formatted table.

## Generate a nicely formatted table. 
library(knitr) ## Load the R library knitr 
kable(expr.matrix[1:10,1:5]) ## Generate a nicely formatted table
GSM338666 GSM338667 GSM338668 GSM338669 GSM338670
DDR1|1007_s_at 4.38 4.38 4.23 3.98 5.12
RFC2|1053_at 3.44 3.33 3.73 3.58 3.40
HSPA6|117_at 2.51 2.62 2.37 2.43 2.49
PAX8|121_at 6.24 6.21 6.10 6.09 6.53
GUCA1A|1255_g_at 2.29 2.23 2.17 2.26 2.27
UBA7|1294_at 5.57 5.83 5.46 5.20 6.83
THRA|1316_at 3.09 3.34 3.10 3.20 3.16
PTPN21|1320_at 2.35 2.46 2.30 2.34 2.21
CCL5|1405_i_at 2.45 2.52 2.48 2.44 2.22
CYP2E1|1431_at 2.33 2.42 2.18 2.15 2.27

Distribution of expression values

Before going any further, we will draw an histogram with the distribution of all the expression values for all genes and all samples.

The R function hist() enables to draw an histogram, give a vector of values. To get more information about this function, you can type help(hist) in the R console.

Since the expression matrix is stored in and R object of class data.frame, we will first need to convert it to a vector. For this, we can use the R function unlist().

## Histogram of all expression values
hist(unlist(expr.matrix), breaks=100, 
     main="Den Boer expression values",
     xlab="RMA-normalized expression value (log2-transformed)",
     ylab="Number of measures",
     )
Distribution of RMA-normalized expression values in Den Boer expression matrix. Beware that RMA normalization includes a log2 transformation.

Distribution of RMA-normalized expression values in Den Boer expression matrix. Beware that RMA normalization includes a log2 transformation.

The global distribution of RMA-normalized expression values across for all samples and all genes ranges from 1.35 to 13.3. We should be aware that RMA normalization includes a log2 transformation.

This has important consequences

  1. The actual intensities of the measured expression are obtained with \(I = 2^x\). For this expression matrix, intensities thus range from \(2^{min(x)} = 2.55\) to \(2^{max(x)} = 10086\).

  2. Since the gene-wise group means were computed on log2-transformed values, they correspond to the geometric mean rather than the arithmetic mean.

\[\bar{x}_{i,g} = \frac{1}{n_g} \sum_{j \in g} (x_{i,j}) = \frac{1}{n_g} \sum_{j \in g} (log_2(I_{i,j})) = \frac{1}{n_g} log_2(\prod_{j \in g} (I_{i,j})) = log_2(\sqrt[n_g]{\prod_{j \in g} (I_{i,j})})\]

  1. For a given gene \(i\), the difference between group means is equivalent to a log-ratio between the non-log2-transformed values. It should be interpreted as a log2-fold change.

For a given gene \(i\), the difference between sample means of two groups (\(g_1\) and \(g_2\)) can be computed as follows.

\[d_i = \bar{x}_{i,g_2} - \bar{x}_{i,g_1}\]

The corresponding fold-change of expression can be computed as follows.

\[FC_i = 2^{d_i}\]

For a comparison between groups 1 and 2 up-regulated genes are those with a fold-change \(\gt 1\), which corresponds to a positive log2 fold change. Reciprocally, down-regulated genes are those with a fold-change \(lt 1\), and have a negative log2 fold change.

Let us take two illustrative examples of up- and down-regulation.

  1. Up-regulation fold-change of \(FC = 2\) corresponds to a log2-fold change of \(d = log_2(FC)= 1\).

\[a / b = 2 \Rightarrow log2(a/b) = log2(2) \Rightarrow log2(a) - log2(b) = 1 \Rightarrow log2(a) = log(b) + 1 \]

We can consider our \(X\) values as log2-transformed intensities of the first group (\(x = log2(a)\)), and our \(Y\) values as log2-transformed intensities of the second group (\(y = log2(b)\)).

  1. Down-regulation: a fold-change of \(FC = 1/2\) corresponds to a log2-fold change of \(d = log_2(FC) = -1\).

\[a / b = 1/2 \Rightarrow log2(a/b) = log2(1/2) \Rightarrow log2(a) - log2(b) = -1 \Rightarrow log2(a) = log(b) - 1 \]

In the rest of this tutorial, all the analyses will be led on the RMA-normalized values (this is the standard procedure), but we should always keep in mind that the difference between group means of RMA-normalized values is actually a log2-fold change.


Loading phenotypic data

We will now load the phenotypic data, which is crucial to any analysis since this data type contains the information on all the samples found in a microarray expression matrix.

## Load phenotypic data
pheno.url <- file.path(microarray.data.url, 'phenoData_GSE13425.tab')
pheno.file <- file.path(dir.data, 'phenoData_GSE13425.tab')

## Download pheno table if not yet present in data dir
if (file.exists(pheno.file)) {
  message("Pheno file already downloaded: ", pheno.file)
} else {
  message("Downloading pheno table from ", pheno.url)
  download.file(url = pheno.url, destfile = pheno.file)
  message("Local pheno file: ", expr.file)
}

## Load pheno table in memory
message("Loading pheno table from ", pheno.file)
pheno <- read.table(pheno.file, sep='\t', head=TRUE, row=1)

## Check the size of the pheno table 
dim(pheno) 
[1] 190   6
## Should give 190 4, i.e. one row per sample, and one column per information

## Get the information type (naes of the pheno table)
names(pheno)
[1] "Sample.title"               "Sample.source.name.ch1"     "Sample.characteristics.ch1" "Sample.description"         "sample.labels"              "sample.colors"             
## Extract the cancer subtype from the pheno matrix
cancer.type <- as.vector(pheno$Sample.title)

## Check the length of the vector (should be the number of patients, i.e. the number of colums of the expression matrix)
length(cancer.type)
[1] 190
## Check the 10 first values of the cancer type vector
head(cancer.type, n=10)
 [1] "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL" "T-ALL"
## Check the 10 last values of the cancer type vector
tail(cancer.type, n=10)
 [1] "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL" "pre-B ALL"
## Check some other values of the cancer type vector
print(cancer.type[140:150])
 [1] "BCR-ABL"                 "BCR-ABL"                 "BCR-ABL + hyperdiploidy" "MLL"                     "MLL"                     "MLL"                     "MLL"                     "pre-B ALL"               "pre-B ALL"               "pre-B ALL"               "pre-B ALL"              

Th R function table() permits to count the number of elements of different types within a given dataset. We can use it to count the number of samples of each type among the 190 samples. We will sort the result by decreasing size using the R function sort()

## Sort the resulting (one-column) table by decreasing number of samples.
cancer.type.counts <- sort(table(cancer.type), decreasing = TRUE)

## The funciton table() returns an object of a specific R class, named "table"
class(cancer.type.counts)
[1] "table"
## Sort the resulting (one-column) table by decreasing number of samples.
print(cancer.type.counts)
cancer.type
            hyperdiploid                pre-B ALL                 TEL-AML1                    T-ALL      E2A-rearranged (EP)                  BCR-ABL   E2A-rearranged (E-sub)                      MLL  BCR-ABL + hyperdiploidy       E2A-rearranged (E) TEL-AML1 + hyperdiploidy 
                      44                       44                       43                       36                        8                        4                        4                        4                        1                        1                        1 

We can also nicely format the result table using the knitr kable() function. However, this require to convert the table object into a data frame. For this, we can use the R function as.data.frame().

## We can also the number of samples per subtype of leukemia in a
## well-formated table. For this we convert the table in a data.frame
## object, which can be treated by the kable() function. 
kable(as.data.frame.table(cancer.type.counts),
      caption = "Number of biological samples per subtype of acute lymphoblastic leukemia in Den Boer dataset. ")
Number of biological samples per subtype of acute lymphoblastic leukemia in Den Boer dataset.
cancer.type Freq
hyperdiploid 44
pre-B ALL 44
TEL-AML1 43
T-ALL 36
E2A-rearranged (EP) 8
BCR-ABL 4
E2A-rearranged (E-sub) 4
MLL 4
BCR-ABL + hyperdiploidy 1
E2A-rearranged (E) 1
TEL-AML1 + hyperdiploidy 1

We can see that some cancer type are represented by a large number of biological samples (hyperdiploid, pre-B ALL, TEL-AML1 and T-AML) whereas other types are represented by a handful of samples, or even a single one.

Since differential analysis requires a sufficient number of observations per group, we will restrict the following steps to the four best represented subtypes in this data set.

Loading the absent/marginal/present calls

In Affymetrix, each probeset is labelled with a tag which can take three values: Absent (A), Marginal (M), or Present (P), depending on the intensity detected in the different probes representing a given transcript.

## Load Absent/Marginal/Present (AMP) calls
amp.url <- file.path(microarray.data.url, "GSE13425_AMP_Whole.txt")
amp.file <- file.path(dir.data, "GSE13425_AMP_Whole.txt")

if (file.exists(amp.file)) {
  message("Absent/Marginal/Presnt file already downloaded: ", amp.file)
} else {
  message("Downloading AMP table from ", amp.url)
  download.file(url = amp.url, destfile = amp.file)
  message("Local AMP file: ", amp.file)
}

## Load AMP table in memory
amp <- read.table(amp.file, sep="\t", head=T, row=1)

## Count the total number of calls of each type (A/M/P)
table(unlist(amp))

      A       M       P 
2871373  140251 1222146 
## Compute a matrix of Boolean values indicating, for each entry of the expression matrix, whether it is absent or not
absent.calls <- amp=="A"

## Count the number of "Absent" labels per gene as the marginal sum of absent calls matrix.
absent.per.gene <- apply(absent.calls, 1, sum)

## Count the number of genes detected in at least 30 samples
sum(absent.per.gene <= (ncol(expr.matrix) - 30))
[1] 10540

We can draw an histogram indicating the distribution of the number of absent calls per gene.

## Draw an histogram showing the number of absent calls per gene
hist(absent.per.gene, breaks=0:ncol(expr.matrix), 
     xlab="Number of absent calls", ylab="Number of probesets (genes)",
     main="Number of absent calls per probeset")
Distribution of the number of absent calls per gene.

Distribution of the number of absent calls per gene.

Interpretation of the “absent” (A) calls

At first sight, the “absent/marginal/present” calls may seem frightening: altogether, the expression matrix contains more probesets declared absent than present, and the histogram shows that several thousands of genes (4064) are declared absent in absolutely all the samples.

Note that the interpretation of “absent” labels is not obvious. Genes declared “absent” in almost all samples are often considered irrelevant because they are supposed to be undetectable with the microarray, possibly for some technical reason. However, another possibility would be that these genes are actually inactive in most samples (for example because they are generally not expressed in blood cells), and their detection in some samples might reveal a dysregulation of these genes. For this tutorial, we will thus deliberately avoid the classical “filtering” step, which consists in retaining only the genes declared present in a given number of samples.

Note that we will still be able to revise this decision at the very end of the analysis pipeline, by computing the number of genes declared differentially expressed yet absent from most samples.


Symbols and formulas

We provide in the help page a table of symbols and formulas + a short reminder of the principle of the Student t test.


Exercises

We showed above how to count the number of biological samples per subtype of ALL. In the questions below, we will refer to these subtypes (groups) as g1 and g2.

Each student should now choose two groups containing at least 30 samples, and develop a code to perform the following operations.

For example, for this tutorial I will compare the TEL-AML1 and T-ALL subtypes.

During this practical, you will be led to draw a succession of plots (scatter plots, histograms, …), which will progressively lead you to compute the formula of the \(t_{obs}\) statistics. The important is not so much the final result (you can find this formula in any textbook of statistics), but the way to it: at each step, you will have to summarize in 2 sentences what can be seen on the figure you just generated, and how it relates to the figure generated at the previous step (this is the actual exercise).


Choice of towo groups of interest

Q1. Choose two groups containing at least 30 individuals each

Count the number of samples per subtype of ALL, choose two groups (g1, g2) with at least 30 individuals, and build two vectors (g1.columns, g2.columns) indicating which columns of the matrix correspond to these respective groups.

print(data.frame(sort(table(cancer.type), decreasing=TRUE)))
                cancer.type Freq
1              hyperdiploid   44
2                 pre-B ALL   44
3                  TEL-AML1   43
4                     T-ALL   36
5       E2A-rearranged (EP)    8
6                   BCR-ABL    4
7    E2A-rearranged (E-sub)    4
8                       MLL    4
9   BCR-ABL + hyperdiploidy    1
10       E2A-rearranged (E)    1
11 TEL-AML1 + hyperdiploidy    1
## Choose the two groups of interest 
## (each student should choose two groups of her/his own, with at least 30 elements)
g1 <- "T-ALL"
g2 <- "TEL-AML1"

## Select the columns of the expression table corresponding to these two groups
g1.columns <- which(cancer.type == g1)
g2.columns <- which(cancer.type == g2)

Graphical exporation of the group-wise sample means

Q2. For each gene, compute the mean expression level across all the 190 samples of the expression matrix.

For this, we use the R function apply(), which applies any function to either the rows or the columns of a matrix.

mean.per.gene <- apply(expr.matrix, 1, mean, na.rm=TRUE)

Q3. Compute the group-wise mean for your two respective groups of interest.

We will now apply the command mean() to a subset of the columns restricted to the selected groups.

## Compute the mean for the whole dataset
all.means <- apply(expr.matrix, 1, mean, na.rm=TRUE)
 
## Compute the mean per group in two separate vectors
## (one vector per group, with one entry per gene)
g1.means <- apply(expr.matrix[,g1.columns], 1, mean, na.rm=TRUE)
g2.means <- apply(expr.matrix[,g2.columns], 1, mean, na.rm=TRUE)

Rather than working with a lot of separate vectors for each computed gene-wise statistics (group means, variances, standard deviations, …), we will create a data frame with one row per gene, and one column per statistics.

## Create a result table, which will contain the results of the successive computations
result.table <- data.frame(m = all.means,
                           m1 = g1.means,
                           m2 = g2.means)

## Check the first entries of the result table
kable(head(result.table),
      caption="First rows of the result table, used to collect different statistics per gene.",
      digits=2)
First rows of the result table, used to collect different statistics per gene.
m m1 m2
DDR1|1007_s_at 4.57 4.20 4.59
RFC2|1053_at 3.17 3.37 3.06
HSPA6|117_at 2.73 2.78 2.86
PAX8|121_at 6.16 6.10 6.15
GUCA1A|1255_g_at 2.18 2.18 2.18
UBA7|1294_at 5.60 5.46 5.73

In the next steps we will progressively add columns to this table. At the end of the analysis, this will enable us to export the totality of the result, or selected columns if we prefer.


Q4. Mean comparison plot.

  1. Draw a plot comparing the mean expression levels of the two groups.
  2. Draw some milestones lines indicating (i) the absence of effect, (ii) an up-regulation with a fold-change 2; (iii) a down-regulation with a fold-change 1/2.
  3. Comment this plot: can you distinguish genes that seem obviously expressed differentially?

We first present a minimal solution.

## Generate a rough plot comparing the mean expression values between the two groups of interest.
plot(result.table$m1, result.table$m2)

We can improve the plot as follows

  • add axes labels (xlab, ylab)
  • add a title (main)
  • choose smaller point size
  • use colours according to the local density of points, in order to enforce visually the zones where many points pile up.
  • draw a grid (grid())
  • use same limits for X and Y axes and start at 0
  • Add to the plot some lines to denote the milestones:

    1. A diagonal line (\(y = x\)) to mark the equality between sample means (no effect) (abline())
    2. Horizontal and vertical lines \(x=0\) and \(y=0\), indicating the boundaries of possible values (expression cannot be negative).
    3. The zero values on the X and Y axes, which are absolute boundaries, since intensity values must be positive by definition.
    4. A fold-change of 2 (remember, data were log2-transformed during normalization).
    5. A fold-change of 1/2.

For this we use the R function abline(), which can be used in three modalities:

  • abline(v=x) to draw a vertical line crossing the X axis at some value x.
  • abline(h=y) to draw an horizontal line crossing the Y axis at some value y.
  • abline(a=a, b=b) to draw a line with the equation \(y = a + bx\).
## Compute the axis range: we want to go from 0 (the minimal possible
## expression value) to the first integer that exceeds (ceiling) all 
## expression values from group 1 and group 2.
axis.range <- c(0, ceiling(max(result.table[,c("m1","m2")])))
plot(result.table$m1, 
     result.table$m2,
     main="Comparison between group means",
     xlab=paste(g1, "mean (RMA-norm log2 values)"),
     ylab=paste(g2, "mean (RMA-norm log2 values)"),
     xlim=axis.range, ylim=axis.range,
     pch=20,
     col = densCols(x=result.table$m1,
                    y=result.table$m2)
)
grid(lty="solid", col="darkgray")

## Add some milestone lines
abline(h=0, col="black")
abline(v=0, col="black")
abline(a=0, b=1)
abline(a=1, b=1, col="red")
abline(a=-1, b=1, col="red")
Comparison between group-wise means of two subtypes of acute lymphoblastic leukemia. Each dot represents one gene. Note that the values are RMA-normalized, and thus log2-transformed. The log-fold change thresholds (red lines) thus appear parallel to the diagonal.

Comparison between group-wise means of two subtypes of acute lymphoblastic leukemia. Each dot represents one gene. Note that the values are RMA-normalized, and thus log2-transformed. The log-fold change thresholds (red lines) thus appear parallel to the diagonal.

Note: as discussed above, all values were log2-transformed during the preprocessing (RMA normalization). Consequently, the mean comparison plot,

  • the 2-fold change (\(FC = 2\)) corresponds to a line \(x = y + 1\), which is parallel to and above the diagonal
  • the 1/2-fold change (\(FC = 0.5\)) corresponds to a line \(x = y - 1\), which is a parallel to and below the diagonal.

Group-wise variances and standard deviations

In the plots above we saw that many genes seem to have a quite different mean expression values between the two selected groups of ALL. However, the difference between two means must be interpreted with caution, because it can reflect either an effect related to the populations themselves (the two cancer types), or stochastic variations of the measurements, which depend on the particular samples that were drawn (sampling fluctuations).

The classical way to solve this issue is to interpret the difference between means relative to the variance.


Q5. Compute the variance and standard deviation of each gene: (i) for all the samples together, irrespective of their sub-type. (ii) For each of the two sampe types that you chose.

Beware, the R function var() does not compute the sample variance (\(s^2\)), but rather a sample-based estimate of the population variance (\(\hat{\sigma}^2\)). We thus need to apply the inverse correction as the one used to estimate population variance from sample variance.

\[s^2 = \hat{\sigma}^2 \cdot \frac{n-1}{n}\]

## Compute sample variance for all genes together
n <- ncol(expr.matrix)
result.table$var<- (n-1)/n *  apply(expr.matrix, 1, var, na.rm=TRUE)

## Compute the sample variance for each group, knowing that the var() 
## function returns the corrected estimate of population variance. 
n1 <- length(g1.columns) ## Count the number of samples in group 1
result.table$var.g1  <- (n1-1)/n1*apply(expr.matrix[,g1.columns], 1, var, na.rm=TRUE)
n2 <- length(g2.columns) ## Count the number of samples in group 2
result.table$var.g2  <- (n2-1)/n2*apply(expr.matrix[,g2.columns], 1, var, na.rm=TRUE)

## Compute standard deviations
result.table$sd <- sqrt(result.table$var) ## Sample sd for the whole expression matrix
result.table$sd.g1 <- sqrt(result.table$var.g1) ## Sample sd for group 1
result.table$sd.g2 <- sqrt(result.table$var.g2) ## Sample sd for group 2


## Check the result
kable(head(result.table), digits=2,
      caption="First rows of the result table, used to collect different statistics per gene.")
First rows of the result table, used to collect different statistics per gene.
m m1 m2 var var.g1 var.g2 sd sd.g1 sd.g2
DDR1|1007_s_at 4.57 4.20 4.59 0.19 0.08 0.13 0.43 0.28 0.37
RFC2|1053_at 3.17 3.37 3.06 0.06 0.09 0.03 0.25 0.30 0.18
HSPA6|117_at 2.73 2.78 2.86 0.45 0.61 0.77 0.67 0.78 0.88
PAX8|121_at 6.16 6.10 6.15 0.03 0.03 0.03 0.17 0.18 0.16
GUCA1A|1255_g_at 2.18 2.18 2.18 0.01 0.01 0.01 0.09 0.08 0.09
UBA7|1294_at 5.60 5.46 5.73 0.19 0.19 0.15 0.44 0.43 0.38

Q6. Variance comparison.

  1. Draw a plot comparing the variances of each gene (one dot per gene) between your two groups of interest.
  2. Draw a plot comparing the standard deviation of each gene (one dot per gene) between your two groups of interest.
  3. Comment these plots: does it seem reasonable to assume homoscedasticity?
par(mfrow=c(1,2))

## Compare gene-wise variances between the two groups.
max.var <- max(result.table[, c("var.g1","var.g2")])
var.range <- c(0, ceiling(max.var))
plot(result.table$var.g1,
     result.table$var.g2,
     main="Group-wise variances",
     xlab=paste("Sample variance for", g1),
     ylab=paste("Sample variance for", g2),
     xlim=var.range, ylim=var.range, 
     col=densCols(result.table$var.g1, result.table$var.g2)
)
grid(lty="solid", col="grey")
abline(col="black", a=0, b=1) ## Plot the diagonal (no difference between variances)


## Compare gene-wise standard deviations between the two groups.
max.sd <- max(result.table[, c("sd.g1","sd.g2")])
sd.range <- c(0, ceiling(max.sd))
plot(result.table$sd.g1,
     result.table$sd.g2,
     main="Group-wise standard dev.",
     xlab=paste(g1, "sample standard deviation"),
     ylab=paste(g2, "sample standard deviation"),
     xlim=sd.range, ylim=sd.range, 
     col=densCols(result.table$sd.g1, result.table$sd.g2)
)
grid(lty="solid", col="grey")
abline(col="black", a=0, b=1) ## Plot the diagonal (no difference between standard deviations)
Comparison between gene-wise variances for group 1 (X) and group 2 (Y).

Comparison between gene-wise variances for group 1 (X) and group 2 (Y).

par(mfrow=c(1,1))

Means versus variances

Q7. Draw a plot showing the standard deviation (all samples of all groups together) as a function of the mean expression value.

## Draw a plot comparing the variance (Y) to the mean (X) for the whole expression table
plot(result.table$m, result.table$sd, 
     main=paste("All samples: mean vs var"),
     xlab=paste("Sample mean"),
     ylab=paste("Sample standard deviation"),
     col=densCols(result.table$m, result.table$s2)
     )
grid(lty="solid", col="gray")
Variance (Y) versus mean (X) expression value for each gene (dot) across all the samples.

Variance (Y) versus mean (X) expression value for each gene (dot) across all the samples.


Q8. Draw, for each group separately, a plot comparing sample mean and sample variance

We will set the R graphical parameter mfrow to draw the two graphs side-by-side for the sake of comparison.

## Draw a plot comparing the variance (Y) to the mean (X) for each group
par(mfrow=c(1,2))
plot(result.table$m1, result.table$var.g1, 
     main=paste(g1, "mean vs var"),
     xlab=paste(g1, "sample mean"),
     ylab=paste(g1, "sample variance"),
     col=densCols(result.table$m1, result.table$var.g1)
     )
grid(lty="solid", col="gray")
plot(result.table$m2, result.table$var.g2, 
     main=paste(g2, "mean vs var"),
     xlab=paste(g2, "sample mean"),
     ylab=paste(g2, "sample variance"),
     col=densCols(result.table$m2, result.table$var.g2)
     )
grid(lty="solid", col="gray")

par(mfrow=c(1,1))

What did we learn so far?

  1. On the mean comparison plot, most genes are aligned along the diagonal, with some fluctuations. Some genes are however clearly distant, suggesting that they are over-expressed in TEL-AML1 relative to T-ALL. Other points are clearly much lower than the diagonal, suggesting that they are over-expressed in T-ALL relative to TEL-AML1.

  2. The variance and standard deviation plots show very large differences between group 1 and group 2. This is a general property, which does not seem to be restricted to genes with perturbed expression. Indeed, in contrast with the mean comparison plot, the points do not seem to be aligned along the diagonal. Instead, most points seem to spread all over the lower square. In such conditions, we could hardly support the assumption of homoscedasticity (equal variance between groups), and we should thus apply Welch’s test rather than Student. We will come back to this point later.

  3. The mean versus variance plots show us that different genes can have very different variances, even though they have similar means. This means that the differences between the means will have to be evaluated relative to the variances. This effect is visible when the variance is computed for all the samples together, but also for each one of the two groups of interest, when the plots are drawn separately.


The MA plot: log fold changes versus average expression

Q9. Draw a scatter plot representing one dot per gene, with the effect size (variable \(d\), ordinate) as a function of the average between these subtype-specific expression means (variable \(A\), abcsissa). Comment on the relationship between average expression level of expression of a gene and its expression mean.*

## The effect size is the difference between group means.
## Since the RMA-normalized data is log2-transformed, this effect size 
## is equivalent to the "M" variable of the MA plots. 
result.table$d <- result.table$m1 - result.table$m2

## Mean of the two group means.
## Since the RMA-normalized data is log2-transformed, this effect size 
## is equivalent to the "A" variable of the MA plots.  
result.table$A <- (result.table$m1 + result.table$m2)/2

plot(result.table$A,
     result.table$d,
     main="MA plot",
     xlab= "Average expression: A = (m1 + m2)/2",
     ylab = "Effect size: d = m2 - m1",
     col=densCols(result.table$d, result.table$A)
     )
grid(lty="solid", col="grey")
abline(col="black", h=0)
abline(col="red", h=c(1, -1))
Effect size (Y) versus average expression (X). Since the values were log2-transformed during the RMA normalization procedure, this plot is equivalent to the MA plot representation which became popular with two-color microarrays.

Effect size (Y) versus average expression (X). Since the values were log2-transformed during the RMA normalization procedure, this plot is equivalent to the MA plot representation which became popular with two-color microarrays.

Let us keep in mind that the RMA-normalized values are log2-transformed. Consequently:

  • the difference between means (\(d\)) corresponds to the log2-fold change between the two conditions (the \(M\) variable of the MA plots),
  • the average between group-wise means corresponds to a geometric mean between the intensity values (the \(A\) variable of MA plots).

The graph we just drawn thus corresponds to the classical MA plot, representing the log2-fold-change (\(M\)) as a function of the geometric mean of expression (\(A\)).

Note the topological correspondence between this plot and the mean comparison plot above.

  • the diagonal of the group-wise mean comparison plot became the horizontal axis of the MA plot
  • the fold-change thresholds are replaced by horizontal lines: for any \(y\) value on the Y axis, \(A \ge y\) is equivalent to a fold-change \(FC \ge 2^y\), and conversely \(A \le y\) is equivalent to \(FC \le 2^y\). On the plot we marked in red the threshold lines for \(A \ge 1 \equiv FC \ge 2\) and \(A \le -1 \equiv FC \le 0.5\).

Estimating the pooled standard deviation

Q10. For each gene, compute the pooled standard deviation.

We can simply derive the standard deviations from the above computed variances. The pooled standard deviation is computed according to the following formula (see the help page for more details]).

\[\hat{\sigma}_p = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1+n_2-2}}\]

\[s = \sqrt{s^2}\]

## Compute pooled standard deviation
result.table$var.pooled <- (n1* result.table$var.g1 + n2* result.table$var.g2) / (n1+n2-2)
result.table$sd.pooled <- sqrt(result.table$var.pooled)

## Check the result
kable(head(result.table), digits=2,
      caption="First rows of the result table, used to collect different statistics per gene.")
First rows of the result table, used to collect different statistics per gene.
m m1 m2 var var.g1 var.g2 sd sd.g1 sd.g2 d A var.pooled sd.pooled
DDR1|1007_s_at 4.57 4.20 4.59 0.19 0.08 0.13 0.43 0.28 0.37 -0.40 4.40 0.11 0.33
RFC2|1053_at 3.17 3.37 3.06 0.06 0.09 0.03 0.25 0.30 0.18 0.31 3.22 0.06 0.25
HSPA6|117_at 2.73 2.78 2.86 0.45 0.61 0.77 0.67 0.78 0.88 -0.07 2.82 0.72 0.85
PAX8|121_at 6.16 6.10 6.15 0.03 0.03 0.03 0.17 0.18 0.16 -0.05 6.13 0.03 0.17
GUCA1A|1255_g_at 2.18 2.18 2.18 0.01 0.01 0.01 0.09 0.08 0.09 -0.01 2.18 0.01 0.09
UBA7|1294_at 5.60 5.46 5.73 0.19 0.19 0.15 0.44 0.43 0.38 -0.27 5.59 0.17 0.41

Q11. Draw a scatter plot with the pooled standard deviation (ordinate) as a function of the effect size (abcsissa). Comment on the relationship between inter-group differences and dispersion (standad deviation).

## Plot the pooled standard deviations as a function of the effect size. 
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
     result.table$sd.pooled, ylab="pooled standard deviation",
     main = "Standard deviation vs. effect size",
     col=densCols(result.table$d, result.table$sd.pooled)
     )
grid(col="grey", lty="solid")

## Line marking the absence of effect
abline(v=0, col="black") 

## Line marking the inferior boundary of standard deviations (cannot be negative)
abline(h=0, col="black") 

## Lines marking arbitrary effect sizes of 1 and -1 respectively,
## which correspond to fold-changes of 2 and 1/2 in raw microarray 
## intensities.
abline(v=c(-1,1), col="red") 
Pooled standard deviation (Y) as a function of the effect size (X)

Pooled standard deviation (Y) as a function of the effect size (X)

The plot above shows that the same effect size (e.g. a difference of \(d=2\) between the two groups) can be associated to very different standard deviations. We thus get the feeling that a given effect size should not be considered as significant when they are associated with a high rather than a low value for the pooled standard deviation. Indeed, when a gene is associated to a high sample variability, our estimation of the difference between groups can fluctuate, and we might have obtained another estimation of the difference with a different sampling.

This is precisely what the \(t\) statistics serves for: it provides an interpretation the difference between the means relative to the estimated sampling variations. In the next section we will compute it.


Q12. Compute Student \(t\) statistics (\(t_{obs}\)) for each gene. Draw a plot representing this statistics (ordinate) as a function of the effect size (abcsissa). Comment on this plot and compare it to the previous ones.

\(t_{obs} = \frac{\hat{\delta}}{\hat{\sigma}_\delta} = \frac{\bar{x}_2 - \bar{x}_1}{\sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1+n_2-2} \left(\frac{1}{n_1}+ \frac{1}{n_2}\right)}}\)

## Computation of Student t statistics
result.table$t.Student <- result.table$d / (result.table$sd.pooled * sqrt(1/n1+1/n2))

## Plot the pooled standard deviations as a function of the effect size. 
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
     result.table$t.Student, ylab="Student t statistics",
     main = "Student t statistics vs. effect size",
     col=densCols(result.table$d, result.table$t.Student)
     )
grid(col="grey", lty="solid")

## Lines marking the absence of effect on both effect size and Student t statistics
abline(v=0, col="black") 
abline(h=0, col="black") 

## Lines marking arbitrary effect sizes of 1 and -1 respectively, 
## which correspond to fold-changes of 2 and 1/2 in the raw microarray 
## intensities. 
abline(v=c(-1,1), col="red") 

We can see here that the Student statistics (\(t_S\)) is correlated to the effect size (\(d\)), yet the two statistics show important differences: a given value of effect size (e.g. \(d=1\), denoted by a red vertical line on the plot) corresponds to a wide range of values for the \(t_S\) statistics.

As shown in the table of symbols and formulas, Student statistics has been obtained by dividing the effect size (the raw difference between the two sample means) by an estimation of the standard deviation of this difference (\(\hat{\sigma}_{delta}\)).

In summary, Student’s \(t\) statistics (\(t_S\)) is a way to measure the difference between group means relative to their dispersion.


Student test and p-value

Q13. Using Student probability function (R function pt()), compute the p-value of each gene.

For this, we need to define the degrees of freedom. In a two-sample t-test (as in our case), the degrees of freedom are computed as the number of elements in the two groups (\(n_1\), \(n_2\)) minus the number of means estimated from these elements (we estimated the means of group 1 and group 2, respectively). Thus: \(df = n_1 + n_2 - 2\).

We will perform a two-sided test, i.e. we want to estimate the significance for negative as well as positive differences between the two means. The simplest way to do this is to compute the p-value of the absolute difference, and to multiply it by two. This gives us the probability to observe by chance a \(t_S\) value at least as distant from the null value, on either tails of the distribution.

## Define the sign of the difference: up (m2 > m1) or down (m1 < m2)
result.table$sign <- "null"
result.table$sign[result.table$d > 0] <- "up"
result.table$sign[result.table$d < 0] <- "down"

## Count the number of up, down or null signs
print(table(result.table$sign))

 down  null    up 
11157     2 11124 
## Computation of Student p-value.
## The attribute "lower.tail" will be TRUE if the difference is negative
result.table$p.value <- 2*pt(abs(result.table$t.Student), 
                             df=n1 + n2 - 2, lower.tail = FALSE)

Volcano plot

Q14. Draw a plot showing the p-value as a function of the effect size, and a so-called volcano plot, representing -log10(p-value) as a function of the effect size.

par(mfrow=c(1, 2))

################################################################
## Plot the pooled standard deviations as a function of the effect size. 
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
     result.table$p.value, ylab="Student p-value",
     main = "Student p-value vs. effect size",
     col=densCols(result.table$d, result.table$p.value)
     )
grid(col="grey", lty="solid")

## Line marking the null effect
abline(v=0, col="black") 

## Lines marking arbitrary effect sizes of 1 and -1 respectively, 
## which correspond to fold-changes of 2 and 1/2 in the raw microarray 
## intensities. 
abline(v=c(-1,1), col="red") 

## Line marking an arbitrary threshold of 5% on the p-value
abline(h=0.01, col="red")

################################################################
## Draw a volcano plot
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
     -log10(result.table$p.value), ylab="-log10(p-value)",
     main = "Volcano plot",
     col=densCols(result.table$d, -log10(result.table$p.value))
     )
grid(col="grey", lty="solid")

## Line marking the null effect
abline(v=0, col="black") 

## Lines marking arbitrary effect sizes of 1 and -1 respectively, 
## which correspond to fold-changes of 2 and 1/2 in the raw microarray 
## intensities. 
abline(v=c(-1,1), col="red") 

## Line marking an arbitrary threshold of 5% on the p-value
abline(h=-log10(0.05), col="red")
(a) P-value as a function of the effect size. The vertical red lines indicate arbitrary thresholds on the effect size, respectively corresponding to fold changes of 1/2 and 2 on the fold change. The horizontal line shows a threshold of 5% on the p-value. Genes located below this line (i.e. with a low p^value) are declared significant. (b) Volcano plot, showing the -log10(p-value) as a function of the effect size. The horizontal red line denotes the same threshold of 5% on the p-value. Genes above this line are declared significant.

  1. P-value as a function of the effect size. The vertical red lines indicate arbitrary thresholds on the effect size, respectively corresponding to fold changes of 1/2 and 2 on the fold change. The horizontal line shows a threshold of 5% on the p-value. Genes located below this line (i.e. with a low p^value) are declared significant. (b) Volcano plot, showing the -log10(p-value) as a function of the effect size. The horizontal red line denotes the same threshold of 5% on the p-value. Genes above this line are declared significant.

par(mfrow=c(1,1))

Q15. Draw an histogram of the p-values with 20 bins. Comment on the shape of the distribution.

## Histogram of the p-value distribution
hist(result.table$p.value, breaks=20, main="P-value histogram", xlab="Pvalue", ylab="Number of probesets", col="grey")

################################################################
## Check that the result for a particular gene is consistent with R t.test() function
test.gene <- 1
test.gene.student <- t.test(x=expr.matrix[test.gene, cancer.type==g1], y = expr.matrix[test.gene, cancer.type==g2], var.equal = TRUE)
print(test.gene.student)

    Two Sample t-test

data:  expr.matrix[test.gene, cancer.type == g1] and expr.matrix[test.gene, cancer.type == g2]
t = -5.2684, df = 77, p-value = 1.215e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5489154 -0.2477901
sample estimates:
mean of x mean of y 
 4.195833  4.594186 
result.table$m1[test.gene]
[1] 4.195833
result.table$m2[test.gene]
[1] 4.594186
result.table$t.Student[test.gene]
[1] -5.268382
result.table$p.value[test.gene]
[1] 1.215063e-06

Q16. Compute the expected number of false positives if we set the p-value threshold to 0.05.

################################################################
## Expected number of positives below a given alpha threshold

## selected significance threshold (somewhat arbitrary)
alpha <- 0.05 

## Count the number of genes declared significant with this alpha on the nominal p-value
n.positive <- sum(result.table$p.value < alpha)
n.tests <- nrow(result.table) ## Number of probesets (genes) tested
f.positive <-  n.positive / n.tests ## Proportion of genes declared positive
exp.n.positive <- alpha * n.tests

Q17. Compute the *e-value of each gene.

The e-value of a gene is the number of false positives that would be expected if we set the threshold to the p-value that was obtained for this particular gene. In other terms, if we consider this gene (and all the genes having a smaller or equal p-value) as significant, how many false positives would we expect ?

The e-value is simply obtained by multiplying the nominal p-value (i.e. the p-value computed for each individual test) by the number of tests (in our case, the number of genes in the expression matrix).

################################################################
## Compute the e-value
n.tests <- nrow(expr.matrix)
result.table$e.value <- result.table$p.value * n.tests
result.table$sig <- -log10(result.table$e.value) ## Significance

Q18 Draw a scatter plot of the e-value (ordinate) as a function of the effect size. On this plot, draw a line marking a reasonable threshold on the e-value and jusify your choice. Count the number of genes declared positive with this threshold.

################################################################
## Draw a volcano plot with the e-value rather than p-value
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
     result.table$sig, ylab="sig = -log10(e-value)",
     main = "E-value volcano plot",
     col=densCols(result.table$d,result.table$sig),
     ylim=c(min(result.table$sig), max(result.table$sig)*1.2)
     )
grid(col="grey", lty="solid")

## Line marking the null effect
abline(v=0, col="black") 

## Lines marking arbitrary effect sizes of 1 and -1 respectively, 
## which correspond to fold-changes of 2 and 1/2 in the raw microarray 
## intensities. 
abline(v=c(-1,1), col="red") 

## Line marking an arbitrary threshold of 5% (stringent) or 1% (lenient) on the e-value
abline(h=-log10(alpha), col="red") ## Stringent threshold: e-value <= 0.05
abline(h=-log10(1), col="red", lty="dashed") ## Lenient threshold : e-value <= 1

## Draw a legend with the number of genes declared positive
legend("topleft", 
       legend=c(paste("e-value <=", alpha, ":", sum(result.table$e.value <= alpha)),
              paste("e-value <=", 1, ":", sum(result.table$e.value <= 1))),
              lty=c("solid","dashed"),
              col="red",
              lwd=2)


Q19. Compute the False Discovery Rate (FDR) using the R function stats::p.adjust(). Draw a volcano plot with FDR rather than p-values. Compare the p-value, e-value and FDR, and discuss about their respective interpretations.

################################################################
## Compute the FDR (False Discovery Rate)
result.table$fdr <- stats::p.adjust(result.table$p.value, method="fdr")

## Compare FDR to p-value
plot(result.table$p.value, result.table$fdr, log="xy"); abline(a=0, b=1); grid()
## This FDR seems strange, it does not correspond to my computations, and it seems to be a monotonous function of the p-value, which should not be the case

## Compute the q-value
# source("https://bioconductor.org/biocLite.R")
# biocLite("qvalue")
library(qvalue)

################################################################
## Draw a volcano plot with the e-value rather than p-value
plot(result.table$d, xlab="Effect size (d = m2 - m1)",
     -log10(result.table$fdr), ylab="sig = -log10(FDR)",
     main = "FDR volcano plot",
     col=densCols(result.table$d,-log10(result.table$fdr)),
     ylim=c(min(-log10(result.table$fdr)), max(-log10(result.table$fdr))*1.2)
     )
grid(col="grey", lty="solid")

## Line marking the null effect
abline(v=0, col="black") 

## Lines marking arbitrary effect sizes of 1 and -1 respectively, 
## which correspond to fold-changes of 2 and 1/2 in the raw microarray 
## intensities. 
abline(v=c(-1,1), col="red") 

## Line marking an arbitrary threshold of 5% on the FDR
abline(h=-log10(alpha), col="red") 

## Draw a legend with the number of genes declared positive
legend("topleft", 
       legend=c(paste("e-value <=", alpha, ":", sum(result.table$e.value <= alpha)),
              paste("e-value <=", 1, ":", sum(result.table$e.value <= 1))),
              lty=c("solid","dashed"),
              col="red",
              lwd=2)

## Negative control set

In order to get a feeling of the expected behaviour of the test under null hypothesis (in absence of differences between the two groups), we will generate a negative control set, i.e. an artificial dataset compliant with the assumptions of the Welch test (normal populatins with different variances), and measure the same statistics as with the expression matrix.

In addition, we will calibrate our random dataset in order to simulate the actual expression matrix in the following way.

  1. The artificial data will be drawn in random normal populations of equal means (\(H_0= \mu_1 = \mu2\)) but where the variances (\(\sigma1\), \(\sigma2\)) may differ between groups (heteroscedasticity).

  2. For each row of the random dataset, the population means will be set equal to the mean expression value of the corresponding row (gene) of the expression table.

We will then generate the same plots as above, but systematically compare the results obtained with the expression matrix and with the random data, respectively.

g <- nrow(expr.matrix)

## First group
n1 <- length(g1.columns)
mu1 <- apply(expr.matrix[, c(g1.columns, g2.columns)], 1, mean, na.rm=TRUE)
sigma1 <- apply(expr.matrix[, g1.columns], 1, sd, na.rm=TRUE)

## Second group
n2 <- length(g2.columns)
mu2 <- apply(expr.matrix[, c(g1.columns, g2.columns)], 1, mean, na.rm=TRUE)
sigma2 <- apply(expr.matrix[, g2.columns], 1, sd, na.rm=TRUE)

group1 <- matrix(nrow=g, rnorm(n = g * n1, mean = mu1, sd = sigma1))
# View(group1)

group2 <- matrix(nrow=g, rnorm(n = g * n2, mean = mu2, sd = sigma2))
dim(group2)
[1] 22283    43
## Merge the two datasets in a single matrix
rnorm.data <- cbind(group1, group2)
class(rnorm.data)
[1] "matrix"
dim(rnorm.data)
[1] 22283    79
rand.result.table <- data.frame(
  m1 = apply(group1, 1, mean),
  m2 = apply(group2, 1, mean))

par(mfrow=c(1,2))
## Compute the axis range: we want to go from 0 (the minimal possible
## expression value) to the first integer that exceeds (ceiling) all 
## expression values from group 1 and group 2.
axis.range <- c(0, ceiling(max(result.table[,c("m1","m2")])))
plot(result.table$m1, 
     result.table$m2,
     main="Comparison between group means\nDen Boer, 2009",
     xlab=paste(g1, "mean (RMA-norm log2 values)"),
     ylab=paste(g2, "mean (RMA-norm log2 values)"),
     xlim=axis.range, ylim=axis.range,
     pch=20,
     col = densCols(x=result.table$m1,
                    y=result.table$m2)
)
grid(lty="solid", col="darkgray")
abline(a=0, b=1)
abline(a=1, b=1, col="red")
abline(a=-1, b=1, col="red")

plot(rand.result.table$m1, 
     rand.result.table$m2,
     main="Random dataset under H0",
     xlab=paste("Group 1 mean"),
     ylab=paste(g2, "Group 2 mean"),
     xlim=axis.range, ylim=axis.range,
     pch=20,
     col = densCols(x=rand.result.table$m1,
                    y=rand.result.table$m2)
)
grid(lty="solid", col="darkgray")
# plot(rand.mean1, rand.mean2,
#      col="grey", main="Negative control\n(random normal under H0)", xlab="m1", ylab="m2"
#      )
# grid()
abline(a=0, b=1)
abline(a=1, b=1, col="red")
abline(a=-1, b=1, col="red")
Comparison of sample means between the two groups.

Comparison of sample means between the two groups.

par(mfrow=c(1,1))
## The effect size is the difference between group means.
## Since the RMA-normalized data is log2-transformed, this effect size 
## is equivalent to the "M" variable of the MA plots. 
rand.result.table$d <- rand.result.table$m1 - rand.result.table$m2

## Mean of the two group means.
## Since the RMA-normalized data is log2-transformed, this effect size 
## is equivalent to the "A" variable of the MA plots.  
rand.result.table$A <- (rand.result.table$m1 + rand.result.table$m2)/2

par(mfrow=c(1,2))
xrange <- range(result.table$A)
yrange <- range(result.table$d)
plot(result.table$A,
     result.table$d,
     main="MA plot\nDen Boer, 2009",
     xlab= "Average expression: A = (m1 + m2)/2",
     ylab = "Effect size: d = m2 - m1",
     xlim=xrange,
     ylim=yrange,
     col=densCols(result.table$d, result.table$A)
     )
grid(lty="solid", col="grey")
abline(col="black", h=0)
abline(col="red", h=c(1, -1))

plot(rand.result.table$A,
     rand.result.table$d,
     main="MA plot\nRandom dataset under H0",
     xlab= "Average expression: A = (m1 + m2)/2",
     ylab = "Effect size: d = m2 - m1",
     xlim=xrange,
     ylim=yrange,
     col=densCols(rand.result.table$d, rand.result.table$A)
     )
grid(lty="solid", col="grey")
abline(col="black", h=0)
abline(col="red", h=c(1, -1))

par(mfrow=c(1,1))

References

Den Boer, Monique L, Marjon van Slegtenhorst, Renée X De Menezes, Meyling H Cheok, Jessica G C A M Buijs-Gladdines, Susan T C J M Peters, Laura J C M Van Zutven, et al. 2009. “A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study.” The Lancet Oncology 10 (2): 125–34.