Abbreviations

Abbrev Meaning
ALL acute lymphoblastic leukemia
AML acute myeloblastic leukemia
DEG differentiall expressed genes
GEO Gene Expression Omnibus database
KNN K-nearest neighbours
LDA linear discriminant analysis
QDA quadratic discriminant analysis
PCA principal component anlaysis
LOO leave-one-out
CV
## Check the requirement for some packages
packages <- c("knitr")
for (pkg in packages) {
  if (!suppressPackageStartupMessages(require(pkg, quietly=TRUE, character.only = TRUE))) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

## Install some bioconductor packages
pkg <- "qvalue"
if (!suppressPackageStartupMessages(require(pkg, quietly=TRUE, character.only = TRUE))) {
  source("http://bioconductor.org/biocLite.R")
  biocLite();
  biocLite(pkg)
}
library(pkg, character.only = TRUE)


## Load the curstom function t.test.multi() - see protocol on differential expression
## Load a custom the library for multiple t tests
url.stats4bioinfo <- "http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics"
source(file.path(url.stats4bioinfo, 'R-files/config.R'))
[1] "Data repository http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/data"
[1] "R scripts source http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/R-files"
[1] "Results will be saved to /Users/jvanheld/course_stats_bioinfo/results"
[1] "Figures will be saved to /Users/jvanheld/course_stats_bioinfo/figures"
source(file.path(url.stats4bioinfo, 'R-files/util/util_student_test_multi.R'))

Goal of this tutorial

In this tutorial, we will put in practice some fundamental concepts of supervised classification.

  1. Distinction between unsupervised (clustering) and supervised classification.
  2. Steps of an analysis: training, testing and prediction.
  3. Evaluation of the classification: cross-validation (CV), leave-one-out (LOO).
  4. The problem of over-dimensionality and the risk of overfitting
  5. Feature selection methods

Study case

For the practical, we will use a cohort comprized of 190 samples from patients suffering from Acute Lymphoblastic Leukemia (ALL) from DenBoer (2009). The raw data has previously been retrieved from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/).


R configuration

  • Open a terminal
  • Start R

We first need to define the URL of the course (dir.base), from which we will download some pieces of R code and the data sets to analyze.

## Specify the URL of the base for the course
dir.base <- 'http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics'
dir.base.ASG1 <- 'http://pedagogix-tagc.univ-mrs.fr/courses/ASG1'

The following command loads a general configuration file, specifying the input directories (data, **R*** utilities) and creating output directories on your computer (the corresponding paths are stored in the variables dir.results and dir.figures).

## Load the general configuration file
source(file.path(dir.base, 'R-files', 'config.R'))
[1] "Data repository http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/data"
[1] "R scripts source http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/R-files"
[1] "Results will be saved to /Users/jvanheld/course_stats_bioinfo/results"
[1] "Figures will be saved to /Users/jvanheld/course_stats_bioinfo/figures"
setwd(dir.results)
print(paste("Result directory", dir.results))
[1] "Result directory /Users/jvanheld/course_stats_bioinfo/results"

Data loading

We can now load the expression matrix, and check its dimensions.

Beware: this expression matrix weights 20Mb. The download can thus take time, depending on your internet connection. To avoid downloading it repeadedly, we will stored it in our local directory.

# ## Define the location of data directory and file containing expression profiles
# denboer.url <- file.path(dir.base, 'data', 'gene_expression','denboer_2009')
denboer.url <- file.path(dir.base.ASG1, "data", "marrays")
url.expr.matrix <- file.path(denboer.url, 'GSE13425_Norm_Whole.tab')

## Download and store a local copy of the expression matrix if not yet there.
file.expr.matrix <- file.path(dir.results, "GSE13425_Norm_Whole.tab")
if (!file.exists(file.expr.matrix)) {
  message("Downloading expression matrix from URL: ", url.expr.matrix)
  download.file(url = url.expr.matrix, destfile = file.expr.matrix)
  message("Local copy of expression matrix stored in file: ", file.expr.matrix)
}

## Load the dataset from denboer2009
message("Loading expression matrix from file ", file.expr.matrix)
denboer2009.expr <- read.table(file.expr.matrix, sep = "\t", head = T, row = 1)
print(dim(denboer2009.expr))
[1] 22283   190
## Should give this: 22283   190


## Load the pheno table, which provides information about all the samples
## Load phenotypic data
pheno.url <- file.path(denboer.url, 'phenoData_GSE13425.tab')
pheno.file <- file.path(dir.results, 'phenoData_GSE13425.tab')
if (!file.exists(pheno.file)) {
  message("Downloading pheno table from URL: ", pheno.url)
  download.file(url = pheno.url, destfile = pheno.file)  
  message("Storing local copy of pheno table in file: ", pheno.file)
}
message("Loading pheno data from ", pheno.file)
denboer2009.pheno <- read.table(pheno.file, sep='\t', head=TRUE, row=1, comment.char="")

## Load a file with group descriptions (short label, color)
group.descr.url <- file.path(denboer.url, 'GSE13425_group_descriptions.tab')
group.descr.file <- file.path(dir.results, 'GSE13425_group_descriptions.tab')
if (!file.exists(group.descr.file)) {
  message("Downloading group descriptions from URL: ", group.descr.url)
  download.file(url = group.descr.url, destfile = group.descr.file)  
  message("Storing local copy of group descriptions in file: ", group.descr.file)
}
message("Loading group descriptions from ", group.descr.file)
group.descriptions <- read.table(group.descr.file, sep='\t', head=TRUE, row=1)

Once the whole data set has been loaded, the data frame denboer2009.expr should contain 22,283 rows (genes) and 190 columns (samples).

We can now inspect the “phenotypic” data.

Remark: the data frame named denboer2009.pheno contains the information defined as “phenotypic” for general analysis of microarrays. However, in the case of DenBoer data set, the different classes of ALL are characterized by genotypic characteristics (the mutation that caused the leukemia) rather than by phenotypic traits.

## Get the list of column names in the pheno table.
names(denboer2009.pheno)
[1] "Sample.title"               "Sample.source.name.ch1"     "Sample.characteristics.ch1" "Sample.description"         "sample.labels"              "sample.colors"             

The column Sample.title indicates the cancer subtype corresponding to each sample. We can count the number of samples per subtype, and display them by decreasing group size.

## Print the subtypes associated to each sample
print(sample(denboer2009.pheno$Sample.title, size = 10))
 [1] T-ALL               TEL-AML1            T-ALL               pre-B ALL           E2A-rearranged (EP) pre-B ALL           T-ALL               TEL-AML1            T-ALL               TEL-AML1           
Levels: BCR-ABL BCR-ABL + hyperdiploidy E2A-rearranged (E-sub) E2A-rearranged (E) E2A-rearranged (EP) hyperdiploid MLL pre-B ALL T-ALL TEL-AML1 TEL-AML1 + hyperdiploidy
## Print the number of samples per cancer type
kable(data.frame("n"=sort(table(denboer2009.pheno$Sample.title),decreasing=T)))
n.Var1 n.Freq
hyperdiploid 44
pre-B ALL 44
TEL-AML1 43
T-ALL 36
E2A-rearranged (EP) 8
BCR-ABL 4
E2A-rearranged (E-sub) 4
MLL 4
BCR-ABL + hyperdiploidy 1
E2A-rearranged (E) 1
TEL-AML1 + hyperdiploidy 1

For the sake of visualization, we also defined short sample labels corresponding to each ALL subtype.

## Print the subtypes associated to each sample
sample.labels <- as.vector(denboer2009.pheno$sample.labels)
sample.colors <- as.vector(denboer2009.pheno$sample.colors)

print(sample(sample.labels, size = 10))
 [1] "BE" "T"  "Bt" "Bt" "Bo" "T"  "Bo" "Bh" "BM" "Bt"
## Print the number of samples per cancer type
kable(data.frame("n"=sort(table(sample.labels),decreasing=T)))
n.sample.labels n.Freq
Bh 44
Bo 44
Bt 43
T 36
BEp 8
Bc 4
BEs 4
BM 4
Bch 1
BE 1
Bth 1

The correspondence between sample titles (subtypes) and sample labels (abbreviated subtypes) is stored in group.labels. We also defined group-specific colors (group.colors) to highlight the different subtypes of ALL on the plots.

## Print the correspondences in a more readable way
kable(group.descriptions)
group.labels group.colors
T-ALL T darkgreen
TEL-AML1 Bt red
TEL-AML1 + hyperdiploidy Bth blue
hyperdiploid Bh brown
E2A-rearranged (E) BE green
E2A-rearranged (EP) BEp orange
E2A-rearranged (E-sub) BEs darkgray
BCR-ABL Bc cyan
BCR-ABL + hyperdiploidy Bch black
MLL BM violet
pre-B ALL Bo darkblue
group.labels <- group.descriptions$group.labels
names(group.labels) <- row.names(group.descriptions)
group.colors <- group.descriptions$group.colors
names(group.colors) <- row.names(group.descriptions)
group.legend <- paste(sep=" = ",group.labels, row.names(group.descriptions))

Reducing the dimensionality of the data - feature selection

An recurrent problem with microarray data is the large dimensionality of the data space. The dataset we are analyzing contains 190 samples (the subjects), each characterized by 22,283 gene expression values (the variables).

The number of variables (also called features, the genes in our case) thus exceeds by far the number of objects (samples in this case). This sitation is qualified of over-dimensionality, and poses serious problems for classification.

In unsupervised classification (clustering), the relationships between objects will be affected, since a vast majority of the features are likely to be uninformative, but will however contribute to the computed (dis)similarity metrics (whichever metrics is used). Overdimensionality will somewhat mask the signal (biologically relevant relationships between gene groups) with noise. In supervised classification, the effect may be even worse: a program will be trained to recognize classes on the basis of supriousn differences found in any combination of the input variables.

It is thus essential, for both unsupervised and unsupervised classification, to perform some feature selection before applying the actual classification.

An arbitrary pair of genes

To get some feeling about the data, we will compare the expression levels of two genes selected in an arbitrary way (resp. \(236^{th}\) and \(1213^{th}\) rows of the profile table).

## Plot the expression profiles of two arbitrarily selected genes
g1 <- 236
g2 <- 1213
x <- as.vector(as.matrix(denboer2009.expr[g1,]))
y <- as.vector(as.matrix(denboer2009.expr[g2,]))
plot(x,y,
       col=sample.colors,
       type="n",
       panel.first=grid(col="black"),     
       main="Den Boer (2009), two arbitrary genes", 
      xlab=paste("gene", g1), ylab=paste("gene", g2))
text(x, y,labels=sample.labels,col=sample.colors,pch=0.5)
legend("topright",col=group.colors, 
         legend=group.legend,pch=1,cex=0.6,bg="white",bty="o")
**Comparison of expression values per sample for two arbitrary genes**. Each dot corresponds to one sample, showe color and label denotes the ALL group of the patient.

Comparison of expression values per sample for two arbitrary genes. Each dot corresponds to one sample, showe color and label denotes the ALL group of the patient.

# silence <- dev.copy2pdf(file="two_arbitrary_genes_plot.pdf", width=8, height=8)

Each dot corresponds to one sample. Since the genes were initially selected at random, we don’t expect them to be particularly good at discriminating the different subtypes of ALL. However, we can already see some emerging patterns: the T-ALL (labelled “T”) show a trends to strongly express the \(236^{th}\) gene, and, to a lesser extent, several hyperdiploid B-cells (label “Bh”) show a high expression of the \(1213^{th}\) gene. Nevertheless, it would be impossible to draw a boundary that would perfectly separate two subtypes in this plot.

Variance filter

In order to reduce the dimensionality of the data set, we will sort the genes according to their variance, and retain a defined number of the top-ranking genes. Note that this ranking is a very rudimentary way to select a subset of genes. Indeed, nothing guarantees us that the genes showing the largest inter-individual fluctuations of expression have anything related to cancer type. In further sections, we will present alternative methods of variable ordering, which will explicitly take into account the inter-group variance.

The R function apply() can be used to apply any function to each row (or alternatively each column) of a data table. We set the second argument (margin) to 1, thereby indicating that the function (third argument) must be applied to each row of the input table (first argument).

Variance per gene

## Compute gene-wise variance
var.per.gene <- apply(denboer2009.expr, 1, var)
sd.per.gene <- apply(denboer2009.expr, 1, sd)

## Inspect the distribution of gene-wise variance and standard deviation (more readable)
par(mfrow=c(1,2))
hist(var.per.gene, breaks=100, col="#BBFFDD", main="Variance per gene", xlab="Variance", ylab="Number of genes")
hist(sd.per.gene, breaks=100, col="#BBFFDD", main="Standard dev. per gene", xlab="Standard deviation", ylab="Number of genes")
**Gene-wise distributions of expression variance and standard deviation**.

Gene-wise distributions of expression variance and standard deviation.

silence <- dev.copy2pdf(file=file.path(dir.figures, "denboer2009_variance_per_gene.pdf"), width=10,height=5)
par(mfrow=c(1,1))

We notice that gene-wise variances have a wide dispersion. the histogram shows a right-skewed distribution, with a long tail due to the presence of a few genes with high variance. If we sort genes by decreasing variance, we can see that there is a strong difference between the top and the bottom of the list.

## Sort genes per decreasing variance
genes.by.decr.var <- sort(var.per.gene,decreasing=TRUE)

## Print the 5 genes with highest and lowest variance
head(genes.by.decr.var)
     CD9|201005_at  IL23A|211796_s_at     CD3D|213539_at S100A8|202917_s_at   KLF4|221841_s_at  HLA-DRA|208894_at 
          6.042537           5.929345           5.696317           5.506174           5.245490           5.160823 
tail(genes.by.decr.var)
 C7orf25|53202_at     FMO3|40665_at     SCLY|59705_at NA|AFFX-DapX-5_at  HSD17B6|37512_at NA|AFFX-LysX-5_at 
      0.007035592       0.007000936       0.006630802       0.006543372       0.006439234       0.005077204 

We can then select an arbitrary number of top-ranking genes in the list.

## Select the 30 top-ranking genes in the list sorted by variance.
## This list of genes will be used below as training variables for
## supervised classification.
top.nb <- 20 ## This number can be changed for testing
genes.selected.by.var <- names(genes.by.decr.var[1:top.nb])

## Check the names of the first selected genes
head(genes.selected.by.var, n=top.nb) 
 [1] "CD9|201005_at"       "IL23A|211796_s_at"   "CD3D|213539_at"      "S100A8|202917_s_at"  "KLF4|221841_s_at"    "HLA-DRA|208894_at"   "NA|216379_x_at"      "IL23A|210915_x_at"   "HLA-DRA|210982_s_at" "ITM2A|202746_at"     "RPS4Y1|201909_at"    "NA|209771_x_at"      "NPY|206001_at"      
[14] "SH3BP5|201811_x_at"  "NRN1|218625_at"      "LCK|204891_s_at"     "DDX3Y|205000_at"     "IL23A|213193_x_at"   "DEFA1|205033_s_at"   "TCF4|212386_at"     

Assigning ranks to genes according to some sorting criterion

In the next sections, we will compare methods for selecting genes on the basis of different criteria (variance, T-test, ANOVA test, step-forward procedure in Linear Discriminant Analysis). For this purpose, we will create a table indicating the values and the rank of each gene (rows of the table) according to each selection criterion (columns).

Ranking genes by cross-sample variance

We first instantiate this table with two columns indicating the name and variance of each gene. We will then add columns indicating the rank of each gene according to its variance, and to other criteria.

## Create a data frame to store gene values and ranks 
## for different selection criteria.
gene.ranks <- data.frame(var=var.per.gene)
head(gene.ranks)
                         var
DDR1|1007_s_at   0.188665207
RFC2|1053_at     0.060678822
HSPA6|117_at     0.456321774
PAX8|121_at      0.027842854
GUCA1A|1255_g_at 0.007850359
UBA7|1294_at     0.190810816

The R function rank() assigns a rank to each element of a list of values, by increasing order. We will use a trick to assign ranks by decreasing order: compute the rank of minus variance. The option ties defines the way to treat equal values.

## Beware, we rank according to minus variance, because 
## we want to associate the lowest ranks to the highest variances
gene.ranks$var.rank <- rank(-gene.ranks$var, ties.method='random')
head(gene.ranks, n=10)
                         var var.rank
DDR1|1007_s_at   0.188665207     5367
RFC2|1053_at     0.060678822    10648
HSPA6|117_at     0.456321774     1973
PAX8|121_at      0.027842854    17931
GUCA1A|1255_g_at 0.007850359    22265
UBA7|1294_at     0.190810816     5307
THRA|1316_at     0.017333392    20897
PTPN21|1320_at   0.010962350    22082
CCL5|1405_i_at   0.111180852     7608
CYP2E1|1431_at   0.014258950    21576

In this table, genes are presented in their original order (i.e. their order in the microarray dataset). We can reorder them in order to print them by decreasing variance, and check five top and bottom lines of the sorted table.

## Check the rank of the 5 genes with highest and lowest variance, resp.
kable(gene.ranks[names(genes.by.decr.var[1:5]),], caption = "Five genes with the  highest variance.")
Five genes with the highest variance.
var var.rank
CD9|201005_at 6.042537 1
IL23A|211796_s_at 5.929345 2
CD3D|213539_at 5.696317 3
S100A8|202917_s_at 5.506174 4
KLF4|221841_s_at 5.245490 5
## Print the bottom 5 genes of the variance-sorted table
kable(gene.ranks[names(tail(genes.by.decr.var)),], caption = "Five genes with the  lowest variance.")
Five genes with the lowest variance.
var var.rank
C7orf25|53202_at 0.0070356 22278
FMO3|40665_at 0.0070009 22279
SCLY|59705_at 0.0066308 22280
NA|AFFX-DapX-5_at 0.0065434 22281
HSD17B6|37512_at 0.0064392 22282
NA|AFFX-LysX-5_at 0.0050772 22283

Actually, we have no specific reason to think that genes having a high variance will be specially good at discriminating ALL subtypes. Indeed, a high variance might a priori either reflect cell-type specificities, or unrelated differences between samples resulting from various effects (individual patient genomes, transcriptome, condition, …). For the sake of curiosity, let us plot samples on a XY plot where the abcsissa represents the top-raking, and the ordinate the second top-ranking gene in the variance-ordered list.

## Select the gene with the highest variance
maxvar.g1 <- names(genes.by.decr.var[1])
print(maxvar.g1)
[1] "CD9|201005_at"
## Select the second gene withthe highest variance
maxvar.g2 <- names(genes.by.decr.var[2])
print(maxvar.g2)
[1] "IL23A|211796_s_at"
## Plot the expression profiles of the two genes with highest variance
x <- as.vector(as.matrix(denboer2009.expr[maxvar.g1,]))
y <- as.vector(as.matrix(denboer2009.expr[maxvar.g2,]))
plot(x,y,
      col=sample.colors,
      type='n',
      panel.first=grid(col='black'), 
      main="2 genes with the highest variance", 
      xlab=paste('gene', maxvar.g1), 
      ylab=paste('gene', maxvar.g2))
text(x, y,labels=sample.labels,col=sample.colors,pch=0.5)
legend('topright',col=group.colors, 
         legend=group.legend,pch=1,cex=0.6,bg='white',bty='o')
**Expression values for the two genes with the highest variance**. Labels and colors indicate the ALL group of each patient. Note that some groups appear more or less separated (T, Bt, Bo).

Expression values for the two genes with the highest variance. Labels and colors indicate the ALL group of each patient. Note that some groups appear more or less separated (T, Bt, Bo).

# silence <- dev.copy2pdf(file="denboer2009_2_maxvar_genes.pdf", width=8, height=8)

We can draw boxplots per subtype of ALL to highlight the inter-group differences between expression measurements.

par(mfrow=c(1,2))
boxplot(x ~ sample.labels, las=1,
        #       col=group.colors,
        horizontal=TRUE,
        main=maxvar.g1, 
        col="#BBBBBB")
data.frame(group.colors)
                         group.colors
T-ALL                       darkgreen
TEL-AML1                          red
TEL-AML1 + hyperdiploidy         blue
hyperdiploid                    brown
E2A-rearranged (E)              green
E2A-rearranged (EP)            orange
E2A-rearranged (E-sub)       darkgray
BCR-ABL                          cyan
BCR-ABL + hyperdiploidy         black
MLL                            violet
pre-B ALL                    darkblue
boxplot(y ~ sample.labels, las=1,
        horizontal=TRUE,
        main=maxvar.g2, col="#BBBBBB")
**Expression per ALL classes for the two genes with the highest variance.**

Expression per ALL classes for the two genes with the highest variance.

par(mfrow=c(1,1))

Somewhat surprizingly, the plot shows that, despite the roughness of our ranking criterion (variance across all cancer types), the two top-ranking genes already discriminate several sample types.

  • The expression level of CD9 (abcsissa of the plot) gives a pretty good separation between some cancer types with low expression levels (T and Bt), and some other types characterized by a high expression level (Bh, BEs, BEp). Cancer type Bo is however dispersed over the whole range of CD9 expression.

  • IL23A clearly separates the subtype “T” (high levels) from all other cancer subtypes (low levels).

Variable ordering by Welch t-test (2-groups test)

The t-test tests the hypothesis of equality between the means of two populations. We can use the Welch version of the t-test (which assumes that groups can have different variances) in order to select genes differentially expressed between two goups of samples.

\[m_1 = m_2\]

We thus need to define two groups of samples (multi-group comparisons will be treated by ANOVA). For the dataset on ALL, we have several subtypes of cancer. we will perform pairwise comparisons of mean for a selection of subtypes. In each case, a given subtype (e.g. T-cells) will be compared to all other subtypes pooled together.

We will successvely run Welch’s t-test for the main subtypes (“Bo”, “Bh”, “Bt”, “T”).

\[H_0:; m_{Bo} = m_{others}\]

\[H_0:; m_{Bh} = m_{others}\]

\[H_0:; m_{Bt} = m_{others}\]

\[H_0:; m_{T} = m_{others}\]

In each case, apply the test in to each gene, using the function t.test.multi() defined in the R utilities of this course.

## Load a utility to apply Welch test on each row of a table
source(file.path(dir.util, "util_student_test_multi.R"))

## Define a vector indicating whether each sample 
## belongs to the subtype of interest (e.g. "Bo") or not.
group.of.interest <- "Bo"
one.vs.others<- sample.labels
one.vs.others[sample.labels != group.of.interest] <- "o"
print(table(one.vs.others))
one.vs.others
 Bo   o 
 44 146 

To apply Welch test to each gene of the table, we will use our custom function t.test.multi() (see the previous practical on selection of differentialy expressed genes, DEG).

## Test the mean equality between Bo subtype 
## and all other subtypes 
welch.one.vs.others <- t.test.multi(denboer2009.expr, 
                                    one.vs.others,
                                    volcano.plot = FALSE)
[1] "Mon Nov 20 01:48:12 2017 - Multiple t-test started"
[1] "Mon Nov 20 01:48:15 2017 - Multiple t-test done"

This method returns a table with various statistics about DEG.

kable(head(welch.one.vs.others), caption = "Head of the Welch result table. Each row corrresponds to one probeset (gene), each column to one statistics used for the Welch test.")
Head of the Welch result table. Each row corrresponds to one probeset (gene), each column to one statistics used for the Welch test.
mean.o mean.Bo means.diff var.est.o var.est.Bo sd.est.o sd.est.Bo st.err.diff t.obs df.welch P.value E.value sig fwer q.value fdr rank
DDR1|1007_s_at 4.532534 4.697045 -0.1645112 0.1638439 0.2554725 0.4047763 0.5054428 0.0832371 -1.9764175 60.55752 0.0526672 1.173583e+03 -3.0695137 1.0000000 0.1650378 0.2453654 4783
RFC2|1053_at 3.172397 3.142727 0.0296700 0.0643273 0.0490947 0.2536283 0.2215733 0.0394511 0.7520705 79.96649 0.4542167 1.012131e+04 -4.0052367 1.0000000 0.4855777 0.7219194 14020
HSPA6|117_at 2.753014 2.662955 0.0900592 0.5631109 0.1004539 0.7504071 0.3169446 0.0783579 1.1493304 168.44430 0.2520491 5.616410e+03 -3.7494588 1.0000000 0.3692803 0.5490174 10229
PAX8|121_at 6.154110 6.179318 -0.0252086 0.0271940 0.0301786 0.1649061 0.1737199 0.0295320 -0.8536033 68.03779 0.3963201 8.831202e+03 -3.9460198 1.0000000 0.4549176 0.6763363 13057
GUCA1A|1255_g_at 2.174932 2.188636 -0.0137049 0.0072459 0.0099237 0.0851226 0.0996177 0.0165882 -0.8261827 63.10013 0.4118135 9.176441e+03 -3.9626743 1.0000000 0.4636132 0.6892642 13313
UBA7|1294_at 5.518904 5.854546 -0.3356413 0.1653478 0.1925323 0.4066298 0.4387851 0.0742176 -4.5223948 66.81156 0.0000257 5.725322e-01 0.2422001 0.4359089 0.0011066 0.0016452 348

We will compute a new rank, based on the (decreasing) significance of the Welch test. We will then successively perform the same analysis by selecting as group of interest the 4 subtypes of leukemia represented by at least 30 patients in Den Boer 2009.

## Update the gene rank table
test.name <- paste(group.of.interest, '.vs.others.sig', sep='')

## Add a column with significances
gene.ranks[,test.name] <- welch.one.vs.others$sig 

## Add a column with significance ranks
gene.ranks[,paste(test.name, ".rank", sep="")] <- 
  rank(-welch.one.vs.others$sig, ties.method='random') 

## Apply the Welch test for the 3 other majority groups
for (group.of.interest in c("Bh", "Bt", "T", "Bo")) {
    print(paste("Selecting differentially expressed genes for", group.of.interest, "versus others"))
    one.vs.others <- sample.labels
    one.vs.others[sample.labels != group.of.interest] <- "o"
    
    ## Test the mean equality between Bo subtype 
    ## and all other subtypes 
    welch.one.vs.others <- t.test.multi(denboer2009.expr, 
                                              one.vs.others,
                                              volcano.plot = FALSE)

    ## Store the volcano plot
    silence <- dev.copy2pdf(file=file.path(dir.figures, 
                                paste(sep="", "denboer2009_welch_", 
                                     group.of.interest,"_vs_others_volcano.pdf")))

    ## Update the gene rank table
    test.name <- paste(group.of.interest, '.vs.others.sig', sep='')
    gene.ranks[,test.name] <- welch.one.vs.others$sig
    gene.ranks[,paste(test.name, ".rank", sep="")] <- 
      rank(-welch.one.vs.others$sig, , ties.method='random')
}
[1] "Selecting differentially expressed genes for Bh versus others"
[1] "Mon Nov 20 01:48:15 2017 - Multiple t-test started"
[1] "Mon Nov 20 01:48:18 2017 - Multiple t-test done"
[1] "Selecting differentially expressed genes for Bt versus others"
[1] "Mon Nov 20 01:48:18 2017 - Multiple t-test started"
[1] "Mon Nov 20 01:48:20 2017 - Multiple t-test done"
[1] "Selecting differentially expressed genes for T versus others"
[1] "Mon Nov 20 01:48:20 2017 - Multiple t-test started"
[1] "Mon Nov 20 01:48:23 2017 - Multiple t-test done"
[1] "Selecting differentially expressed genes for Bo versus others"
[1] "Mon Nov 20 01:48:23 2017 - Multiple t-test started"
[1] "Mon Nov 20 01:48:25 2017 - Multiple t-test done"
## Check the resulting gene table
head(gene.ranks)
                         var var.rank Bo.vs.others.sig Bo.vs.others.sig.rank Bh.vs.others.sig Bh.vs.others.sig.rank Bt.vs.others.sig Bt.vs.others.sig.rank T.vs.others.sig T.vs.others.sig.rank
DDR1|1007_s_at   0.188665207     5367       -3.0695137                  4783        -2.575819                  5878       -4.1639400                 17823       6.4770162                  975
RFC2|1053_at     0.060678822    10648       -4.0052367                 14020        -3.754101                 11957       -0.8968016                  2743       0.3408775                 3688
HSPA6|117_at     0.456321774     1973       -3.7494588                 10229        -1.550261                  3784       -3.7866852                 12298      -4.1725298                19315
PAX8|121_at      0.027842854    17931       -3.9460198                 13057        -4.335538                 21927       -4.1135878                 16868      -2.9537047                 9979
GUCA1A|1255_g_at 0.007850359    22265       -3.9626743                 13313        -3.745846                 11890       -4.1881282                 18337      -4.2746123                20919
UBA7|1294_at     0.190810816     5307        0.2422001                   348         1.403509                  1307       -2.6274822                  5805      -2.9568734                 9988
head(gene.ranks[order(gene.ranks$T.vs.others.sig.rank),])
                       var var.rank Bo.vs.others.sig Bo.vs.others.sig.rank Bh.vs.others.sig Bh.vs.others.sig.rank Bt.vs.others.sig Bt.vs.others.sig.rank T.vs.others.sig T.vs.others.sig.rank
CD19|206398_s_at  2.560138      102         7.357871                     2        -3.567715                 10384       -0.2931564                  2196        86.91159                    1
PAX5|221969_at    2.751943       82         1.921971                   114        -3.992351                 14804        8.3607804                   176        84.66640                    2
NPY|206001_at     4.623218       13        -4.083051                 15471        11.943818                    28        3.2805707                   700        73.77472                    3
HBEGF|38037_at    2.078603      160        -2.016659                  1725         6.102741                   247       -0.5302039                  2392        71.70438                    4
HBEGF|203821_at   2.729874       86        -1.516811                  1148         7.615853                   146        0.9442914                  1447        62.58705                    5
CD79B|205297_s_at 1.982178      178        -3.112284                  5003         3.506884                   631       -2.9861922                  7024        60.44350                    6
tail(gene.ranks[order(gene.ranks$T.vs.others.sig.rank),])
                           var var.rank Bo.vs.others.sig Bo.vs.others.sig.rank Bh.vs.others.sig Bh.vs.others.sig.rank Bt.vs.others.sig Bt.vs.others.sig.rank T.vs.others.sig T.vs.others.sig.rank
SFRS2IP|213850_s_at 0.17858318     5596        -4.026953                 14433        -3.657518                 11105        -4.284142                 20503       -4.347644                22278
ZNF506|221625_at    0.02847691    17747        -4.028374                 14456        -3.662692                 11158        -4.009612                 15125       -4.347719                22279
NA|215378_at        0.01915987    20440        -3.993924                 13818        -4.221549                 18975        -2.662386                  5907       -4.347830                22280
SP110|209762_x_at   0.25051206     4133        -3.457591                  7302        -3.762774                 12046        -3.194937                  7986       -4.347958                22281
CNN2|201605_x_at    0.34210598     2867        -4.071574                 15273        -4.066561                 15978        -3.233027                  8180       -4.347964                22282
ZNF134|215315_at    0.01568122    21260        -3.691499                  9520        -3.210734                  8199        -4.108765                 16777       -4.347974                22283
## Store the gene rank table in a text file (tab-separated columns)
write.table(gene.ranks, file=file.path(dir.results, 'DenBoer_gene_ranks.tab'), 
            sep='\t', quote=F, col.names=NA)
par(mfrow=c(1,2))
## Plot variance against significance of the Welch test 
## for the Bh versus other groups
plot(gene.ranks[,c("var", "Bh.vs.others.sig")], col="grey",
     xlab="Global variance", ylab="Significance of Welch t-test Bh versus others",
     main="Global versus group-specific scores")
# silence <- dev.copy2pdf(file="denboer2009_var_vs_welch_Bh_plot.pdf", width=12, height=12)


## Plot ranks of variance against significance of the Welch test 
## for the 4 different groups
plot(gene.ranks[,c("var.rank", "Bh.vs.others.sig.rank")], 
     xlab="Rank by global variance", ylab="Rank by t-test Bh versus others",
     main="Global versus group-specific ranks",
     col="grey")
**Comparison between global variance and Welch test significance**. The plots highlight the strong difference between the global sorting criterion (variance computed across all samples) and a subtype-specific criterion (Welch test significance for one ALL subtype versus all others). The ranking .

Comparison between global variance and Welch test significance. The plots highlight the strong difference between the global sorting criterion (variance computed across all samples) and a subtype-specific criterion (Welch test significance for one ALL subtype versus all others). The ranking .

# silence <- dev.copy2pdf(file="denboer2009_var_vs_welch_Bh_rank_plot.pdf", width=12, height=12)
par(mfrow=c(1,1))

An obvious observation: the gene-wise variance and the 4 results of Welch tests (one cancer type against all other types) return very different values, and assign gene ranks in completely different ways. This is not surprizing, the variance and the 4 significances indicate different properties of the genes.

  • The variance is a global property of the genes, and has not specifically to be related to a given subtype of cancer.
  • The other ranking criteria should rank genes in different orders by construction, since each one was based on a different group-specific DEG test.

Gene selection will thus have to be adapted to the specific purpose of the classification (e.g. recognizing one particular subtype of ALL).

In summary, so far we defined 5 different gene sorting criteria that could be used to select subsets of features (genes) in order to train a classifier: one general criterion (variance), and 4 group-specific criteria (Welch test results for a comparison between one specific group and all the other ones). A priori, we would expect to achieve better results with group-specific feature selection.

ANOVA-based variable ordering (multiple subtypes)

The ANOVA test allows to select genes whose inter-group variance is significantly higher than the so-called residual variance, i.e. the variance remaining after removing the inter-group variance from the todal variance.

In the context of detection of differentially expressed genes, ANOVA can be thought of as generalization of the Welch test. Indeed, Welch and Student tests were conceived to test the equality of the mean between two groups (for example one subtype of interest and all the other subtypes), whereas ANOVA simultaneously tests the equality of the mean for multiple groups.

\[H_0= \mu_1 = \mu_2= \cdots = \mu_g\]

We could thus use ANOVA to establish a general ranking criterion that would select the genes showing higest differences between ALL subtypes, without specifying a priori which particular subtypes have to be different.

In a first time, we will apply the ANOVA test to one arbitrarily selected gene. We will then see how to run this test on each row of the expression matrix.

Some remarks about the implementation.

  1. In contrast with the Welch test, which was a 2-groups test, ANOVA can be used to compare multiple groups in a single analysis. For ANOVA, we will use the original sample labels (with all the ALL subtypes explicitly named), rather than the one.vs.other vector that we created for 2-groups analysis.

  2. We will run a single-factor ANOVA, with gene expression as values, and sample labels as grouping.

  3. The R methods aov() and anova() take as input a data frame with the values (gene expression values) in the first column, and the groupings (sample labels) in the second one.

  4. R proposes two methods for the ANOVA test. The aov() function automatically fits a linear model and runs the anova test. However it is conceived for balanced groups (all groups should have smiliar effectives), which is not our case (some of the ALL subtypes have very few samples). With our data, it returns a warning “Estimated effects must be unbalanced”. Hereafter we will run both approaches to illustrate their implementation., but the second one (anova()) is the most flexible.

  5. Even though anova() can handle unbalanced groups, we should keep in mind that the power of the test depends on the fact that we dispose of a sufficient number of samples per group. It might thus be wise to restrict the analysis to the groups containing a minimum number of samples (for example at least 8).

g <- 1234 ## Select an arbitrary gene

## Build a data frame with gene expression values in the first column, 
## and sample labels in the second column.
g.expr <- unlist(denboer2009.expr[g,]) ## Select the expression profile for this gene
g.for.anova <- data.frame("expr"=g.expr, "group"=sample.labels)

## Run the aov() method to check the warnings
g.aov.result <- aov(formula = expr ~ group, data = g.for.anova)
print(g.aov.result)
Call:
   aov(formula = expr ~ group, data = g.for.anova)

Terms:
                   group Residuals
Sum of Squares  0.826310  6.654919
Deg. of Freedom       10       179

Residual standard error: 0.1928168
Estimated effects may be unbalanced
## We thus try the indirect approach: fit a linear model and run anova on it.
g.anova.result <- anova(lm(formula = expr ~ group, data = g.for.anova))
print(g.anova.result)
Analysis of Variance Table

Response: expr
           Df Sum Sq  Mean Sq F value Pr(>F)  
group      10 0.8263 0.082631  2.2226 0.0184 *
Residuals 179 6.6549 0.037178                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Extract the p-value from the ANOVA result
attributes(g.anova.result)
$names
[1] "Df"      "Sum Sq"  "Mean Sq" "F value" "Pr(>F)" 

$row.names
[1] "group"     "Residuals"

$class
[1] "anova"      "data.frame"

$heading
[1] "Analysis of Variance Table\n" "Response: expr"              
pval <- as.numeric(unlist(g.anova.result)["Pr(>F)1"])
print(pval)
[1] 0.01840098
## Compute the e-value from this p-value
eval <- pval * nrow(denboer2009.expr)
print(eval)
[1] 410.029
## Summarise the result in a vector
g.anova.summary <- data.frame("g"=g, 
                     "name"=row.names(denboer2009.expr[g,]),
                     "pval"=pval,
                     "eval"=eval,
                     "sig"=-log(eval, base=10))
kable(g.anova.summary, caption = "Anova result for an aribtrary gene. ")
Anova result for an aribtrary gene.
g name pval eval sig
1234 PEX19|201706_s_at 0.018401 410.029 -2.612815

How to interpret the ANOVA result?

TO BE COMPLETED

Exercise: running ANOVA on each gene of the expression table

Exercise: Write a function that applies anova() on each row of a data.frame and returns the relevant statistics.

Reducing data dimensionality by Principal Component Analysis (PCA)

Before starting the proper process of supervised classification, we can apply a method called Principal Component Analysis (PCA) to evaluate the repartition of the information between the mutiple variables, and to inspect the “intrinsic” structure of the data, i.e. the structure inherent to the numbers in the data table, irrespective of the labels (cancer subtypes) attached to the various samples.

Purpose of PCA

We can do the exercise of extending this 2-dimensional plot to a 3-dimensional plot, where the third dimension represents the expression level of a third gene. With an effort of imagination, we can mentally extend this 3D plot to a 4-dimensional plot, where each dimension would represent a different gene. It is likely that the groups of genes will progressively become more separated as the number of dimension increases. However, for the sake of graphical representation, it is difficult to work with more than 2 (or at most 3) dimensions.

The purpose of Principal Component Analysis (PCA) is to capture the largest part of the variance of a data set with a minimal number of dimensions.

Applying PCA transformation with stats::prcomp()

The R method stats::prcomp() performs a PCA transformation of an input table. We can feed it with the expression table, but we need to transpose it first (using the R function t()), in order to provide our objects of interest (the samples) as rows, and the variables (genes) as columns.

## load the stats library to use the princomp() and prcomp() function
library(stats) 

## Perform the PCA transformation
expr.prcomp <- prcomp(t(denboer2009.expr))

## Analyze the content of the prcomp result: 
## the result of the method prcomp() is an object 
## belonging to the class "prcomp"
class(expr.prcomp) 
[1] "prcomp"
## Get the field names of the prcomp objects
names(expr.prcomp) 
[1] "sdev"     "rotation" "center"   "scale"    "x"       
## Get the attributes of the prcomp objects
attributes(expr.prcomp) 
$names
[1] "sdev"     "rotation" "center"   "scale"    "x"       

$class
[1] "prcomp"

Repartition of the standard deviation along the components

A first information is the repartition of the variance (or its squared root, the standard deviation) between the components, which can be displayed on a plot.

plot(expr.prcomp, main='Den Boer (2009), 
     Variance  per component', xlab='Component', col="#BBDDFF")

The standard deviation barplot (Figure~) highlights that the first component captures more or less twice as much standard deviation of the whole dataset as the second one. We can measure the relative importance of the standard deviations.

## Get the standard deviation and variance per principal component
sd.per.pc <- expr.prcomp$sdev
var.per.pc <- sd.per.pc^2

## Display the percentage of total variance explained by each 
sd.per.pc.percent <- sd.per.pc/sum(sd.per.pc)
var.per.pc.percent <- var.per.pc/sum(var.per.pc)
barplot(var.per.pc.percent[1:10], main='Den Boer (2009), Percent of variance  per component', xlab='Component', ylab='Percent variance', col='#BBDDFF')

silence <- dev.copy2pdf(file="pca_variances.pdf", width=7, height=5)

Analysis of the first versus second component

We can generate a biplot, where each sample appears as a dot, and the X and Y axes respectively represent the first and second components of the PCA-transformed data. The R function stats::biplot() automatically generates the biplot, but the result is somewhat confusing, because the biplot displays the whole sample labels.

## We do not run this, but we provide the instructions just for information
biplot(expr.prcomp,var.axes=FALSE,
       panel.first=grid(col='black'), 
       main=paste('PCA; Den Boer (2009); ',
       ncol(denboer2009.expr), 'samples *', 
       nrow(denboer2009.expr), 'genes', sep=' '), 
       xlab='First component', ylab='Second component')

We will rather generate a custom plot, with labels and colors indicating the subtype of ALL.

## Plot components PC1 and PC2
plot(expr.prcomp$x[,1:2],
       type='n',
       panel.first=c(grid(col='black'), abline(a=30,b=1.3, lwd=3, lty="dashed", col="red")),
         main=paste('PCA; Den Boer (2009); ',
           ncol(denboer2009.expr), 'samples *', nrow(denboer2009.expr), 'genes', sep=' '), 
         xlab='PC1', ylab='PC2')
text(expr.prcomp$x[,1:2],labels=sample.labels,col=sample.colors,pch=0.5)
legend('bottomleft',col=group.colors, 
         legend=names(group.colors),pch=1,cex=0.6,bg='white',bty='o')

silence <- dev.copy2pdf(file="PC1_vs_PC2.pdf", width=8, height=8)

The coloring and cancer-type labels highlight a very interesting property of the PCA result: in the plane defined by the two first components of the PCA-transformed data, we can draw a straight line that perfectly separates T-ALL samples from all other subtypes (red dashed line on the plot). All T-ALL cells are grouped in one elongated cloud on the upper left side of the plot. The other cloud seems to contain some organization as well: subtypes are partly intermingled but there are obvious groupings.

Analysis of the second versus third component

The following figure shows that the second and third components capture information related to the cancer subtypes: the third component separates quite well TEL-AML1 (top) from hyperdiploid (bottom) samples, whereas the pre-B have intermediate values. The separation is however less obvious than the T-ALL versus all other subtypes that we observed in the two first components.

## Plot components PC2 and PC3
plot(expr.prcomp$x[,2:3],
       col=sample.colors,
       type='n',
       panel.first=grid(col='black'), 
         main=paste('PCA; Den Boer (2009); ',
           ncol(denboer2009.expr), 'samples *', nrow(denboer2009.expr), 'genes', sep=' '), 
         xlab='PC2', ylab='PC3')
text(expr.prcomp$x[,2:3],labels=sample.labels,col=sample.colors,pch=0.5)     
legend('bottomleft',col=group.colors, 
         legend=names(group.colors),pch=1,cex=0.6,bg='white',bty='o')

silence <- dev.copy2pdf(file="PC2_vs_PC3.pdf", width=8, height=8)

Exercise on PCA

Generate plots of a few additional components (PC4, PC5,…) and try to evaluate if they further separate subtypes of cancers.

Discussion of the PCA results

In the “historical” dataset from Golub (1999), three sample groups (AML, T-ALL and B-ALL) appeared almost perfectly separated by a simple display of the two first components of the PCA-transformed data. The dataset from DenBoer (2009) is less obvious to interpret, due to the nature of the data itself: firstly, all the samples come from the same cell type (lymphoblasts), whereas the main separation of Golub was between myeloblasts and lymphoblasts. Secondly, Den Boer and co-workers defined a wider variety of subtypes. However, we see that a simple PCA transformation already reveals that the raw expression data is well structured. It is quite obvious that we will have no difficulty to find a signature discriminating T-ALL from other ALL subtypes, since T-ALL are already separated on the plane of the two first components. We would however like to further refine the diagnostics, by training a classifier to discriminate between the other subtypes as well.

Note that until now we did not apply any training: PCA transformation is a “blind” (unsupervised) approach, performing a rotation in the variable space without using any information about pre-defined classes. Since we dispose of such information for the 190 samples of the training set, we can use a family of other approaches, generically called supervised classification, that will rely on class membership of the training samples in order to train a classifier, which will further be used to assign new samples to the same classes.

Selecting principal components as predictor variables}

Optionnally, principal components can be used as predictor variables for supervised classification. The advantage is that the first components supposedly concentrate more information than any individual variables, and the classifier is thus in principle able to achieve a better discrimination with less variables (and thus be less prone to over-fitting). However, we should keep in mind that variables showing a high variance may differ from the most discriminating ones.


Linear Discriminant Analysis (LDA)

Training the classifier

We will define a vector which will associate each individual (sample) to either of two labels: Bo for pre-B ALL samples, and “o” for all the other subtypes.

## Define the labels for a 2-groups classifier
group.of.interest <- "Bo"
one.vs.others<- sample.labels
one.vs.others[sample.labels != group.of.interest] <- "o"
print(table(one.vs.others))
one.vs.others
 Bo   o 
 44 146 

We can now use our dataset to train a linear discriminant classifier, using the R lda() function. In a first time, we will use the whole dataset to train the classifier.

Warning: the step below costs time, memory, and is expected to give very bad results. We run it only for didactic purpose.

Indeed, the goal of this tutorial is to introduce the problems of overfitting due to the over-dimensionality of the variable space, and to show how this problem can be circumvented by selecting a relevant subset of the variables before running the supervised classification.

Feature selection is particularly important when - as in the current case - the dataset contains many more variables (>22.000 genes) than objects (190 samples). In the first trial below we will intently ignore this over-dimensionality of the dataset, and measure the consequences of this bad choice. In the following sections, we will show how to resolve the problem of over-dimensionality by selecting subsets of variables.

## Load the MASS library, which contains the lda() function
library(MASS)

## Train a classifier
one.vs.others.lda.allvars <- lda(t(denboer2009.expr),one.vs.others,CV=FALSE) 

The lda() function issues a warning, indicating that there is a collinearity between some variables. Let us ignore this warning for the time being (this problem will be solved below by selecting datasets with less variables).

Let us inspect the result of lda().

attributes(one.vs.others.lda.allvars)
$names
[1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"       "call"   

$class
[1] "lda"

Let us inspect the contents of the lda result.

print(one.vs.others.lda.allvars$counts)
 Bo   o 
 44 146 
print(one.vs.others.lda.allvars$prior)
       Bo         o 
0.2315789 0.7684211 
print(one.vs.others.lda.allvars$svd)
[1] 11.17306
  • counts indicates the number of samples belonging to the different training classes (44 pre-ALL B with label “Bo”, and 146 others with label “o”)

  • prior gives the prior probabilities of each class, i.e. the probability for a sample to belong to either of the classes (in our case: 23% Bo, 77% others). By default, priors are simply estimated from the proportion of objects in the training classes, but an experienced user might decide to impose different priors based for example on additional information about the population.

  • svd, the singular value, indicates the ratio of the between-group and within-group standard deviations. We obtain a value of 11.17, suggesting that the classifier will be quite good at discriminating the Bo group from the other samples.

You can inspect yourself the other attributes (N, lev, call)

Using the classifier to assign elements to classes

After having trained the classifier, we can use it to classify new samples. A naive - but wrong - way to test the efficiency of our classifier would be to use it to predict the class of our training objects.

The R predict() function is used to assign a class to a new dataset, based on an object generated by the lda() function. We use it here in a foolish way, to predict class of the training set itself (we will show below the problem).

## Non-recommended usage of the predict() function, since we assign a class to the training objects themselves
predict.lda.allvars <- predict(object=one.vs.others.lda.allvars, newdata = t(denboer2009.expr))

attributes(predict.lda.allvars)
$names
[1] "class"     "posterior" "x"        
## Posterior probabilities (print the 10 first rows only)
print(predict.lda.allvars$posterior[1:10,])
                   Bo         o
GSM338666 0.028867018 0.9711330
GSM338667 0.061919189 0.9380808
GSM338668 0.077322660 0.9226773
GSM338669 0.009379686 0.9906203
GSM338670 0.064386005 0.9356140
GSM338671 0.031259672 0.9687403
GSM338672 0.198596805 0.8014032
GSM338673 0.008065270 0.9919347
GSM338674 0.014430610 0.9855694
GSM338675 0.012408334 0.9875917
## Predicted classes
print(predict.lda.allvars$class)
  [1] o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  Bo o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  Bo o  o  o  o  o  Bo o  o 
 [99] o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  o  Bo Bo Bo Bo o  o  o  o  Bo Bo Bo Bo Bo o  Bo Bo o  Bo Bo Bo Bo o  o  Bo Bo Bo Bo Bo o  o  Bo Bo Bo o  Bo o  o  o  Bo o  Bo Bo o  Bo Bo Bo Bo Bo Bo o  Bo Bo Bo
Levels: Bo o

The predictor computed the posterior probability of each object (sample) to belong to the two training classes (“Bo”" or “others”“, respectively), and assignedeach object to one of these classes based on the highest posterior.

We can now compare training and predicted classes, and measure the hit rate, i.e. the proportion of objects assigned to the correct class.

## Print the contingency table
table(one.vs.others, predict.lda.allvars$class)
             
one.vs.others  Bo   o
           Bo  31  13
           o    8 138
## Compute the hit rate
hits <- sum(one.vs.others == predict.lda.allvars$class)
errors <- sum(one.vs.others != predict.lda.allvars$class)
total <- hits + errors
(hit.rate.internal <- hits / total)
[1] 0.8894737
(error.rate.internal <- errors / total)
[1] 0.1105263

Beware: what we did above is absolutely forbidden: we used a training set to train a classifier, and we assessed the precision of this classifier based on the training set itself.


Evaluating the classifier by cross-validation

K-fold cross-validation

A fair evaluation of a classifier requires to use separate datasets for training and testing, respectively.

A simple way to achieve this is the 2-fold cross-validation to randomly split the original set in two equal parts, and use one for training and the other one for testing. Since the left-out subset has not been used for training, the evaluation is indicative of the expected accuracy when the classifier will be used to predict the class of new samples. However, a classifier trained with only half of the samples will be suboptimal. Splitting the dataset in two equal parts would thus lead to a sub-esimation of its predictive power.

A more refined approach, called k-fold cross-validation, is to randomly split the data set in \(k\) subsets. We can discard one subset, train a classifier with the \(k-1\) remaining subsets, and predict the class of the \(k^{th}\) subset.

Leave-One-Out (LOO) cross-validation

The Leave-one-out (LOO) is a particular mode of cross-validation, where each element of the training set will be in turn discarded (left out) to evaluate a classifier trained with all the other elements. This is the most economical way to perform a k-fold cross-validation. Actually, LOO can be understood as “n-fold” cross-validation, if \(n\) is the number of samples of the training set.

The R function lda() includes a very convenient option (cv, for cross-validation) that automatically runs LOO cross-validation.

## Run a Leave-one-out (LOO) cross-validation
one.vs.others.lda.allvars.loo <- lda(t(denboer2009.expr),one.vs.others,CV=TRUE) 

table(one.vs.others, one.vs.others.lda.allvars.loo$class)
             
one.vs.others Bo  o
           Bo 20 24
           o  75 71
## Compute the hit rate
(hit.rate.loo <- sum(one.vs.others == one.vs.others.lda.allvars.loo$class) / total)
[1] 0.4789474
(error.rate.loo <- 1 - hit.rate.loo)
[1] 0.5210526

We have an obvious contradiction:

  1. The (biased) internal validation shows that, after training, the classifier is able to re-assign 88% of the samples to the right class, among pre-B ALL and the other subtypes.

  2. An unbiased validation based on the Leave-one-out test shows that the apparent accuracy is a trap: when the classifier has to assign a class to a sample not seen before (because it was left out during the training), the hit rate drops to 53%.

Random expectation for the hit rate

We could push further our investigation, by estimating the random expectation for the hit rate. We can test this with a very simple test: let us imagine a random classifier, which would assign labels “Bo” or “o” to each sample according to their prior probabilities (23% and 77% resp.).

A very simple way to simulate such a random classifier is to permute the vector of training labels, and to compute the hit rate between permuted and original labels. We can repeat this random sampling \(10.000\) times and plot an histogram, which will indicate not only the mean behaviour but also the fluctuation around it.

We can also compute the theoretical expectation for the random hit rate.

## Run 10000 permutation tests to estimate the random expectation for the hit rate
random.hit.rates <- vector()
for (rep in 1:10000) {
  random.hit.rates <- append(random.hit.rates, sum(one.vs.others == sample(one.vs.others)) / total)
}
(random.hit.rates.mean <- mean(random.hit.rates))
[1] 0.6438768
## Compute the theoretical value for the random expectation
prior <- as.vector(table(one.vs.others))/length(one.vs.others)
(hit.rate.expect <- sum(prior^2))
[1] 0.6440997
## Draw the histogram
hist(random.hit.rates, breaks=(0:total)/total, col="lightgrey", 
     freq=TRUE,
     main="Hit rate analysis",
     xlab="Hit rate",
     ylab="Frequency")
arrows(x0=hit.rate.loo, y0 = 1000, x1=hit.rate.loo, y1=100, 
       col="darkgreen", lwd=2, code=2, , length=0.2, angle=20)
arrows(x0=hit.rate.internal, y0 = 1000, x1=hit.rate.internal, y1=100, 
       col="red", lwd=2, code=2, , length=0.2, angle=20)
arrows(x0=hit.rate.expect, y0 = 1000, x1=hit.rate.expect, y1=100, 
       col="darkblue", lwd=2, code=2, , length=0.2, angle=20)

legend("topleft", legend=c("random", "random expectation", "LOO", "internal"), 
       col=c("grey", "darkblue", "darkgreen", "red"), 
       lwd=c(4,2,2,2))

The problem of over-fitting

TO BE COMPLETED

Selecting a subset of the variables (feature selection)

In a first attempt to discriminate genes from the Bo class (pre-B-ALL), we will select top-ranking genes according to their significance of the Welch test (Bo versus others).

## Test the mean equality between Bo subtype and all other subtypes 
welch.Bo.vs.others <- t.test.multi(denboer2009.expr, 
                                   one.vs.others,
                                   volcano.plot = FALSE)
[1] "Mon Nov 20 03:56:53 2017 - Multiple t-test started"
[1] "Mon Nov 20 03:57:03 2017 - Multiple t-test done"
## Print the names of the 20 top-ranking genes.
## Note that these are not sorted !
print(rownames(welch.Bo.vs.others[welch.Bo.vs.others$rank <= 20,]))
 [1] "ANXA5|200782_at"     "PGRMC1|201121_s_at"  "CBX1|201518_at"      "ARL4C|202206_at"     "ARL4C|202207_at"     "RYK|202853_s_at"     "PDE4B|203708_at"     "GRK5|204396_s_at"    "GPRASP1|204793_at"   "CD19|206398_s_at"    "CD3G|206804_at"      "VAT1|208626_s_at"    "TUSC3|213423_x_at"  
[14] "RYK|214172_x_at"     "NDFIP1|217800_s_at"  "BAG3|217911_s_at"    "GLT25D1|218473_s_at" "NOLA1|219110_at"     "BCL11B|219528_s_at"  "NOL11|221970_s_at"  
## Sort gene names by Welch sig
welch.Bo.vs.others.sorted <- welch.Bo.vs.others[order(welch.Bo.vs.others$sig, decreasing=TRUE),]
sorted.names <- rownames(welch.Bo.vs.others.sorted)
print(sorted.names[1:20])
 [1] "GPRASP1|204793_at"   "CD19|206398_s_at"    "NDFIP1|217800_s_at"  "RYK|202853_s_at"     "VAT1|208626_s_at"    "ANXA5|200782_at"     "PDE4B|203708_at"     "ARL4C|202206_at"     "BAG3|217911_s_at"    "GLT25D1|218473_s_at" "NOLA1|219110_at"     "GRK5|204396_s_at"    "PGRMC1|201121_s_at" 
[14] "BCL11B|219528_s_at"  "ARL4C|202207_at"     "TUSC3|213423_x_at"   "CBX1|201518_at"      "NOL11|221970_s_at"   "RYK|214172_x_at"     "CD3G|206804_at"     
welch.Bo.vs.others[sorted.names[0:20], c("E.value","sig")]
                         E.value      sig
GPRASP1|204793_at   2.504639e-09 8.601255
CD19|206398_s_at    4.386607e-08 7.357871
NDFIP1|217800_s_at  5.072931e-08 7.294741
RYK|202853_s_at     9.678125e-08 7.014209
VAT1|208626_s_at    1.017503e-07 6.992464
ANXA5|200782_at     1.472846e-07 6.831843
PDE4B|203708_at     1.725215e-07 6.763157
ARL4C|202206_at     8.179158e-07 6.087291
BAG3|217911_s_at    8.402094e-07 6.075612
GLT25D1|218473_s_at 1.379830e-06 5.860174
NOLA1|219110_at     1.605555e-06 5.794375
GRK5|204396_s_at    1.616617e-06 5.791393
PGRMC1|201121_s_at  2.280969e-06 5.641881
BCL11B|219528_s_at  2.991815e-06 5.524065
ARL4C|202207_at     4.193405e-06 5.377433
TUSC3|213423_x_at   7.352632e-06 5.133557
CBX1|201518_at      8.462515e-06 5.072501
NOL11|221970_s_at   1.147071e-05 4.940410
RYK|214172_x_at     1.355142e-05 4.868015
CD3G|206804_at      1.359650e-05 4.866573

For the sake of curiosity, we can draw a plot with the two top-ranking genes in order to see how well they separate our group of interest (pre-B all) from all the other ALL subtypes.

## Print a plot with the two top-ranking genes with the Bo versus other Welch test
g1 <- sorted.names[1]
g2 <- sorted.names[2]
x <- as.vector(as.matrix(denboer2009.expr[g1,]))
y <- as.vector(as.matrix(denboer2009.expr[g2,]))
plot(x,y,
     col=sample.colors,
     type="n",
     panel.first=grid(col="black"), 
     main="Den Boer (2009), 2 top-ranking genes for Welch test", 
     xlab=paste("gene", g1), ylab=paste("gene", g2))
text(x, y,labels=one.vs.others,col=sample.colors,pch=0.5)
legend("bottomright",col=group.colors, 
         legend=names(group.colors),pch=1,cex=0.6,bg="white",bty="o")
**Expression values for the two top-ranking genes from the Welch test (one ALL subtype versus others).** The labels indicate the group used for the Welch test (o stands for "others"), the color indicates the subtype of cancer.

Expression values for the two top-ranking genes from the Welch test (one ALL subtype versus others). The labels indicate the group used for the Welch test (o stands for “others”), the color indicates the subtype of cancer.

silence <- dev.copy2pdf(file=paste(sep="", group.of.interest, "_vs_others_Welch_g1_g2.pdf"), width=8, height=8)

The plot shows a rather good general trends: Bo samples are concentrated in the top-left side of the plot. They have thus generally a low level of expression for the first gene (GPRASP1) and a high level for the second one (CD19).

However, none of the two top-ranking genes is sufficiently discriminant by its own: a selection based on GPRASP1 alone (a vertical boundary on the plot) would fail to distinguish Bo genes from the other ones. The same would be true for a selection based on CD19 alone (horizontal boundary). We can push this experiment further by plotting the third and fourth top-ranking genes according to Welch test.

## Print a plot with the third and fourth top-ranking genes with the Bo versus other Welch test
g3 <- sorted.names[3]
g4 <- sorted.names[4]
x <- as.vector(as.matrix(denboer2009.expr[g3,]))
y <- as.vector(as.matrix(denboer2009.expr[g4,]))

plot(x,y,
     col=sample.colors,
     type="n",
     panel.first=grid(col="black"), 
     main="Den Boer (2009), top-ranking genes 3 and 4 for Welch test", 
     xlab=paste("gene", g3), ylab=paste("gene", g4))
text(x, y,labels=one.vs.others,col=sample.colors,pch=0.5)
legend("bottomright",col=group.colors, 
#         legend=names(denboer2009.group.colors),pch=1,cex=0.6,bg="white",bty="o")
         legend=names(group.colors),pch=1,cex=0.6,bg="white",bty="o")

silence <- dev.copy2pdf(file=paste(sep="", group.of.interest, "_vs_others_Welch_g3_g4.pdf"), width=8, height=8)

Here again, we would not be able to draw a boundary that would perfectly separate the samples of our group of interest from all other ones.

The principle of discriminant analysis is to establish a much more refined criterion to discriminate objects (samples in our cases), by computing a score as a linear combination of the variables. A priori, we could feed the linear discriminant classifier (the R lda() function) with all the variables of the data table (all the genes). However, this would probably return poor results, because when the number of features (genes) is larger than the number of individuals (samples), the classifier will tend to be overfit to the circumstantial particularities of the training objects, and it will fail to classify new objects.

It is thus very important to select a relevant number of variables before training a classifier. It is almost an art to select the most appropriate selection criterion and the optimal number of variables. In this course we will only present some basic methods.

## Load the library containing the linear discriminant analysis function
library(MASS)

## Select the 20 top-ranking genes sorted by significance of the Welch test
top.variables <- 20
selected.genes <- sorted.names[1:top.variables]

## Train the classifier
one.vs.others.lda.classifier <- lda(t(denboer2009.expr[selected.genes,]),one.vs.others,CV=FALSE) 

Evaluating the hit rate by Leave-One-Out (LOO) cross-validation

The lda() function includes an option to automatically run a cross-validation test (cv), which relies on the Leave-one-out procedure (LOO).

The principle of this test is to temporarily discard one element from the training set (left out element), to train a classifier with the remaining elements, and to use the trained classified to predict the group of the left out element. The procedure is applied to each element in turn. This cross-validation ensures a non-biased evaluation of the rates of correct and incorrect classification.

## Use the MASS:lda() function with the cross-validation option
one.vs.others.lda.loo <- lda(t(denboer2009.expr[selected.genes,]),one.vs.others,CV=TRUE) 

## Collect the LOO prediction result in a vector
one.vs.others.loo.predicted.class <- as.vector(one.vs.others.lda.loo$class)
table(one.vs.others.loo.predicted.class)
one.vs.others.loo.predicted.class
 Bo   o 
 50 140 
## Build a confusion table of known versus predicted class
(one.vs.others.lda.loo.xtab <- table(one.vs.others, one.vs.others.loo.predicted.class))
             one.vs.others.loo.predicted.class
one.vs.others  Bo   o
           Bo  37   7
           o   13 133
## Build a more general contingency table to check the assignation 
## of each known group to "Bo" (pre-B ALL) or "o" (other). 
## This permits to perform a more detailed analysis of the 
## misclassifications.
(one.vs.others.lda.loo.xtab2 <- table(sample.labels, one.vs.others.loo.predicted.class))
             one.vs.others.loo.predicted.class
sample.labels Bo  o
          Bc   2  2
          Bch  1  0
          BE   0  1
          BEp  1  7
          BEs  2  2
          Bh   5 39
          BM   0  4
          Bo  37  7
          Bt   0 43
          Bth  1  0
          T    1 35
## Display the contingency table as a heat map
library(lattice)
levelplot(one.vs.others.lda.loo.xtab)

## Compute the hit rate
hits <- one.vs.others == one.vs.others.loo.predicted.class
errors <- one.vs.others  != one.vs.others.loo.predicted.class

## Compute the number of hits 
## (we need to omit NA values because LDA fails to assign a group to some objects).
(nb.hits <- sum(na.omit(hits)))
[1] 170
(nb.pred <- length(na.omit(hits)))
[1] 190
(hit.rate <- nb.hits / nb.pred )
[1] 0.8947368

Multi-group classification

Linear discriminant analysis also enables to direcly classify a set of individuals into multiple groups. We could for example use the original sample labels to train a classifier, which would learn to recognize the different subtypes of ALL.

## Use the MASS:lda() function with the cross-validation option
multigroup.lda.loo <- lda(t(denboer2009.expr[selected.genes,]),sample.labels,CV=TRUE) 

## Collect the LOO prediction result in a vector
multigroup.loo.predicted.class <- as.vector(multigroup.lda.loo$class)
table(multigroup.loo.predicted.class)
multigroup.loo.predicted.class
 Bc BEp BEs  Bh  BM  Bo  Bt   T 
  2   9   3  45   2  50  40  36 
## Build a confusion table of known versus predicted class
(multigroup.lda.loo.xtab <- table(sample.labels, multigroup.loo.predicted.class))
             multigroup.loo.predicted.class
sample.labels Bc BEp BEs Bh BM Bo Bt  T
          Bc   0   0   0  1  0  3  0  0
          Bch  0   0   0  0  0  0  0  0
          BE   0   0   0  0  0  0  0  0
          BEp  0   8   0  0  0  0  0  0
          BEs  0   1   2  0  0  1  0  0
          Bh   0   0   1 40  0  2  1  0
          BM   0   0   0  0  2  1  1  0
          Bo   1   0   0  1  0 42  0  0
          Bt   1   0   0  3  0  1 38  0
          Bth  0   0   0  0  0  0  0  0
          T    0   0   0  0  0  0  0 36
## Display the contingency table as a heat map
library(lattice)
levelplot(multigroup.lda.loo.xtab)

## Compute the hit rate
hits <- sample.labels == multigroup.loo.predicted.class
errors <- sample.labels != multigroup.loo.predicted.class

## Compute the number of hits 
## (we need to omit NA values because LDA fails to assign a group to some objects).
(nb.hits <- sum(na.omit(hits)))
[1] 168
(nb.pred <- length(na.omit(hits)))
[1] 187
(hit.rate <- nb.hits / nb.pred )
[1] 0.8983957

Random expectation for the hit rate

We could compute the random expectation of the hit rate. Let us imagine a tricky classifier that woudl assign the labels randomly, for two groups of equal sizes. Each object would have equal chances to be assigned to it correct class, and the random assignation would thus gvive 50% of correct classification.

Random expectation for a 2-groups permutation of training labels

However, in our case the training classes are unbalanced: \(n_{GOI}=44\) samples belong to the group of interest, and \(n_{others}\) to the other groups. If samples are re-assigned randomly with proportions equal to their prior frequencies, the assignation probabilities are:

\[p_{GOI}= n_{GOI}/n = 44/190=0.23\]

\[p_{other}= n_{GOI}/n = 146/190=0.77\]

A random assignation would thus assign correctly \(23\%\) of pre-ALL samples, and \(77\%\) of the samples of other subtypes would be correctly assigned to the group “others”. We can thus compute the random expectation for the hit rate.

\[\hat{hits}_{GOI} = n_{GOI} \cdot p_{GOI} + n_{others} \cdot p_{others} = 44 \cdot 0.232 + 146 \cdot 0.768 = 122.54\]

\[\hat{hit rate}_{GOI} = \hat{hits}_{GOI} / n = 122.54/190 = 0.64 = 64%\]

Note the same result is obtained directly by taking the sum of squares of prior probabilities:

\[\hat{hits}_{GOI} = p_{GOI}^2 + p_{others}^2 = 0.232^2 + 0.768^2 = 0.64 = 64%\]

n <- length(sample.labels)
n.goi <- sum(sample.labels == group.of.interest)
n.others <- sum(sample.labels != group.of.interest)
prior <- c("goi"=n.goi/n,
           "others" = n.others/n)
print(prior)
      goi    others 
0.2315789 0.7684211 
exp.hits <- c("goi"=(n.goi*prior["goi"]),
              "other"=(n.others*prior["others"]))
print(exp.hits)
     goi.goi other.others 
    10.18947    112.18947 
exp.hit.rate <- sum(prior^2)
print(exp.hit.rate)
[1] 0.6440997

A random classifier which would respect the balance of the training groups (and thus assign labels according to the group prior probabilities) would thus already achieve a hit rate of \(65\%\).

Random expectation for a multi-groups permutation of training labels

We can extend the above idea to estiamte the expected proportion of correspondances between permuted labels in a multi-group configuration.

multi.samples.per.class <- unlist(table(sample.labels))
multi.prior <- (multi.samples.per.class)/sum(multi.samples.per.class)
(multi.expect.hit.rate <- sum(multi.prior^2))
[1] 0.1975623

The random hit rate for permuted labels is the sum of square priors.

\[\hat{hit rate} = \sum_{i=1}^{g}{p_i}\]

Where \(g\) is the number of groups (ALL subtypes) and \(p_i\) the prior probability of the \(i- {th}\) group.


Exercise: training a classifier with permuted labels

  1. Use the R function sample() to randomly permute the training labels (store the result in a vector called “sample.labels.perm”), and test the hit rate of a classifier trained with these randomized labels.

  2. Use the R function sample() to randomly permute the values of the input table (store the result in a data frame called “expr.perm”), and test the hit rate of a classifier trained with these randomized data.

Solutions

We will successively run the permutation test with two configurations: 1. multi-group classification (using the original sample labels) 2. two-group classification (pre-B all versus others)

Label permutation test for multi-group classification

## Permute the training labels
sample.labels.perm <- as.vector(sample(sample.labels))

## Compare original training groups and permuted labels.
table(sample.labels, sample.labels.perm)
             sample.labels.perm
sample.labels Bc Bch BE BEp BEs Bh BM Bo Bt Bth  T
          Bc   0   0  0   0   0  2  0  1  1   0  0
          Bch  0   0  0   0   0  0  0  1  0   0  0
          BE   0   0  0   0   0  1  0  0  0   0  0
          BEp  0   0  0   0   1  2  0  1  2   0  2
          BEs  0   0  0   0   0  0  0  1  2   0  1
          Bh   2   0  0   4   1 11  1 13  9   0  3
          BM   0   0  1   0   0  1  0  1  1   0  0
          Bo   0   1  0   1   0  7  1  7 12   0 15
          Bt   2   0  0   2   1  9  2  9  9   0  9
          Bth  0   0  0   0   0  0  0  0  1   0  0
          T    0   0  0   1   1 11  0 10  6   1  6

We can now run the leave-one-out (LOO) cross-validation with the permuted labels.

## Run LDA in cross-validation (LOO) mode with the permuted labels
lda.loo.labels.perm <- lda(t(denboer2009.expr[selected.genes,]),sample.labels.perm,CV=TRUE) 

## Build a contingency table of known versus predicted class
loo.predicted.class.labels.perm <- as.vector(lda.loo.labels.perm$class)
lda.loo.labels.perm.xtab <- table(sample.labels.perm, loo.predicted.class.labels.perm)
print(lda.loo.labels.perm.xtab)
                  loo.predicted.class.labels.perm
sample.labels.perm Bc Bch BE BEp BEs Bh BM Bo Bt Bth  T
               Bc   0   0  0   0   0  0  1  0  3   0  0
               Bch  0   0  0   0   0  0  0  0  0   0  0
               BE   0   0  0   0   0  0  0  0  0   0  0
               BEp  0   0  0   0   0  3  0  1  3   1  0
               BEs  0   0  0   0   0  0  0  2  1   0  1
               Bh   1   1  0   2   1  9  2  8 14   0  6
               BM   1   0  0   0   0  0  0  2  1   0  0
               Bo   2   0  0   2   2 11  1  7  8   0 11
               Bt   2   0  1   1   1  8  0 12 13   0  5
               Bth  0   0  0   0   0  0  0  0  0   0  0
               T    1   0  0   0   2  7  0 10 10   0  6

The particular values sould vary at each trial, since the sample labels are permuted at random.

## Compute the number of hits 
## (we need to omit NA values because LDA fails to assign a group to some objects).
hits.label.perm <- sample.labels.perm == loo.predicted.class.labels.perm
(nb.hits.label.perm <- sum(na.omit(hits.label.perm))) ## This gives 25 this time, but should give different results at each trial
[1] 35
(nb.pred.label.perm <- length(na.omit(hits.label.perm))) ## This should give 187
[1] 187
(hit.rate.label.perm <- nb.hits.label.perm / nb.pred.label.perm ) ## This gives 0.13 this time, should give different results at each trial
[1] 0.1871658

It might seem surprizing to obtain an apparently decent hit rate (~75%) with the permutation test. Indeed, since sample labels were assigned at random, the classifer sould not be able to learn anything.

Label permutation test for two-group classification

## Permute the training labels
sample.labels.perm.2gr <- as.vector(sample(one.vs.others))

## Compare original training groups and permuted labels.
table(one.vs.others, sample.labels.perm.2gr)
             sample.labels.perm.2gr
one.vs.others  Bo   o
           Bo   8  36
           o   36 110
## Compute the percent of correspondence between training groups 
## and permuted labels
permuted.equal <- sum(diag(table(one.vs.others, sample.labels.perm.2gr)))
(permuted.equal.rate <- permuted.equal/length(one.vs.others))
[1] 0.6210526

After permutation, each of the permuted classes should contain on average ~23% of samples from the “Bo” subtype, and ~77% of sample from other subtypes (the precise numbers fary between repetitions of the random sampling).

Note that in the two-groups configuration we already have 64% of equality between the original and permuted labels. This rate is higher than 50% because the groups are unbalanced: a label “other” will frequently be permuted with another label “other”.

We can now run the leave-one-out (LOO) cross-validation with the permuted labels.

## Run LDA in cross-validation (LOO) mode with the permuted labels
lda.loo.labels.perm.2gr <- lda(t(denboer2009.expr[selected.genes,]),sample.labels.perm.2gr,CV=TRUE) 

## Build a contingency table of known versus predicted class
loo.predicted.class.labels.perm.2gr <- as.vector(lda.loo.labels.perm.2gr$class)
lda.loo.labels.perm.2gr.xtab <- table(sample.labels.perm.2gr, loo.predicted.class.labels.perm.2gr)
print(lda.loo.labels.perm.2gr.xtab)
                      loo.predicted.class.labels.perm.2gr
sample.labels.perm.2gr  Bo   o
                    Bo  10  34
                    o   13 133
## Compute the hit rate of the classifier trained with permuted labels
(hit.rate.label.perm.2gr <- sum(diag(lda.loo.labels.perm.2gr.xtab)) / sum(lda.loo.labels.perm.2gr.xtab))
[1] 0.7526316

It might seem surprizing to obtain an apparently good hit rate (~75%) with the permutation test. The lda() trained with permuted labels achieved a better classification than the expected hit for a random assignation. Does it mean that the program was able to learn anything from randomly permuted labels?

Actually, the permuted samples still contain a informative indication: the classes are unbalanced, with 77% of the samples belonging to the group labelled “other”. A “gambler” classifier that would assign all elements to the majority class would always achieve a better hit rate than a random classifier. On average, this “gambler” classifier would achieve a 77% hit rate in an unbalanced two-groups configuration like ours.

The lda() classifier cannot be properly speaking qualified of gambler. However, the confusion table reveals that, indeed, during the LOO test, the lda() classified tends to assign many samples to the “other” (more than their prior frequency), which explains its relatively high hit rate of ~75% in this permutation test.

In summary, we should always be very cautious when interpreting the results of supervised classification. Indeed, an apparently good hit rate might be achieved even in absence of any informative difference between the groups, by simple effects of the group sizes. This effect is particularly important for a two-group classification, but can also appear in multi-group classification with strongly unbalanced groups.

The reliability of a classifier should thus be estimated by comparing its performances on a real dataset with the results obtained in permutation tests.


Exercises

Exercise: Impact of the number of training variables

Test the impact of the number of variables used for training the classifier. Draw a plot representing the hit rate (Y axis) as a function of the number of top genes used for the LDA classification, using different selection criteria:

* Gene-wise variance
  * P-value of a Welch test (T-cells against all other types)
* ANOVA

Random expectation

  1. Generate a randomized expression matrix by sampling the values of the Den Boer dataset with the R function sample() (as we did in the practical on multiple testing corrections).

  2. Run the same procedure with this randomized matrix as we did above with the original expression matrix, and calculate hit rate with increasing number of variables selected by

  1. Variance
  2. Welch test (T-cells against all other types)
  1. Compare the hit rates obtained with the permuted matrix and with the original expression matrix.

Alternative methods for supervised classification

So far we used the historical method for supervised classification, Linear Discriminant Analysis (LDA), developed by Ronald Fisher in 1936. Many other methods have been developed to address the same problem.

As an additional exercise, we propose to estimate the accuracy of alternative classification methods and compare their performances to LDA.

Quadratic discrminant analysis (QDA)

(n.variables <- length(selected.genes))
[1] 20
## train a quadratic disctiminant analysis (QDA) classifier
one.vs.others.qda.loo <- qda(t(denboer2009.expr[selected.genes,]),one.vs.others,CV=TRUE) 
table(one.vs.others, one.vs.others.qda.loo$class)
             
one.vs.others  Bo   o
           Bo  26  18
           o    1 145
(one.vs.others.qda.loo.hit.rate <- sum(one.vs.others == one.vs.others.qda.loo$class)/n)
[1] 0.9
## Permutation test on labels with quadratic disctiminant analysis (QDA) classifier
label.perm.one.vs.others.qda.loo <- qda(t(denboer2009.expr[selected.genes,]),
                                        sample(one.vs.others, replace=FALSE),
                                        CV=TRUE) 
table(one.vs.others, label.perm.one.vs.others.qda.loo$class)
             
one.vs.others  Bo   o
           Bo   1  43
           o   18 128
(label.perm.one.vs.others.qda.loo.hit.rate <- 
   sum(one.vs.others == label.perm.one.vs.others.qda.loo$class)/n)
[1] 0.6789474
## Redo the same with LDA, for comparison
one.vs.others.lda.loo <- lda(t(denboer2009.expr[selected.genes,]),one.vs.others,CV=TRUE) 
table(one.vs.others, one.vs.others.lda.loo$class)
             
one.vs.others  Bo   o
           Bo  37   7
           o   13 133
(one.vs.others.lda.loo.hit.rate <- sum(one.vs.others == one.vs.others.lda.loo$class)/n)
[1] 0.8947368

K-nearest-neighbours (KNN) classifier}

TO BE DONE

Support vector machines}

TO BE DONE


Authors: Denis Puthier & Jacques van Helden