Abbrev Meaning
ALL acute lymphoblastic leukemia
AML acute myeloblastic leukemia
DEG differentiall expressed genes
GEO Gene Expression Omnibus database
KNN K-nearest neighbours
LDA linear discriminant analysis
QDA quadratic discriminant analysis
PCA principal component anlaysis
LOO leave-one-out
## Check the requirement for some packages
packages <- c("knitr")
for (pkg in packages) {
  if (!suppressPackageStartupMessages(require(pkg, quietly=TRUE, character.only = TRUE))) {
  library(pkg, character.only = TRUE)

## Install some bioconductor packages
pkg <- "qvalue"
if (!suppressPackageStartupMessages(require(pkg, quietly=TRUE, character.only = TRUE))) {
library(pkg, character.only = TRUE)

## Load the curstom function t.test.multi() - see protocol on differential expression
## Load a custom the library for multiple t tests
url.stats4bioinfo <- ""
source(file.path(url.stats4bioinfo, 'R-files/config.R'))
[1] "Data repository"
[1] "R scripts source"
[1] "Results will be saved to /Users/jvanheld/course_stats_bioinfo/results"
[1] "Figures will be saved to /Users/jvanheld/course_stats_bioinfo/figures"
source(file.path(url.stats4bioinfo, 'R-files/util/util_student_test_multi.R'))

Goal of this tutorial

In this tutorial, we will put in practice some fundamental concepts of supervised classification.

  1. Distinction between unsupervised (clustering) and supervised classification.
  2. Steps of an analysis: training, testing and prediction.
  3. Evaluation of the classification: cross-validation (CV), leave-one-out (LOO).
  4. The problem of over-dimensionality and the risk of overfitting
  5. Feature selection methods

Study case

For the practical, we will use a cohort comprized of 190 samples from patients suffering from Acute Lymphoblastic Leukemia (ALL) from DenBoer (2009). The raw data has previously been retrieved from the Gene Expression Omnibus (GEO) database (

R configuration

  • Open a terminal
  • Start R

We first need to define the URL of the course (dir.base), from which we will download some pieces of R code and the data sets to analyze.

## Specify the URL of the base for the course
dir.base <- ''
dir.base.ASG1 <- ''

The following command loads a general configuration file, specifying the input directories (data, **R*** utilities) and creating output directories on your computer (the corresponding paths are stored in the variables dir.results and dir.figures).

## Load the general configuration file
source(file.path(dir.base, 'R-files', 'config.R'))
[1] "Data repository"
[1] "R scripts source"
[1] "Results will be saved to /Users/jvanheld/course_stats_bioinfo/results"
[1] "Figures will be saved to /Users/jvanheld/course_stats_bioinfo/figures"
print(paste("Result directory", dir.results))
[1] "Result directory /Users/jvanheld/course_stats_bioinfo/results"

Data loading

We can now load the expression matrix, and check its dimensions.

Beware: this expression matrix weights 20Mb. The download can thus take time, depending on your internet connection. To avoid downloading it repeadedly, we will stored it in our local directory.

# ## Define the location of data directory and file containing expression profiles
# denboer.url <- file.path(dir.base, 'data', 'gene_expression','denboer_2009')
denboer.url <- file.path(dir.base.ASG1, "data", "marrays")
url.expr.matrix <- file.path(denboer.url, '')

## Download and store a local copy of the expression matrix if not yet there.
file.expr.matrix <- file.path(dir.results, "")
if (!file.exists(file.expr.matrix)) {
  message("Downloading expression matrix from URL: ", url.expr.matrix)
  download.file(url = url.expr.matrix, destfile = file.expr.matrix)
  message("Local copy of expression matrix stored in file: ", file.expr.matrix)

## Load the dataset from denboer2009
message("Loading expression matrix from file ", file.expr.matrix)
denboer2009.expr <- read.table(file.expr.matrix, sep = "\t", head = T, row = 1)
[1] 22283   190
## Should give this: 22283   190

## Load the pheno table, which provides information about all the samples
## Load phenotypic data
pheno.url <- file.path(denboer.url, '')
pheno.file <- file.path(dir.results, '')
if (!file.exists(pheno.file)) {
  message("Downloading pheno table from URL: ", pheno.url)
  download.file(url = pheno.url, destfile = pheno.file)  
  message("Storing local copy of pheno table in file: ", pheno.file)
message("Loading pheno data from ", pheno.file)
denboer2009.pheno <- read.table(pheno.file, sep='\t', head=TRUE, row=1, comment.char="")

## Load a file with group descriptions (short label, color)
group.descr.url <- file.path(denboer.url, '')
group.descr.file <- file.path(dir.results, '')
if (!file.exists(group.descr.file)) {
  message("Downloading group descriptions from URL: ", group.descr.url)
  download.file(url = group.descr.url, destfile = group.descr.file)  
  message("Storing local copy of group descriptions in file: ", group.descr.file)
message("Loading group descriptions from ", group.descr.file)
group.descriptions <- read.table(group.descr.file, sep='\t', head=TRUE, row=1)

Once the whole data set has been loaded, the data frame denboer2009.expr should contain 22,283 rows (genes) and 190 columns (samples).

We can now inspect the “phenotypic” data.

Remark: the data frame named denboer2009.pheno contains the information defined as “phenotypic” for general analysis of microarrays. However, in the case of DenBoer data set, the different classes of ALL are characterized by genotypic characteristics (the mutation that caused the leukemia) rather than by phenotypic traits.

## Get the list of column names in the pheno table.
[1] "Sample.title"               ""     "Sample.characteristics.ch1" "Sample.description"         "sample.labels"              "sample.colors"             

The column Sample.title indicates the cancer subtype corresponding to each sample. We can count the number of samples per subtype, and display them by decreasing group size.

## Print the subtypes associated to each sample
print(sample(denboer2009.pheno$Sample.title, size = 10))
 [1] T-ALL               TEL-AML1            T-ALL               pre-B ALL           E2A-rearranged (EP) pre-B ALL           T-ALL               TEL-AML1            T-ALL               TEL-AML1           
Levels: BCR-ABL BCR-ABL + hyperdiploidy E2A-rearranged (E-sub) E2A-rearranged (E) E2A-rearranged (EP) hyperdiploid MLL pre-B ALL T-ALL TEL-AML1 TEL-AML1 + hyperdiploidy
## Print the number of samples per cancer type
n.Var1 n.Freq
hyperdiploid 44
pre-B ALL 44
T-ALL 36
E2A-rearranged (EP) 8
E2A-rearranged (E-sub) 4
BCR-ABL + hyperdiploidy 1
E2A-rearranged (E) 1
TEL-AML1 + hyperdiploidy 1

For the sake of visualization, we also defined short sample labels corresponding to each ALL subtype.

## Print the subtypes associated to each sample
sample.labels <- as.vector(denboer2009.pheno$sample.labels)
sample.colors <- as.vector(denboer2009.pheno$sample.colors)

print(sample(sample.labels, size = 10))
 [1] "BE" "T"  "Bt" "Bt" "Bo" "T"  "Bo" "Bh" "BM" "Bt"
## Print the number of samples per cancer type
n.sample.labels n.Freq
Bh 44
Bo 44
Bt 43
T 36
BEp 8
Bc 4
BEs 4
BM 4
Bch 1
BE 1
Bth 1

The correspondence between sample titles (subtypes) and sample labels (abbreviated subtypes) is stored in group.labels. We also defined group-specific colors (group.colors) to highlight the different subtypes of ALL on the plots.

## Print the correspondences in a more readable way
group.labels group.colors
T-ALL T darkgreen
TEL-AML1 Bt red
TEL-AML1 + hyperdiploidy Bth blue
hyperdiploid Bh brown
E2A-rearranged (E) BE green
E2A-rearranged (EP) BEp orange
E2A-rearranged (E-sub) BEs darkgray
BCR-ABL Bc cyan
BCR-ABL + hyperdiploidy Bch black
MLL BM violet
pre-B ALL Bo darkblue
group.labels <- group.descriptions$group.labels
names(group.labels) <- row.names(group.descriptions)
group.colors <- group.descriptions$group.colors
names(group.colors) <- row.names(group.descriptions)
group.legend <- paste(sep=" = ",group.labels, row.names(group.descriptions))

Reducing the dimensionality of the data - feature selection

An recurrent problem with microarray data is the large dimensionality of the data space. The dataset we are analyzing contains 190 samples (the subjects), each characterized by 22,283 gene expression values (the variables).

The number of variables (also called features, the genes in our case) thus exceeds by far the number of objects (samples in this case). This sitation is qualified of over-dimensionality, and poses serious problems for classification.

In unsupervised classification (clustering), the relationships between objects will be affected, since a vast majority of the features are likely to be uninformative, but will however contribute to the computed (dis)similarity metrics (whichever metrics is used). Overdimensionality will somewhat mask the signal (biologically relevant relationships between gene groups) with noise. In supervised classification, the effect may be even worse: a program will be trained to recognize classes on the basis of supriousn differences found in any combination of the input variables.

It is thus essential, for both unsupervised and unsupervised classification, to perform some feature selection before applying the actual classification.

An arbitrary pair of genes

To get some feeling about the data, we will compare the expression levels of two genes selected in an arbitrary way (resp. \(236^{th}\) and \(1213^{th}\) rows of the profile table).

## Plot the expression profiles of two arbitrarily selected genes
g1 <- 236
g2 <- 1213
x <- as.vector(as.matrix(denboer2009.expr[g1,]))
y <- as.vector(as.matrix(denboer2009.expr[g2,]))
       main="Den Boer (2009), two arbitrary genes", 
      xlab=paste("gene", g1), ylab=paste("gene", g2))
text(x, y,labels=sample.labels,col=sample.colors,pch=0.5)