Abbrev | Meaning |
---|---|
ALL | acute lymphoblastic leukemia |
AML | acute myeloblastic leukemia |
DEG | differentiall expressed genes |
GEO | Gene Expression Omnibus database |
KNN | K-nearest neighbours |
LDA | linear discriminant analysis |
QDA | quadratic discriminant analysis |
PCA | principal component anlaysis |
LOO | leave-one-out |
CV |
## Check the requirement for some packages
packages <- c("knitr")
for (pkg in packages) {
if (!suppressPackageStartupMessages(require(pkg, quietly=TRUE, character.only = TRUE))) {
install.packages(pkg)
}
library(pkg, character.only = TRUE)
}
## Install some bioconductor packages
pkg <- "qvalue"
if (!suppressPackageStartupMessages(require(pkg, quietly=TRUE, character.only = TRUE))) {
source("http://bioconductor.org/biocLite.R")
biocLite();
biocLite(pkg)
}
library(pkg, character.only = TRUE)
## Load the curstom function t.test.multi() - see protocol on differential expression
## Load a custom the library for multiple t tests
url.stats4bioinfo <- "http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics"
source(file.path(url.stats4bioinfo, 'R-files/config.R'))
[1] "Data repository http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/data"
[1] "R scripts source http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/R-files"
[1] "Results will be saved to /Users/jvanheld/course_stats_bioinfo/results"
[1] "Figures will be saved to /Users/jvanheld/course_stats_bioinfo/figures"
source(file.path(url.stats4bioinfo, 'R-files/util/util_student_test_multi.R'))
In this tutorial, we will put in practice some fundamental concepts of supervised classification.
For the practical, we will use a cohort comprized of 190 samples from patients suffering from Acute Lymphoblastic Leukemia (ALL) from DenBoer (2009). The raw data has previously been retrieved from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/).
We first need to define the URL of the course (dir.base), from which we will download some pieces of R code and the data sets to analyze.
## Specify the URL of the base for the course
dir.base <- 'http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics'
dir.base.ASG1 <- 'http://pedagogix-tagc.univ-mrs.fr/courses/ASG1'
The following command loads a general configuration file, specifying the input directories (data, **R*** utilities) and creating output directories on your computer (the corresponding paths are stored in the variables dir.results
and dir.figures
).
## Load the general configuration file
source(file.path(dir.base, 'R-files', 'config.R'))
[1] "Data repository http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/data"
[1] "R scripts source http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/R-files"
[1] "Results will be saved to /Users/jvanheld/course_stats_bioinfo/results"
[1] "Figures will be saved to /Users/jvanheld/course_stats_bioinfo/figures"
setwd(dir.results)
print(paste("Result directory", dir.results))
[1] "Result directory /Users/jvanheld/course_stats_bioinfo/results"
We can now load the expression matrix, and check its dimensions.
Beware: this expression matrix weights 20Mb. The download can thus take time, depending on your internet connection. To avoid downloading it repeadedly, we will stored it in our local directory.
# ## Define the location of data directory and file containing expression profiles
# denboer.url <- file.path(dir.base, 'data', 'gene_expression','denboer_2009')
denboer.url <- file.path(dir.base.ASG1, "data", "marrays")
url.expr.matrix <- file.path(denboer.url, 'GSE13425_Norm_Whole.tab')
## Download and store a local copy of the expression matrix if not yet there.
file.expr.matrix <- file.path(dir.results, "GSE13425_Norm_Whole.tab")
if (!file.exists(file.expr.matrix)) {
message("Downloading expression matrix from URL: ", url.expr.matrix)
download.file(url = url.expr.matrix, destfile = file.expr.matrix)
message("Local copy of expression matrix stored in file: ", file.expr.matrix)
}
## Load the dataset from denboer2009
message("Loading expression matrix from file ", file.expr.matrix)
denboer2009.expr <- read.table(file.expr.matrix, sep = "\t", head = T, row = 1)
print(dim(denboer2009.expr))
[1] 22283 190
## Should give this: 22283 190
## Load the pheno table, which provides information about all the samples
## Load phenotypic data
pheno.url <- file.path(denboer.url, 'phenoData_GSE13425.tab')
pheno.file <- file.path(dir.results, 'phenoData_GSE13425.tab')
if (!file.exists(pheno.file)) {
message("Downloading pheno table from URL: ", pheno.url)
download.file(url = pheno.url, destfile = pheno.file)
message("Storing local copy of pheno table in file: ", pheno.file)
}
message("Loading pheno data from ", pheno.file)
denboer2009.pheno <- read.table(pheno.file, sep='\t', head=TRUE, row=1, comment.char="")
## Load a file with group descriptions (short label, color)
group.descr.url <- file.path(denboer.url, 'GSE13425_group_descriptions.tab')
group.descr.file <- file.path(dir.results, 'GSE13425_group_descriptions.tab')
if (!file.exists(group.descr.file)) {
message("Downloading group descriptions from URL: ", group.descr.url)
download.file(url = group.descr.url, destfile = group.descr.file)
message("Storing local copy of group descriptions in file: ", group.descr.file)
}
message("Loading group descriptions from ", group.descr.file)
group.descriptions <- read.table(group.descr.file, sep='\t', head=TRUE, row=1)
Once the whole data set has been loaded, the data frame denboer2009.expr
should contain 22,283 rows (genes) and 190 columns (samples).
We can now inspect the “phenotypic” data.
Remark: the data frame named denboer2009.pheno
contains the information defined as “phenotypic” for general analysis of microarrays. However, in the case of DenBoer data set, the different classes of ALL are characterized by genotypic characteristics (the mutation that caused the leukemia) rather than by phenotypic traits.
## Get the list of column names in the pheno table.
names(denboer2009.pheno)
[1] "Sample.title" "Sample.source.name.ch1" "Sample.characteristics.ch1" "Sample.description" "sample.labels" "sample.colors"
The column Sample.title indicates the cancer subtype corresponding to each sample. We can count the number of samples per subtype, and display them by decreasing group size.
## Print the subtypes associated to each sample
print(sample(denboer2009.pheno$Sample.title, size = 10))
[1] T-ALL TEL-AML1 T-ALL pre-B ALL E2A-rearranged (EP) pre-B ALL T-ALL TEL-AML1 T-ALL TEL-AML1
Levels: BCR-ABL BCR-ABL + hyperdiploidy E2A-rearranged (E-sub) E2A-rearranged (E) E2A-rearranged (EP) hyperdiploid MLL pre-B ALL T-ALL TEL-AML1 TEL-AML1 + hyperdiploidy
## Print the number of samples per cancer type
kable(data.frame("n"=sort(table(denboer2009.pheno$Sample.title),decreasing=T)))
n.Var1 | n.Freq |
---|---|
hyperdiploid | 44 |
pre-B ALL | 44 |
TEL-AML1 | 43 |
T-ALL | 36 |
E2A-rearranged (EP) | 8 |
BCR-ABL | 4 |
E2A-rearranged (E-sub) | 4 |
MLL | 4 |
BCR-ABL + hyperdiploidy | 1 |
E2A-rearranged (E) | 1 |
TEL-AML1 + hyperdiploidy | 1 |
For the sake of visualization, we also defined short sample labels corresponding to each ALL subtype.
## Print the subtypes associated to each sample
sample.labels <- as.vector(denboer2009.pheno$sample.labels)
sample.colors <- as.vector(denboer2009.pheno$sample.colors)
print(sample(sample.labels, size = 10))
[1] "BE" "T" "Bt" "Bt" "Bo" "T" "Bo" "Bh" "BM" "Bt"
## Print the number of samples per cancer type
kable(data.frame("n"=sort(table(sample.labels),decreasing=T)))
n.sample.labels | n.Freq |
---|---|
Bh | 44 |
Bo | 44 |
Bt | 43 |
T | 36 |
BEp | 8 |
Bc | 4 |
BEs | 4 |
BM | 4 |
Bch | 1 |
BE | 1 |
Bth | 1 |
The correspondence between sample titles (subtypes) and sample labels (abbreviated subtypes) is stored in group.labels. We also defined group-specific colors (group.colors) to highlight the different subtypes of ALL on the plots.
## Print the correspondences in a more readable way
kable(group.descriptions)
group.labels | group.colors | |
---|---|---|
T-ALL | T | darkgreen |
TEL-AML1 | Bt | red |
TEL-AML1 + hyperdiploidy | Bth | blue |
hyperdiploid | Bh | brown |
E2A-rearranged (E) | BE | green |
E2A-rearranged (EP) | BEp | orange |
E2A-rearranged (E-sub) | BEs | darkgray |
BCR-ABL | Bc | cyan |
BCR-ABL + hyperdiploidy | Bch | black |
MLL | BM | violet |
pre-B ALL | Bo | darkblue |
group.labels <- group.descriptions$group.labels
names(group.labels) <- row.names(group.descriptions)
group.colors <- group.descriptions$group.colors
names(group.colors) <- row.names(group.descriptions)
group.legend <- paste(sep=" = ",group.labels, row.names(group.descriptions))
An recurrent problem with microarray data is the large dimensionality of the data space. The dataset we are analyzing contains 190 samples (the subjects), each characterized by 22,283 gene expression values (the variables).
The number of variables (also called features, the genes in our case) thus exceeds by far the number of objects (samples in this case). This sitation is qualified of over-dimensionality, and poses serious problems for classification.
In unsupervised classification (clustering), the relationships between objects will be affected, since a vast majority of the features are likely to be uninformative, but will however contribute to the computed (dis)similarity metrics (whichever metrics is used). Overdimensionality will somewhat mask the signal (biologically relevant relationships between gene groups) with noise. In supervised classification, the effect may be even worse: a program will be trained to recognize classes on the basis of supriousn differences found in any combination of the input variables.
It is thus essential, for both unsupervised and unsupervised classification, to perform some feature selection before applying the actual classification.
To get some feeling about the data, we will compare the expression levels of two genes selected in an arbitrary way (resp. \(236^{th}\) and \(1213^{th}\) rows of the profile table).
## Plot the expression profiles of two arbitrarily selected genes
g1 <- 236
g2 <- 1213
x <- as.vector(as.matrix(denboer2009.expr[g1,]))
y <- as.vector(as.matrix(denboer2009.expr[g2,]))
plot(x,y,
col=sample.colors,
type="n",
panel.first=grid(col="black"),
main="Den Boer (2009), two arbitrary genes",
xlab=paste("gene", g1), ylab=paste("gene", g2))
text(x, y,labels=sample.labels,col=sample.colors,pch=0.5)
legend("topright",col=group.colors,
legend=group.legend,pch=1,cex=0.6,bg="white",bty="o")