Here we will use the GSE13425 experiment, which was retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemias (ALL), which are characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.

Data were produced using Affymetrix GeneChip arrays (Affymetrix Human Genome U133A Array, HG-U133A). Information about this platform is available on the GEO website under identifier GPL96.

Loading data into R

Have a look at the description of the read.table function.
Load the expression matrix (GSE13425_Norm_Whole.txt), the A/P/M matrix (GSE13425_AMP_Whole.txt) and phenotypic data into R using the read.table function (assign the results to objects named data, amp, and pheno respectively).

Solution

?read.table
data <- read.table("GSE13425_Norm_Whole.txt", sep="\t", header=TRUE, row.names=1)
amp <- read.table("GSE13425_AMP_Whole.txt", sep="\t", header=TRUE, row.names=1)
pheno <- read.table("phenoData_GSE13425.tab", sep="\t", header=TRUE, row.names=1)

The file GSE13425_Norm_Whole.txt contains genes as rows and samples as columns. Data were previously normalized using the RMA algorithm (they are thus log2-transformed). The phenoData_GSE13425.tab file contains phenotypic data about the samples. The GSE13425_AMP_Whole.txt file contains the Absent/Marginal/Present (A/M/P) detection calls (genes as rows and samples as columns).
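
Since the three files describe the same probes and samples, it can be worth checking that they line up before going further. The following is a minimal sketch on toy stand-ins for data, amp and pheno (all values, probe names and sample names below are hypothetical, not taken from GSE13425); the same two checks could be run on the real objects.

```r
# Toy stand-ins for the three objects (hypothetical values and names).
set.seed(1)
samples <- paste0("GSM", 1:4)
genes <- paste0("probe_", 1:5)
data <- as.data.frame(matrix(rnorm(20, mean = 8), nrow = 5,
                             dimnames = list(genes, samples)))
amp <- as.data.frame(matrix(sample(c("A", "M", "P"), 20, replace = TRUE),
                            nrow = 5, dimnames = list(genes, samples)),
                     stringsAsFactors = FALSE)
pheno <- data.frame(Sample_title = c("T-ALL", "T-ALL", "pre-B ALL", "pre-B ALL"),
                    row.names = samples)

# Same probes and same samples, in the same order, in both matrices?
same_layout <- identical(rownames(data), rownames(amp)) &&
  identical(colnames(data), colnames(amp))
# One phenotype row per sample, in the same order as the columns of data?
pheno_ok <- identical(rownames(pheno), colnames(data))

same_layout
pheno_ok
```

If either check returns FALSE on the real data, the row or column order should be fixed before filtering or relabelling.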

Selecting a subset of genes

Selecting using the A/P/M criteria

First, we will select genes giving a significant signal (a "P", i.e. Present, call) in at least 19 samples.

Solution

isPresent <- amp == "P"
ind <- rowSums(isPresent) >= 19  
data <- data[ind, ]
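
To see how this filter works, here is a small sketch on a toy call matrix (the calls, probe names and the threshold of 2 are hypothetical and chosen only for illustration). Comparing the matrix to "P" gives a logical matrix, and rowSums counts the TRUE values per probe.

```r
# Toy A/M/P call matrix: 3 probes x 4 samples (hypothetical values).
amp_toy <- matrix(c("P", "P", "P", "A",
                    "A", "M", "A", "A",
                    "P", "A", "P", "M"),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("probe_1", "probe_2", "probe_3"),
                                  paste0("s", 1:4)))

is_present <- amp_toy == "P"       # logical matrix, TRUE where the call is "P"
n_present <- rowSums(is_present)   # number of Present calls per probe
keep <- n_present >= 2             # toy threshold (19 is used on the real data)

n_present
names(keep)[keep]
```

On this toy matrix, probe_1 and probe_3 pass the filter, while probe_2 (never called Present) is discarded.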

Selecting using standard deviation

As clustering the whole gene matrix is rather computationally intensive, we will keep only the 30% of genes with the highest standard deviation across samples.

Select these genes.
Change the column names so that the new matrix contains information about sample types.
Write the data to disk (file GSE13425_sub_1.txt).

Solution

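# Per-gene standard deviation (named sdev to avoid masking the sd function)
sdev <- apply(data, 1, sd)
summary(sdev)
quantile(sdev, 0.7)
# Keep the 30% most variable genes
data <- data[sdev > quantile(sdev, 0.7), ]
# Assumes the rows of pheno are in the same order as the columns of data
colnames(data) <- paste(pheno$Sample_title, colnames(data), sep="| |")
write.table(data, "GSE13425_sub_1.txt", sep="\t", quote=FALSE, col.names=NA)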
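
The quantile-based selection can be checked on simulated data. This is a sketch only, assuming a random toy matrix of 100 probes and 6 samples (names and values are hypothetical): keeping the rows whose standard deviation exceeds the 70th percentile should retain the top 30% of probes.

```r
# Toy expression matrix: 100 probes x 6 samples (random, hypothetical values).
set.seed(42)
expr_toy <- matrix(rnorm(100 * 6), nrow = 100,
                   dimnames = list(paste0("probe_", 1:100), paste0("s", 1:6)))

probe_sd <- apply(expr_toy, 1, sd)   # one standard deviation per probe
cutoff <- quantile(probe_sd, 0.7)    # 70th percentile of the standard deviations
expr_top <- expr_toy[probe_sd > cutoff, ]

nrow(expr_top)   # 30 of the 100 probes are kept
```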

Hierarchical clustering with hclust

Euclidean distance is rarely used in the context of microarray analysis. A distance based on Pearson's correlation coefficient is generally preferred (Spearman's rank correlation coefficient may also be used). Let's visualize the sample-sample correlation matrix using a heatmap.

Solution

pear <- cor(data, method="pearson")
palette <-colorRampPalette(c("yellow", "black","blueviolet"))
library(lattice)
levelplot(pear,col.regions=palette, scales=list(cex=0.2))

# we can also store the result as a high quality pdf file
pdf("coor.pdf"); levelplot(pear,col.regions=palette, scales=list(cex=0.2)); dev.off()
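
As mentioned above, Spearman's rank correlation may be used instead of Pearson's. The sketch below, on a hypothetical toy matrix, shows the practical difference: Spearman's coefficient only depends on ranks, so a monotone but non-linear relationship between two samples still gives a coefficient of 1, whereas Pearson's coefficient drops below 1.

```r
# Toy matrix: 50 probes x 4 samples (random, hypothetical values).
set.seed(7)
expr_toy <- matrix(rnorm(50 * 4), nrow = 50,
                   dimnames = list(NULL, paste0("s", 1:4)))
# Make s2 a monotone but non-linear function of s1.
expr_toy[, "s2"] <- exp(expr_toy[, "s1"])

pear_toy <- cor(expr_toy, method = "pearson")
spear_toy <- cor(expr_toy, method = "spearman")

pear_toy["s1", "s2"]    # below 1: the relationship is not linear
spear_toy["s1", "s2"]   # 1: the ranks of s1 and s2 are identical
```

On the real data, replacing method="pearson" with method="spearman" in the cor call is enough to switch between the two.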

Pearson's correlation coefficient is bounded between -1 and 1. We can transform it into a distance bounded between 0 and 1 (0 for perfectly correlated samples, 1 for perfectly anti-correlated samples) using the following command:

Solution

pear <- as.dist((1-pear)/2)
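
A small worked example may help to see what this transformation does. The three toy vectors below are hypothetical: b is a linear function of a (correlation 1), and c is the opposite of a (correlation -1).

```r
# Three toy "samples" measured on four "genes" (hypothetical values).
x <- c(1, 2, 3, 4)
toy <- cbind(a = x, b = 2 * x + 1, c = -x)

r_toy <- cor(toy)                  # pairwise Pearson correlations
d_toy <- as.dist((1 - r_toy) / 2)  # same transformation as above

as.matrix(d_toy)
# a and b are perfectly correlated, so their distance is 0;
# a and c are perfectly anti-correlated, so their distance is 1.
```

Uncorrelated samples (r = 0) would end up at a distance of 0.5.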

Using this distance matrix, we will perform hierarchical clustering of the samples with the hclust function.

Solution

hp <- hclust(pear, method="average")
pdf("hp.pdf")
plot(hp, hang=-1, labels=pheno$Sample_title, cex=0.2)
dev.off()
system("evince hp.pdf&")
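
Beyond plotting the tree, it is sometimes useful to extract sample groups from it. This step is not part of the exercise; the sketch below is one possible follow-up, using the cutree function from the stats package on simulated data (two hypothetical groups of three samples, named A1 to A3 and B1 to B3).

```r
# Simulate two groups of three samples over 200 genes (hypothetical data).
set.seed(3)
base1 <- rnorm(200)   # shared profile of group A
base2 <- rnorm(200)   # shared profile of group B
toy <- cbind(sapply(1:3, function(i) base1 + rnorm(200, sd = 0.2)),
             sapply(1:3, function(i) base2 + rnorm(200, sd = 0.2)))
colnames(toy) <- c(paste0("A", 1:3), paste0("B", 1:3))

# Correlation-based distance and average linkage, as in the solution above.
d_toy <- as.dist((1 - cor(toy)) / 2)
hc_toy <- hclust(d_toy, method = "average")

# Cut the tree into two groups and list the group of each sample.
groups <- cutree(hc_toy, k = 2)
groups
```

On the real data, cutree(hp, k = ...) could be cross-tabulated with pheno$Sample_title to compare the clusters with the known sample types; the number of groups k is a choice left to the analyst.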

Hierarchical clustering with the cluster and treeview software

R is not particularly well suited to visualizing classification results for very large datasets. We will thus install the cluster and treeview software, which are very handy for browsing the results of a hierarchical clustering.

# These are shell/bash commands !
mkdir -p ~/bin
cd ~/bin
wget http://bonsai.hgc.jp/~mdehoon/software/cluster/cluster-1.50.tar.gz
tar xvfz cluster-1.50.tar.gz
cd cluster-1.50
./configure --without-x
make
echo -e "\nalias cluster=$PWD/src/cluster" >> ~/.bashrc
cd ..
wget http://sourceforge.net/projects/jtreeview/files/jtreeview/1.1.6r2/TreeView-1.1.6r2-bin.tar.gz
tar xvfz TreeView-1.1.6r2-bin.tar.gz
echo "alias javatreeview='java -jar  $PWD/TreeView-1.1.6r2-bin/TreeView.jar'" >> ~/.bashrc
source  ~/.bashrc
cd -

Using the cluster software, compute hierarchical clustering both on genes and samples.

Solution

cluster -f GSE13425_sub_1.txt -g 2 -e 2 -m a  -cg m
javatreeview

Now you can open the .cdt file produced by cluster with javatreeview. How are the samples classified? What can you say about the gene classification?

Microarray data analysis: unsupervized clustering.

Content

Retrieving the den Boer normalized dataset

Loading data into R

Solution

Selecting a subset of genes

Selecting using the A/P/M criteria

Solution

Selecting using standard deviation

Solution

Hierarchical clustering with `hclust`

Solution

Solution

Solution

Hierarchical clustering with the cluster and treeview software.

Solution
