Microarray data analysis: unsupervised clustering


Retrieving the den Boer normalized dataset

Here we will use the GSE13425 experiment, which was retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemias (ALL), which are characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.

Data were produced using Affymetrix GeneChips (Affymetrix Human Genome U133A Array, HGU133A). Information about this platform is available on the GEO website under identifier GPL96.

Download the full normalized dataset, together with the A/P/M calls and the phenotypic data.

wget http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/GSE13425_Norm_Whole.txt
wget http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/GSE13425_AMP_Whole.txt
wget http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/phenoData_GSE13425.tab

Loading data into R

Start R.

The file GSE13425_Norm_Whole.txt contains genes as rows and samples as columns. Data were previously normalized using the RMA algorithm (they are thus log2-transformed). The phenoData_GSE13425.tab file contains phenotypic data about the samples. The GSE13425_AMP_Whole.txt file contains the A/P/M calls (genes as rows and samples as columns).
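
The loading step can be sketched as follows. This is a minimal example and not necessarily the course's own solution: the helper name read_expr_table is hypothetical, and the assumed file layout (tab-delimited, one header row, probe identifiers in the first column) should be checked against the downloaded files. A tiny temporary file stands in for the real data so that the snippet runs anywhere.

```r
# Hypothetical helper: read a tab-delimited table with a header row and
# row names (probe IDs) in the first column, and return it as a matrix.
# The file layout is an assumption; inspect the real files with head().
read_expr_table <- function(path) {
  as.matrix(read.table(path, sep = "\t", header = TRUE, row.names = 1,
                       check.names = FALSE, quote = "", comment.char = ""))
}

# Tiny stand-in file (invented values) to show the expected result.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("probe\tS1\tS2",
             "p1\t7.1\t8.2",
             "p2\t5.0\t5.5"), demo_file)
demo <- read_expr_table(demo_file)
print(dim(demo))   # genes x samples

# With the real files in the working directory, the same call applies.
if (file.exists("GSE13425_Norm_Whole.txt")) {
  expr <- read_expr_table("GSE13425_Norm_Whole.txt")
  print(dim(expr))
}
```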

Selecting a subset of genes

Selecting using the A/P/M criteria

First, we will select genes giving a significant signal in a given number of samples.
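
One possible way to apply this filter is sketched below, assuming that "significant signal" means a Present ("P") call. The toy matrix and the threshold min_present are invented for illustration; the real calls would come from the A/P/M file, and the number of samples to require is a choice left to the analyst.

```r
# Toy A/P/M call matrix (genes x samples) with invented values.
amp <- matrix(c("P", "P", "A", "P",
                "A", "A", "M", "A",
                "P", "M", "P", "P"),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), paste0("S", 1:4)))

# Hypothetical threshold: a gene is kept if it is called Present ("P")
# in at least this many samples.
min_present <- 3

# Count Present calls per gene and build a logical selection vector.
n_present <- rowSums(amp == "P")
keep_amp <- n_present >= min_present

print(n_present)
print(names(which(keep_amp)))
```

The resulting logical vector can then be used to subset the rows of the expression matrix, provided both matrices list the genes in the same order.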

Selecting using standard deviation

As clustering the whole gene matrix is rather computationally intensive, we will keep only 30% of the genes, selected on the basis of their standard deviation across samples.
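
A minimal sketch of this selection is given below, assuming that the genes to keep are the 30% with the highest standard deviation (the most variable ones). The matrix expr is simulated here and simply stands in for the normalized expression matrix.

```r
# Simulated expression matrix (genes x samples), a stand-in for real data.
set.seed(1)
expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(paste0("g", 1:200), paste0("S", 1:6)))

# Standard deviation of each gene across samples.
gene_sd <- apply(expr, 1, sd)

# Keep the 30% most variable genes (assumption: highest SD is wanted).
n_keep <- round(0.30 * nrow(expr))
top <- order(gene_sd, decreasing = TRUE)[seq_len(n_keep)]
expr_sel <- expr[top, , drop = FALSE]

print(dim(expr_sel))
```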

Hierarchical clustering with hclust

Euclidean distance is rarely used in the context of microarray analysis. A distance based on Pearson's correlation coefficient is generally preferred (Spearman's rank correlation coefficient may also be used). Let's visualize the sample-sample correlation matrix using a heatmap.
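
The correlation heatmap could be drawn along the following lines. This is only a sketch on simulated data: the color palette and the use of the base heatmap function are choices made here, not something prescribed by the course. In a non-interactive session the plot is written to the default graphics device.

```r
# Simulated expression matrix (genes x samples), a stand-in for real data.
set.seed(1)
expr <- matrix(rnorm(100 * 8), nrow = 100,
               dimnames = list(paste0("g", 1:100), paste0("S", 1:8)))

# cor() works column-wise, so this gives the sample-sample matrix.
# Use method = "spearman" for the rank-based alternative.
sample_cor <- cor(expr, method = "pearson")

# symm = TRUE treats the matrix as symmetric; scale = "none" keeps the
# correlation values as they are instead of rescaling rows.
heatmap(sample_cor, symm = TRUE, scale = "none", col = heat.colors(50))
```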

Pearson's correlation coefficient r is bounded between -1 and 1. It can be transformed into a distance, a common choice being d = 1 - r, which ranges from 0 (perfectly correlated samples) to 2 (perfectly anti-correlated samples).
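
As an illustration, the transformation can be written as below, assuming the common choice of one minus the correlation. Other transformations exist (for instance one minus the absolute correlation), so this is one option among several. The data are simulated.

```r
# Simulated expression matrix (genes x samples), a stand-in for real data.
set.seed(1)
expr <- matrix(rnorm(100 * 8), nrow = 100,
               dimnames = list(paste0("g", 1:100), paste0("S", 1:8)))

# Correlation-based distance between samples: 0 when r = 1, 2 when r = -1.
# as.dist() converts the square matrix into the "dist" object that
# clustering functions such as hclust() expect.
d <- as.dist(1 - cor(expr, method = "pearson"))

print(range(d))
```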

We will now pass this distance matrix to the hclust function to perform hierarchical clustering of the samples.
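
A possible version of this step is sketched below on simulated data containing two groups of samples. The average linkage method and the cut into two clusters are assumptions made for the example; the text does not state which linkage to use.

```r
# Simulated data: samples A1-A3 share one profile, B1-B3 another.
set.seed(42)
n_genes <- 50
profile_a <- rnorm(n_genes)
profile_b <- rnorm(n_genes)
expr <- cbind(A1 = profile_a + rnorm(n_genes, sd = 0.1),
              A2 = profile_a + rnorm(n_genes, sd = 0.1),
              A3 = profile_a + rnorm(n_genes, sd = 0.1),
              B1 = profile_b + rnorm(n_genes, sd = 0.1),
              B2 = profile_b + rnorm(n_genes, sd = 0.1),
              B3 = profile_b + rnorm(n_genes, sd = 0.1))

# Correlation-based distance between samples.
d <- as.dist(1 - cor(expr))

# Hierarchical clustering (average linkage is an assumption here).
hc <- hclust(d, method = "average")
plot(hc, main = "Sample clustering", xlab = "", sub = "")

# Cut the tree into two groups and list the members of each.
groups <- cutree(hc, k = 2)
print(split(names(groups), groups))
```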

Hierarchical clustering with the cluster and treeview software

R is not particularly well suited to visualizing classification results for very large datasets. We will thus install the cluster and treeview software, which are very handy for browsing the results of a hierarchical clustering.

# These are shell/bash commands !
mkdir -p ~/bin
cd ~/bin
wget http://bonsai.hgc.jp/~mdehoon/software/cluster/cluster-1.50.tar.gz
tar xvfz cluster-1.50.tar.gz
cd cluster-1.50
./configure --without-x
make
echo -e "\nalias cluster=$PWD/src/cluster" >> ~/.bashrc
cd ..
wget http://sourceforge.net/projects/jtreeview/files/jtreeview/1.1.6r2/TreeView-1.1.6r2-bin.tar.gz
tar xvfz TreeView-1.1.6r2-bin.tar.gz
echo "alias javatreeview='java -jar  $PWD/TreeView-1.1.6r2-bin/TreeView.jar'" >> ~/.bashrc
source  ~/.bashrc
cd -

Using the cluster software, compute hierarchical clustering both on genes and samples.
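
The cluster program reads a tab-delimited text file whose first column holds the gene identifiers and whose first row holds the sample names. The sketch below writes such a file from R and prints a command line to run in the shell. The file name, the simulated matrix, and the chosen options are assumptions: the flags shown (-f input file, -g 2 and -e 2 for Pearson correlation on genes and samples, -m a for average linkage) follow the Cluster 3.0 command-line documentation as understood here and should be checked with cluster --help.

```r
# Simulated expression matrix (genes x samples), a stand-in for the
# filtered matrix obtained in the previous steps.
set.seed(1)
expr <- matrix(rnorm(20 * 4), nrow = 20,
               dimnames = list(paste0("g", 1:20), paste0("S", 1:4)))

# Put the gene identifiers in a first column named ID.
out <- data.frame(ID = rownames(expr), expr, check.names = FALSE)

# Hypothetical file name, written to a temporary directory here.
infile <- file.path(tempdir(), "GSE13425_sel.txt")
write.table(out, infile, sep = "\t", quote = FALSE, row.names = FALSE)

# Command to run in the shell (not executed from R in this sketch).
cmd <- paste("cluster -f", shQuote(infile), "-g 2 -e 2 -m a")
cat(cmd, "\n")
```

If the run succeeds, cluster is expected to write a .cdt file next to the input, along with .gtr and .atr files describing the gene and sample trees.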

Now you can open the .cdt file produced by cluster using javatreeview. How are the samples classified? What can you say about the gene classification?