Here we will use the GSE13425 experiment, which was retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemias (ALL), which are characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.
Data were produced using Affymetrix GeneChips (Affymetrix Human Genome U133A Array, HGU133A). Information related to this platform is available on the GEO website under identifier GPL96.
Download the full normalized dataset.
wget http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/GSE13425_Norm_Whole.txt
wget http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/GSE13425_AMP_Whole.txt
wget http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/phenoData_GSE13425.tab
The file GSE13425_Norm_Whole.txt contains genes as rows and samples as columns. Data were previously normalized using the RMA algorithm (they are thus log2-transformed). The phenoData_GSE13425.tab file contains phenotypic data about the samples. The GSE13425_AMP_Whole.txt file contains the A/P/M (Absent/Present/Marginal) calls (genes as rows and samples as columns).
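As a minimal sketch of how such a file could be loaded into R, the example below writes a tiny stand-in file and reads it back. The layout is an assumption: tab-delimited, one header row of sample names, probe identifiers in the first column. The toy values, the probe names and the GSM labels are hypothetical; with the real data, `tmp` would be replaced by "GSE13425_Norm_Whole.txt".

```r
# Hypothetical miniature stand-in for GSE13425_Norm_Whole.txt (4 probes x 3 samples)
set.seed(1)
toy <- matrix(round(rnorm(12, mean = 8), 2), nrow = 4,
              dimnames = list(paste0("probe", 1:4), paste0("GSM", 1:3)))
tmp <- tempfile(fileext = ".txt")
write.table(toy, file = tmp, sep = "\t", quote = FALSE, col.names = NA)

# Assumption: tab-delimited file, header row = samples, first column = probe IDs
expr <- as.matrix(read.table(tmp, sep = "\t", header = TRUE, row.names = 1))

dim(expr)    # number of genes (rows) and samples (columns)
head(expr)   # quick look at the first rows
```

Converting to a matrix with `as.matrix` is a convenience here, since the functions used later (`cor`, `apply`) work directly on numeric matrices.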
First, we will select genes giving a significant signal in a given number of samples.
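One possible way to do this filtering is to count, for each gene, the number of samples with a "P" (present) call and to keep the genes that reach a chosen minimum. The sketch below uses a small simulated call matrix. The threshold `min_present` is a hypothetical value chosen for illustration, not one prescribed by this practical.

```r
# Simulated A/P/M call matrix (6 probes x 5 samples), standing in for the real file
set.seed(1)
calls <- matrix(sample(c("A", "P", "M"), 6 * 5, replace = TRUE,
                       prob = c(0.5, 0.4, 0.1)),
                nrow = 6,
                dimnames = list(paste0("probe", 1:6), paste0("sample", 1:5)))

min_present <- 3                      # hypothetical threshold, to be adapted
n_present <- rowSums(calls == "P")    # number of "P" calls per gene
keep <- n_present >= min_present      # logical vector, one value per gene

filtered_calls <- calls[keep, , drop = FALSE]
table(keep)
```

With the real data, the same logical vector could be used to subset the rows of the expression matrix, provided both files list the probes in the same order (worth checking by comparing their row names).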
As the classification of the whole gene matrix is rather computationally intensive, we will select the 30% of genes with the highest standard deviation across samples.
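A minimal sketch of this selection, run on a simulated matrix (100 genes, 8 samples) rather than on the real data, and assuming that "based on standard deviation" means keeping the most variable genes:

```r
# Simulated expression matrix standing in for the filtered data
set.seed(1)
expr <- matrix(rnorm(100 * 8, mean = 8), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:8)))

gene_sd <- apply(expr, 1, sd)                 # standard deviation of each gene
n_keep <- round(0.3 * nrow(expr))             # 30% of the genes
top <- order(gene_sd, decreasing = TRUE)[seq_len(n_keep)]

expr_sel <- expr[top, ]                       # the most variable genes
dim(expr_sel)
```

Ranking with `order` guarantees that exactly `n_keep` genes are kept, whereas a quantile cut-off could keep slightly more or fewer in case of ties.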
Euclidean distance is rarely used in the context of microarray analysis. A distance based on Pearson's correlation coefficient is generally preferred (Spearman's rank correlation coefficient may also be used). Let's visualize the sample-sample correlation matrix using a heatmap.
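The sketch below illustrates this step on simulated data: two groups of five samples, the second group over-expressing 50 genes. The data, the group structure and the colour palette are assumptions made for illustration. Since `cor` works on columns, applying it to a genes x samples matrix directly gives the sample-sample correlations.

```r
# Simulated data: 200 genes x 10 samples, samples 6-10 over-express genes 1-50
set.seed(1)
gene_mean <- rnorm(200, mean = 8, sd = 2)
expr <- matrix(gene_mean + rnorm(200 * 10, sd = 0.3), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:10)))
expr[1:50, 6:10] <- expr[1:50, 6:10] + 3

# Sample-sample Pearson correlation matrix (use method = "spearman" for ranks)
cor_mat <- cor(expr, method = "pearson")

# Heatmap of the correlation matrix, samples kept in their original order
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        col = colorRampPalette(c("blue", "white", "red"))(50))
```

Setting `scale = "none"` matters here: the default row scaling of `heatmap` would distort the correlation values that we want to read directly.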
Pearson's correlation coefficient r is bounded between -1 and 1. It can be transformed into a distance, for example by taking 1 - r, which ranges from 0 (perfect correlation) to 2 (perfect anti-correlation).
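One common transformation, shown here as a sketch on simulated data, is d = 1 - r, converted to a `dist` object with `as.dist`. Other choices exist (for instance (1 - r) / 2, which is bounded between 0 and 1), so this should be read as one option rather than the reference solution.

```r
# Simulated expression matrix (200 genes x 10 samples)
set.seed(1)
gene_mean <- rnorm(200, mean = 8, sd = 2)
expr <- matrix(gene_mean + rnorm(200 * 10, sd = 0.3), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:10)))

cor_mat <- cor(expr, method = "pearson")   # correlations between samples

# 1 - r is 0 for perfectly correlated samples and 2 for anti-correlated ones
sample_dist <- as.dist(1 - cor_mat)

class(sample_dist)
range(sample_dist)
```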
We will now apply the hclust function to this distance matrix to perform hierarchical clustering of the samples.
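A minimal sketch of this step, again on simulated data with two known groups of samples. The average linkage and the cut into two clusters are assumptions chosen for the illustration; `hclust` also offers other agglomeration methods (e.g. "complete", "ward.D2").

```r
# Simulated data: 200 genes x 10 samples, samples 6-10 over-express genes 1-50
set.seed(1)
gene_mean <- rnorm(200, mean = 8, sd = 2)
expr <- matrix(gene_mean + rnorm(200 * 10, sd = 0.3), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:10)))
expr[1:50, 6:10] <- expr[1:50, 6:10] + 3

# Correlation-based distance between samples, then hierarchical clustering
sample_dist <- as.dist(1 - cor(expr, method = "pearson"))
hc <- hclust(sample_dist, method = "average")   # linkage choice is an assumption

# Dendrogram of the samples
plot(hc, main = "Hierarchical clustering of samples", xlab = "", sub = "")

# Cut the tree into two groups and count the samples in each
groups <- cutree(hc, k = 2)
table(groups)
```

On the real dataset, the groups returned by `cutree` could be cross-tabulated with the sample annotations from the phenotypic data file to see whether the clusters match known ALL subtypes.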
R is not particularly well-suited to visualizing classification results for very large datasets. We will thus install the cluster and treeview software, which are very handy for browsing the results of a hierarchical clustering.
# These are shell/bash commands!
mkdir -p ~/bin
cd ~/bin
wget http://bonsai.hgc.jp/~mdehoon/software/cluster/cluster-1.50.tar.gz
tar xvfz cluster-1.50.tar.gz
cd cluster-1.50
./configure --without-x
make
echo -e "\nalias cluster=$PWD/src/cluster" >> ~/.bashrc
cd ..
wget http://sourceforge.net/projects/jtreeview/files/jtreeview/1.1.6r2/TreeView-1.1.6r2-bin.tar.gz
tar xvfz TreeView-1.1.6r2-bin.tar.gz
echo "alias javatreeview='java -jar $PWD/TreeView-1.1.6r2-bin/TreeView.jar'" >> ~/.bashrc
source ~/.bashrc
cd -
Using the cluster software, compute hierarchical clustering both on genes and samples.
Now you can use javatreeview to open the .cdt file that was produced by cluster. How are the samples classified? What can you say about the gene classification?