Distance metrics and clustering

Introduction

Unsupervised classification methods aim at classifying objects without any prior knowledge of the class they belong to. This is a very common task that can be used to discover new structures or classes inside a given set of objects. Indeed, arranging objects based on a given criteria (that one needs to define) may lead to partitions that highlights the presence of objects of various classes. These unsupervized methods are widely used in genomics where objects (expression profiles, biological samples with various descriptor, sequences, motifs, genomic features,…) need to be classified in order to mine large dataset in search for similar objects related to known or novel classes and displaying jointly particular properties. These approaches have been applied in several reference papers in the context of biological sample classification leading to the discovery of novel tumor classes.

The expected results of the classification is a partition. A partition can be defined as the division of a collection of objects into subsets or clusters. In clustering approaches, each cluster is expected to contain very similar object (low within-group variance) while object from different clusters are expected to differ (high inter-group variance). These clusters can correspond to a simple division of the object collection but also to a set of hierarchically nested subsets. In the most popular approaches the clusters produced are non overlapping, meaning that each object is assign to only one cluster. These approaches are called hard (or crisp) clustering methods in contrast to fuzzy clustering (or soft clustering) in which one object can fall into several clusters. Although rather complexe approaches have been proposed, the most popular methods for unsupervised clustering are based on rather simple algorithms that rely on simple mathematical basis. Whatever the method used, one important aspect is to choose a reasonable metric to assess the similarity between object. Although other parameters may strongly influence the partitioning result, the choice of the metric will have a major influence on clustering. They are lots of existing metrics that can be applied to a set of objects. In the next section, we will focus on some of them that are frequently encountered in the context of genomic data analysis.

Choosing a metric

The euclidean distance.

One of the most classical metric is the well-known euclidean distance. Let’s imagine two genes in a two dimensional space that could represent two biological samples. Thus these two genes could be represented as two points \(a\) and \(b\). The euclidean distance between these two genes can be represented in the sample space and corresponds to the physical distance between two points computed using pythagora’s formula with \(i=1,..,p\) corresponding to the samples:

\[d(a,b)=\sqrt{\sum_{i=1}^p (a_{i} - b_{i})^2}\]

We could also propose an alterntive representation in which the points would be the samples and the dimensions would correspond to genes. We could also propose an alternative representation in which the \(x\) axis would represent the samples and the \(y\) axis would represent the intensities. In this case we would represent the profiles of the two genes across the two samples \(s1\) and \(s2\). This three representations are depicted below. In this particular case, the two points, \(a\) and \(b\) for which the distance is to be computed are in a two-dimensional space. However, this distance can be generalized to any space with p dimensions. Let’s take an example with two points (genes) in a eight dimensional space. We will chose the third representation with samples names displayed on \(x\) axis and intensities displayed on \(y\) axis.The value for \(a_{i}^2 - b_{i}^2\) are displayed with dashed lines

# Preparation x window
col.vec <- c("black","red")
op <- par(no.readonly = TRUE)
par(mfrow=c(2,2), cex.main=0.7, mgp=c(2,1,0), mai=c(0.5,0.5,0.5,0.5))

# Genes as points and samples as dimensions
a <- c(1, 4)
b <- c(2, 3)
m <- rbind(a, b)
print(m)

##   [,1] [,2]
## a    1    4
## b    2    3

colnames(m) <- c("s1", "s2")
plot(m , pch=16, xlim=c(0,3), 
     ylim=c(0,5), main="Genes as points and samples as dimensions",
     col=col.vec,
     panel.first=grid(lty=1))
suppressWarnings(arrows(a[1], a[2], b[1], b[2], angle=0))
suppressWarnings(arrows(a[1], a[2], b[1], a[2], angle=0, lty=3))
suppressWarnings(arrows(b[1], a[2], b[1], b[2], angle=0, lty=3))
text(m, lab=rownames(m), pos = 2, offset = 1)
text(7, 4,label="dist(a,b)")


# Sample as points and genes as dimensions
plot(t(m) , pch=16, 
     ylim=c(0,5), xlim=c(0,5), 
     main="Sample as points and genes as dimensions",
     panel.first=grid(lty=1))
text(t(m), lab=colnames(m), pos = 2, offset = 1)


# x axis correspond to samples and the y axis represent the intensities

matplot(t(m) , 
       ylim=c(0,5), xlim=c(0,3), 
       main="x axis for samples (n=2) and y axis for intensities", 
       ylab="Intensities",
       xlab="samples",
       xaxt = "n",
       type="n")
grid(lty=1)

axis(1, 0:4, c("", "s1", "s2", "", ""))
suppressWarnings(arrows(1, a[1], 2, a[2], angle=0))
suppressWarnings(arrows(1, b[1], 2, b[2], angle=0))

matpoints(t(m) , pch=16, 
       col=col.vec)


# 8 dimensions: x axis correspond to samples and the y axis represent the intensities
a <- c(7, 7, 7, 7, 6, 6, 10, 10)
b <- c(8, 2, 5,  6, 1, 6, 1, 4)
m <- rbind(a, b)
matplot(t(m),
        xlim=c(0,10), ylim=c(0,12),
        main="x axis for samples (n=8) and y axis for intensities",
        pch=16,
        col="black",
        ylab="Intensities",
        xlab="samples",
        lty=1,
        type="n")
grid(lty=1)

for(i in 1:length(a)){
  suppressWarnings(arrows(i, a[i] , i, b[i], angle=0, col=col.vec[i], lty=3))
}

matpoints(t(m),
        pch=16,
        type="b",
        lty=1)

points(a, type="p", col=col.vec[1],  pch=16)
points(b, type="p", col=col.vec[2],  pch=16)

Exercise:

Compute the euclidean distance between a and b.

Check that the same result is obtained with the dist() function used with default arguments.

Now assign the values of a to b.

What is the computed distance.

Add successively 10, 20, 30, 40 to each element/dimension of \(b\).

Compute the distance.

What do you observe ?

Set \(b\) values to \(-a + mean(a)\) then a to \(a - mean(a)\).

What is the distance obtained.

Now create a variable c with values \(a + 3\)

What is the distance obtained with a.

Plot the profile of a, b and c.

Is the euclidean distance a good choice for comparing these profiles ? Discuss it. In which context could it be useful ?

<< Hide | Show >>

par(op)

## The distance between a and b
(dist.euc  <- sqrt(sum((a-b)^2)))

## [1] 13.15295

# This can be also computed using the dist() function (requires) a matrix.
(dist.obj <- dist(rbind(a,b), method = "euclidean"))

##          a
## b 13.15295

# dist() return a dist object which can be converted to a matrix
(dist.mat <- as.matrix(dist.obj))

##          a        b
## a  0.00000 13.15295
## b 13.15295  0.00000

# Both lead to the same result
all.equal(dist.euc, dist.mat[1,2])

## [1] TRUE

# Now assign the values of a to b.
b <- a
dist(rbind(a,b), method = "euclidean") # the distance is 0

##   a
## b 0

# Add successively 10, 20, 30, 40 to each element/dimension of $b$.
range <- c(0, 10, 20, 30, 40)
b <- matrix(b, nc=8, nrow=5, byrow = TRUE)
b <- b + range
b <- rbind(b,a)
rownames(b) <- c("b+0","b+10", "b+20", "b+30", "b+40", "a")
plot(NA, xlim=c(1,8), 
     ylim=c(0,max(b)), type="n", 
     panel.first = grid(),
     ylab="Intensities",
     xlab="Samples")
matpoints(t(b), type="b", pch=16, lty=1, lwd=2)

# What we can observed is that although a and b have exactly the same profile 
# their distance increase as the absolute level of b is increasing.
# The euclidean distance has no upper limit and can thus increase indefinitely.
dist.a.b <- as.matrix(dist(b))[1:5,"a"]
text(2, range+2, labels=paste("d(a,", names(dist.a.b), ")=", round(dist.a.b,2), sep=""), cex=0.7, pos=3, offset=2)

# set $b$ values to -a + mean(a) then a to a - mean(a).
b <- -a + mean(a)
a <-  a - mean(a)
dist(rbind(a,b))

##          a
## b 8.485281

# Create vector c with values a + 6

c <- a + 3
dist(rbind(a,c))

##          a
## c 8.485281

matplot(t(rbind(a,b,c)), type="b", pch=1, lty=1, lwd=2, col=c("blue", "orange","red"))
grid()
legend(x="topleft", 
       legend = c("a","b","c"), 
       col=c("blue", "orange","red"), 
       lwd=2)

# As you can see the euclidean distance is unable to distinguish between profile displaying coordinate variations and profiles displaying clear anti-correlation.

Pearson’s correlation coefficient

The Pearson’s correlation coefficient is a widely used statistical score to measure the co-variance of two variables. It is particularly used in the case of hierarchical clustering applied to microarray or RNA-Seq data. The Pearson’s correlation coefficient can be computed by dividing the covariance of the two variables by the product of their standard deviations.

\[r_{x,y}= \frac{1}{n}\sum\limits_{i=1}^n \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_xs_y} = \frac{cov(x,y)}{s_xs_y}\]

Let’s go back to the example we provided above. We had three objects (\(a\), \(b\) and \(c\)) having values associated with a set of variables. These object could be viewed as a small set of genes having given expression values in a set of biological samples. We will construct a new diagram in which we will display for each gene (a, b, and c) through each sample (8 dimensions) the difference to the corresponding mean value (\(x_i - \bar{x}\)). The covariance is the sum of the products of the differences through the eight dimensions. If for in dimension the intensities of two gene goes jointly above their corresponding mean, this product will be positive. In the same way, if their intensities goes jointly above their corresponding mean, this product will be positive. It will be negative or zero in the other cases. The fact that the covariance is divided by the product of the standard deviation will ensure that the Pearson’s \(r\) coefficient is bound between -1 (the more negative corresponding to anti-correlation) and 1 (complete correlation).

col.vec <- c(col.vec, "green")
# The dataset
a <- c(7, 7, 7, 7, 6, 6, 10, 10)
a <- a - mean(a)
b <- -a + mean(a)
c <- a + 3
m <- rbind(a,b,c)

# The mean intensities
a.mean <- mean(a)
b.mean <- mean(b)
c.mean <- mean(c)

# The diagram with the three genes: a, b, c
matplot(t(m),
       main="x axis for samples (n=8) and y axis for intensities",
        pch=16,
        col="black",
        lty=1,
        type="n")
grid(lty=1)

mean.all <- c(a.mean, b.mean, c.mean)
abline(h=mean.all, lty=1, col="darkgrey")

gene.list <- list(a,b,c)

# The differences to the means
for(g in 1:length(gene.list)){
  for(i in 1:length(a)){
    arrows(i, mean.all[g] , i, gene.list[[g]][i], angle=0, lty=2)
  }
}

matpoints(t(m),
        pch=16,
        type="b",
        lty=1)

Exercise:

Compute the Pearson’s correlation coefficient between \(a\) and \(c\) using the formula.

Is that results expected.

Compute it using the cor() function. Is the same result obtained ?

What is the result obtained when comparing \(a\) and \(b\) ?

Is this metric a good choice for comparing two gene expression profile ?

<< Hide | Show >>

n <- 8
# Compute sample standard deviation
sd.a <- sqrt(1/n * sum((a-a.mean)^2))
sd.c <- sqrt(1/n * sum((c-c.mean)^2))

# Compute Co-variance
cov.a.c <- 1/n*sum((a-a.mean) * (c-c.mean))
cor.a.c <- cov.a.c/(sd.a*sd.c)
cor.a.c

## [1] 1

all.equal(cor.a.c, cor(a,c))

## [1] TRUE

# Correlation between a and b. 
# a and b are anti-correlated
cor(a,b)

## [1] -1

This correlation coefficient can be simply transformed into a distance using the following formula:

\[d(x,y) = 1/2(1-r_{x,y})\]

For any gene pair with a Pearson’s correlation coefficient of 1 the corresponding distance will be 0 while those displaying value of -1 will have an associated distance of 1. It should be noted that in contrast to euclidean distance, this distance does not follow the rule of triangle inequality that states that for any triangle the sum of the length of two given sides must be greater than the length of the third side. This type of distance can thus not be used to represent gene distances in a euclidean space.

Spearman’s correlation coefficient \(\rho\)

Spearman’s correlation coefficient can be also particularly useful to assess the similarity of two given expression profile. While this rank-based measure is more robust to outliers than Pearson’s correlation coefficient, it is also less sensitive. While Pearson’s correlation coefficient is constructed based on raw values, Spearman’s correlation coefficient will assess the concordance between the ranks of these values. The formula of the Speaman ’s correlation coefficient is provided below. In this formula \(d\) is the difference between ranks.

\[\rho = {1- \frac {6 \sum d_i^2}{n(n^2 - 1)}}\]

As for the Pearson’s correlation coefficient, \(\rho\) can be transformed to a distance using:

\[d(x,y) = 1/2(1-\rho_{x,y})\]

Comparing Spearman and Pearson correlation coefficient results

Also lot of other metric exist, Spearman and Pearson correlation coefficient (or their associated distance) are frequently a good choice when working with expression data. However, depending on the context, it is advisable to use the appropriate coefficient since their results may greatly differ. One can keep in mind that Pearson will capture linear relationships while Spearman will capture monotonic relationships. Let’s take some example and compute the results of both coefficient.

op <- par(no.readonly = TRUE)
par(mfrow = c(2, 2), mar = c(3, 4, 0, 0.2)) #it goes c(bottom, left, top, right) 
## An example where the result of both approaches are consistant
set.seed(1)
a <- rnorm(20)
b <- a + rnorm(20,sd=0.1)
plot(a,b, pch=16, las=1, panel.first=grid(lty=1))
cor.a.b.pear <- paste("r=", round(cor(a,b), 3))
cor.a.b.spear <- round(cor(a,b, method="spearman"), 3)
cor.a.b.spear <- paste("\u03C1=", cor.a.b.spear, sep="")
legend("topleft", leg=c(cor.a.b.spear, 
                           cor.a.b.pear ))

## Another example where the result of both approaches are consistant
set.seed(1)
a <- rnorm(20)
b <- -a + rnorm(20,sd=0.1)
plot(a,b, pch=16, las=1, panel.first=grid(lty=1))
cor.a.b.pear <- paste("r=", round(cor(a,b), 3))
cor.a.b.spear <- round(cor(a,b, method="spearman"), 3)
cor.a.b.spear <- paste("\u03C1=", cor.a.b.spear, sep="")
legend("topright", leg=c(cor.a.b.spear, 
                           cor.a.b.pear ))

## An example with non linear relationship
a <- 1:50
b <- exp(a)
plot(a,b, pch=16, las=1, panel.first=grid(lty=1))
cor.a.b.pear <- paste("r=", round(cor(a,b), 3))
cor.a.b.spear <- round(cor(a,b, method="spearman"), 3)
cor.a.b.spear <- paste("\u03C1=", cor.a.b.spear, sep="")
legend("topleft", leg=c(cor.a.b.spear, 
                           cor.a.b.pear ))

## An example with an outlier.
## Rank concordance is observed only in one dimension
set.seed(1)
a <- rnorm(10)
b <- rnorm(10)
rank(a) - rank(b)

##  [1] -7  0 -1  9 -3 -1  3  1  1 -2

a[5] <- a[5] + 10
b[5] <- b[5] + 10
plot(a,b, pch=16, las=1, panel.first=grid(lty=1))
cor.a.b.pear <- paste("r=", round(cor(a,b), 3))
cor.a.b.spear <- round(cor(a,b, method="spearman"), 3)
cor.a.b.spear <- paste("\u03C1=", cor.a.b.spear, sep="")
legend("topleft", leg=c(cor.a.b.spear, 
                           cor.a.b.pear ))

Unsupervized clustering of microarray data

Here we will use the GSE13425 experiment which which was retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemia (ALL) that are characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.

Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Informations related to this platform are available on GEO website under identifier GPL96.

Loading data into R

Start R. Have a look at the description of the read.table function. Load the expression matrix (GSE13425_Norm_Whole.txt), the A/P/M matrix (GSE13425_AMP_Whole.txt) and phenotypic data into R using the read.table function (assign the results to objects named data, amp, and pheno respectively).

Name and URL	Description
Normalized data	The file GSE13425_Norm_Whole.txt contains genes as rows and samples as columns. Data were previously normalized using rma algorithm (they are thus transformed in logarithm base 2).
AMP calls	The GSE13425_APM_Whole.txt file contains information about A/P/M calls (genes as rows and samples as columns).
Phenotypic data	The phenoData_GSE13425.txt file contains phenotypic data about samples.

<< Hide | Show >>

#?read.table
data.url <- "http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/GSE13425_Norm_Whole.txt"
amp.url <- "http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/GSE13425_AMP_Whole.txt"
pheno.url <- "http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays/phenoData_GSE13425.tab"
data <-  read.table(data.url,sep="\t", head=T, row=1)
amp <- read.table(amp.url,sep="\t", head=T, row=1)
pheno <- read.table(pheno.url,sep="\t", head=T, row=1)

Selecting a subset of genes

Selecting using the A/P/M criteria

First, we will select genes giving a significant signal in a given number of samples.

Select genes which are call present in at least 10% of the samples (n= 10724 genes).

<< Hide | Show >>

isPresent <- amp == "P"
ind <- rowSums(isPresent) >= 19  
data <- data[ind, ]

Selecting using standard deviation.

As the classification of the whole gene matrix is rather computer intensive we will select 30% of the genes based on standard deviation.

Select these genes.
Change column names so that the new matrix will contains information about sample types.
Write data onto disk (file GSE13425_sub_1.txt).

<< Hide | Show >>

sd <- apply(data,1,sd)
summary(sd)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.08143 0.26830 0.41800 0.47610 0.59400 2.45800

quantile(sd,0.7)

##       70% 
## 0.5517655

data <- data[sd > quantile(sd,0.7),]
colnames(data) <- paste(pheno$Sample_title, colnames(data), sep="| |")
dir.create("~/distance_and_clustering",showWarnings = FALSE)
setwd("~/distance_and_clustering")
write.table(data, "GSE13425_sub_1.txt", sep="\t", quote=F, col.names=NA)

About phenotypic information

Here are the available phenotypic information.

colnames(pheno)

## [1] "Sample.title"               "Sample.source.name.ch1"    
## [3] "Sample.characteristics.ch1" "Sample.description"        
## [5] "sample.labels"              "sample.colors"

We will store some of these information in two vectors:

One contain phenotype title and the GEO IDs as labels.
One contain a set of colors related to the phenotypes and the GEO IDs as labels.

pheno.vec <- as.character(pheno$Sample.title)
names(pheno.vec) <- as.character(rownames(pheno))
pheno.vec.col <- as.character(pheno$sample.colors)
names(pheno.vec.col) <- as.character(rownames(pheno))

Hierarchical clustering with hclust

Euclidean distance is rarely used in the context of microarray analysis. A distance based on Pearson’s correlation coefficient is most generally preferred (Spearman’s rank correlation coefficient may also be used). Let’s visualize the sample-sample correlation matrix using a heatmap.

<< Hide | Show >>

pear <- cor(data, method="pearson")
palette <-colorRampPalette(c("yellow", "black","blueviolet"))
library(lattice)
levelplot(pear,col.regions=palette, scales=list(cex=0.2))

# we can also store the result as a high quality pdf file
pdf("coor.pdf"); levelplot(pear,col.regions=palette, scales=list(cex=0.2)); dev.off()

## quartz_off_screen 
##                 2

The Pearson’s correlation coefficient is bounded between -1 and 1. We can transform it into a distance using the following command:

pear.dist <- as.dist((1-pear)/2)

Using this distance matrix use the hclust function to perform hierarchical clustering of samples.

<< Hide | Show >>

hp <- hclust(pear.dist, method="average")
plot(hp, hang=-1, labels=as.character(pheno$Sample.title), cex=0.2)

# we can also store the result as a high quality pdf file
pdf("hp.pdf")
plot(hp, hang=-1, labels=as.character(pheno$Sample.title), cex=0.2)
dev.off()

## quartz_off_screen 
##                 2

#system("evince hp.pdf&")

The objects resulting from hclust calls can also be used to construct a heatmap. This can be done for instance using the heatmap.3 function. Again, take care to the arguments used to construct the tree.

<< Hide | Show >>

library("GMD")
library("RColorBrewer")

# Median center rows
data.med.center <- sweep(data, MARGIN=1, STATS=apply(data,1,median), FUN = "-",) 
# floor and ceil

data.med.center[ data.med.center > 5 ] <- 5
data.med.center[ data.med.center < -5 ] <- -5

# Compute hierarchical clustering trees for rows and columns
pear.cols <- cor(t(data.med.center), method="pearson")
pear.dist <- as.dist((1-pear.cols)/2)
hc.cols <- hclust(pear.dist, method="average")
pear.rows <- cor(data.med.center, method="pearson")
pear.dist <- as.dist((1-pear.rows)/2)
hc.rows <- hclust(pear.dist, method="average")

# Prepare a function that returns interpolated color sets
col.fun = colorRampPalette(c("green", "black", "red"))



library("GMD")
library("RColorBrewer")

x11(height=14, width=14)
heatmap.3(t(data.med.center), 
          scale="none", 
          RowIndividualColors = pheno.vec.col, 
          cluster.by.col=TRUE, 
          cluster.by.row=TRUE, 
          Rowv=TRUE,
          Colv=TRUE,
          hclust.col = hc.cols, 
          hclust.row = hc.rows, 
          labRow=pheno.vec,
          labCol=NULL,
          cexRow = 0.5,
          color.FUN = col.fun)

## Warning in heatmap.3(t(data.med.center), scale = "none",
## RowIndividualColors = pheno.vec.col, : Discrepancy: row dendrogram is
## asked; Rowv is set to `TRUE'.

## Warning in heatmap.3(t(data.med.center), scale = "none",
## RowIndividualColors = pheno.vec.col, : Discrepancy: col dendrogram is
## asked; Colv is set to `TRUE'.

## 1.dim(x): 190 3217

## heatmap.3 | From GMD 0.3.3, please use relative values for cexRow.

## Preparing `hclust'... Preparing `dendrogram'... Reordering ... 2.dim(x): 190 3217 
## Scaling ... 3.dim(x): 190 3217 
## Making color breaks ... 4.dim(x): 190 3217 
## [1] "kr=2,kc=2"
## 5.dim(x): 190 3217 
## Plotting ... revC: FALSE , revR: FALSE 
## 5b.dim(x): 190 3217 
## 5c.dim(x): 190 3217 
## scale: none 
## 5d.dim(x): 3217 190

## 6.dim(x): 3217 190 
## 7.dim(x): 3217 190

## 8.dim(x): 3217 190 
## DONE!

dev.off()

## quartz_off_screen 
##                 2

However, one can see the limits of using R in this context…

Hierarchical clustering with the cluster and treeview software

R is not particularly well-suited to visualize classification results for very large datasets. We will thus install the cluster and treeview software that are very handy to browse the results of a hierarchical clustering.

<< Hide | Show >>

# These are shell/bash commands !
mkdir -p ~/bin
cd ~/bin
wget http://bonsai.hgc.jp/~mdehoon/software/cluster/cluster-1.50.tar.gz
tar xvfz cluster-1.50.tar.gz
cd cluster-1.50
./configure --without-x
make
echo -e "\nalias cluster=$PWD/src/cluster" >> ~/.bashrc
cd ..
wget http://sourceforge.net/projects/jtreeview/files/jtreeview/1.1.6r2/TreeView-1.1.6r2-bin.tar.gz
tar xvfz TreeView-1.1.6r2-bin.tar.gz
echo "alias javatreeview='java -jar  $PWD/TreeView-1.1.6r2-bin/TreeView.jar'" >> ~/.bashrc
source  ~/.bashrc
cd -

Using the cluster software, run hierarchical clustering on both genes and samples.

<< Hide | Show >>

cluster -f GSE13425_sub_1.txt -g 2 -e 2 -m a  -cg m
javatreeview

Now you can open the .cdt file, that was produced by cluster using javatreeview. How are the samples classified ? What can you say about gene classification ?

Kmeans clustering with R

Look at the tree produced by Treeview. How many classes would you guess from this tree. Use the Kmeans() function from the amap package to perform a kmeans with the number of classes you guessed from the hierachical clustering. Look carefully at the parameters. What is the default distance used here ? Is that the right distance to use in this context ? Use the matplot() function to draw the profiles of each classes. Compare you results with other students (that may have choosen a different number of classes).

<< Hide | Show >>

## Reading the file (if required)
data <- read.table("GSE13425_sub_with_pheno.txt", sep="\t", head=T, row.names = 1)

## Performing Kmeans
library(amap)
k <- 9
km <- Kmeans(data, centers=k, method="pearson", iter.max = 100)

par(mfrow=c(3,3))
for(i in 1:k){

  cluster_k <- data[names(km$cluster[km$cluster==i]),]
  cluster_k <- sweep(cluster_k, 
                     MARGIN = 1, 
                     FUN = "-", 
                     STATS = apply(cluster_k, 1, median))
  matplot(t(cluster_k), 
          type="l", 
          col="gray", 
          ylab="Intensity",
          ylim=c(-8,8),
          main=paste("Cluster ", i, " #gene = ", nrow(cluster_k)))
  matpoints(colMeans(cluster_k), type="l", col="blue")
}

Perform the same clustering using the kmeans version implemented in cluster 3.0. Load the results in Java Treeview.

NB: if required, the filtered dataset can be obtained here.

Distance metrics and clustering

Denis Puthier

2017-11-21

Introduction

Choosing a metric

The euclidean distance.

Pearson’s correlation coefficient

Spearman’s correlation coefficient \(\rho\)

Comparing Spearman and Pearson correlation coefficient results

Unsupervized clustering of microarray data

Loading data into R

Selecting a subset of genes

Selecting using the A/P/M criteria

Selecting using standard deviation.

About phenotypic information

Hierarchical clustering with hclust

Hierarchical clustering with the cluster and treeview software

Kmeans clustering with R