Unsupervised classification methods aim to classify objects without any prior knowledge of the class they belong to. This is a very common task that can be used to discover new structures or classes within a given set of objects. Indeed, arranging objects according to a given criterion (which one needs to define) may lead to partitions that highlight the presence of objects of various classes. These unsupervised methods are widely used in genomics, where objects (expression profiles, biological samples with various descriptors, sequences, motifs, genomic features, …) need to be classified in order to mine large datasets in search of similar objects that relate to known or novel classes and jointly display particular properties. These approaches have been applied in several reference papers in the context of biological sample classification, leading to the discovery of novel tumor classes.
The expected result of the classification is a partition. A partition can be defined as the division of a collection of objects into subsets or clusters. In clustering approaches, each cluster is expected to contain very similar objects (low within-group variance) while objects from different clusters are expected to differ (high inter-group variance). These clusters can correspond to a simple division of the object collection but also to a set of hierarchically nested subsets. In the most popular approaches the clusters produced are non-overlapping, meaning that each object is assigned to only one cluster. These approaches are called hard (or crisp) clustering methods, in contrast to fuzzy (or soft) clustering, in which one object can fall into several clusters. Although rather complex approaches have been proposed, the most popular methods for unsupervised clustering are based on simple algorithms that rest on simple mathematical foundations. Whatever the method used, one important aspect is to choose a reasonable metric to assess the similarity between objects. Although other parameters may strongly influence the partitioning result, the choice of the metric will have a major influence on clustering. Many metrics exist that can be applied to a set of objects. In the next section, we will focus on some of those that are frequently encountered in the context of genomic data analysis.
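To make the notion of a hard partition concrete, here is a minimal sketch on made-up data. The six objects `o1` to `o6` and their coordinates are hypothetical, and the choice of average-linkage hierarchical clustering is only an assumption for illustration; any hard clustering method would return the same kind of result, namely exactly one cluster label per object.

```r
# Toy data (hypothetical): six objects described by two variables,
# built so that o1-o3 are close together and o4-o6 are close together
x <- rbind(o1 = c(1.0, 1.0), o2 = c(1.2, 0.9), o3 = c(0.8, 1.1),
           o4 = c(5.0, 5.0), o5 = c(5.1, 4.8), o6 = c(4.9, 5.2))

# Pairwise Euclidean distances between objects (rows)
d <- dist(x, method = "euclidean")

# Hierarchical clustering gives a set of nested subsets (a tree)...
tree <- hclust(d, method = "average")

# ...and cutting the tree gives a hard partition: one label per object
clusters <- cutree(tree, k = 2)
print(clusters)
```

Each object receives a single integer label, which is what distinguishes a hard partition from a fuzzy one.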
One of the most classical metrics is the well-known Euclidean distance. Let’s imagine two genes in a two-dimensional space whose dimensions represent two biological samples. These two genes can then be represented as two points \(a\) and \(b\). The Euclidean distance between these two genes can be represented in the sample space and corresponds to the physical distance between the two points, computed using Pythagoras’ formula, with \(i=1,\dots,p\) indexing the samples:
\[d(a,b)=\sqrt{\sum_{i=1}^p (a_{i} - b_{i})^2}\]
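As a quick check of the formula, the sketch below computes the distance between the two genes \(a=(1,4)\) and \(b=(2,3)\) used in the figure, first by hand and then with the base R function `dist()`. The variable names `d_manual` and `d_builtin` are introduced here for illustration only.

```r
# Two genes measured in two samples (same values as in the figure)
a <- c(1, 4)
b <- c(2, 3)

# Direct application of the formula: square root of the sum of squared differences
d_manual <- sqrt(sum((a - b)^2))

# Same computation with dist(), which works on the rows of a matrix
d_builtin <- dist(rbind(a, b), method = "euclidean")

print(d_manual)
print(as.numeric(d_builtin))
```

Both values equal \(\sqrt{(1-2)^2 + (4-3)^2} = \sqrt{2}\).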
We could also propose an alternative representation in which the points would be the samples and the dimensions would correspond to genes. A third representation would use the \(x\) axis for the samples and the \(y\) axis for the intensities. In this case we would represent the profiles of the two genes across the two samples \(s1\) and \(s2\). These three representations are depicted below. In this particular case, the two points \(a\) and \(b\) for which the distance is to be computed are in a two-dimensional space. However, this distance can be generalized to any space with \(p\) dimensions. Let’s take an example with two points (genes) in an eight-dimensional space. We will choose the third representation, with sample names displayed on the \(x\) axis and intensities displayed on the \(y\) axis. The differences \(a_{i} - b_{i}\), whose squares are summed in the formula, are displayed with dashed lines.
# Prepare the plotting window (2 x 2 panels)
col.vec <- c("black","red")
op <- par(no.readonly = TRUE)
par(mfrow=c(2,2), cex.main=0.7, mgp=c(2,1,0), mai=c(0.5,0.5,0.5,0.5))
# Genes as points and samples as dimensions
a <- c(1, 4)
b <- c(2, 3)
m <- rbind(a, b)
print(m)
##   [,1] [,2]
## a    1    4
## b    2    3
colnames(m) <- c("s1", "s2")
plot(m , pch=16, xlim=c(0,3),
ylim=c(0,5), main="Genes as points and samples as dimensions",
col=col.vec,
panel.first=grid(lty=1))
suppressWarnings(arrows(a[1], a[2], b[1], b[2], angle=0))
suppressWarnings(arrows(a[1], a[2], b[1], a[2], angle=0, lty=3))
suppressWarnings(arrows(b[1], a[2], b[1], b[2], angle=0, lty=3))
text(m, lab=rownames(m), pos = 2, offset = 1)
# Label placed next to the midpoint of the segment (x = 7 would fall outside xlim)
text(1.5, 3.5, labels="dist(a,b)", pos=4)
# Sample as points and genes as dimensions
plot(t(m) , pch=16,
ylim=c(0,5), xlim=c(0,5),
main="Sample as points and genes as dimensions",
panel.first=grid(lty=1))
text(t(m), lab=colnames(m), pos = 2, offset = 1)
# x axis correspond to samples and the y axis represent the intensities
matplot(t(m) ,
ylim=c(0,5), xlim=c(0,3),
main="x axis for samples (n=2) and y axis for intensities",
ylab="Intensities",
xlab="samples",
xaxt = "n",
type="n")
grid(lty=1)
axis(1, 0:4, c("", "s1", "s2", "", ""))
suppressWarnings(arrows(1, a[1], 2, a[2], angle=0))
suppressWarnings(arrows(1, b[1], 2, b[2], angle=0))
matpoints(t(m) , pch=16,
col=col.vec)
# 8 dimensions: x axis correspond to samples and the y axis represent the intensities
a <- c(7, 7, 7, 7, 6, 6, 10, 10)
b <- c(8, 2, 5, 6, 1, 6, 1, 4)
m <- rbind(a, b)
matplot(t(m),
xlim=c(0,10), ylim=c(0,12),
main="x axis for samples (n=8) and y axis for intensities",
pch=16,
col="black",
ylab="Intensities",
xlab="samples",
lty=1,
type="n")
grid(lty=1)
# Dashed segments show a[i] - b[i]; one fixed colour, since col.vec[i] is NA for i > 2
for(i in seq_along(a)){
suppressWarnings(arrows(i, a[i], i, b[i], angle=0, col="darkgray", lty=3))
}
matpoints(t(m),
pch=16,
type="b",
lty=1)
points(a, type="p", col=col.vec[1], pch=16)
points(b, type="p", col=col.vec[2], pch=16)
# Restore the graphical parameters saved in op
par(op)
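The same computation extends to the eight-dimensional example. The sketch below reuses the vectors `a` and `b` from the last panel and breaks the formula into its steps; the intermediate names `diffs`, `sq` and `d_ab` are introduced here for illustration only.

```r
# Two genes measured in eight samples (same values as in the last panel)
a <- c(7, 7, 7, 7, 6, 6, 10, 10)
b <- c(8, 2, 5, 6, 1, 6, 1, 4)

# Per-sample differences (the dashed segments in the plot)
diffs <- a - b

# Squared differences, then their sum
sq <- diffs^2
print(sum(sq))

# Euclidean distance: square root of the sum of squared differences
d_ab <- sqrt(sum(sq))
print(d_ab)

# Cross-check against the base R implementation
print(as.numeric(dist(rbind(a, b))))
```

The sum of squared differences is 173, so the distance is \(\sqrt{173}\). Large differences in a few samples (here the seventh and eighth) dominate the result, since each difference is squared before summing.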