Home >> Expression documentation

Clustering methods


Available clustering methods in Babelomics are:


UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is an agglomerative hierarchical method (Sneath and Sokal, 1973), which is probably the most popular among the clustering algorithms in the field of microarrays. It is not the more accurate among the methods but is really extensively used (D’haeseler, 2005). Hierarchical clustering starts by calculating all-to-all distance matrix. The two closest patterns are merged and all-against-all distance matrix is calculated again but using the new cluster instead of the two merged patterns. This process is repeated until the complete dendrogram is built.

UPGMA method

In the previous figure hierarchical clustering is represented. A distance matrix all-against-all is obtained (top right of the figure) and the two closest elements (according to the distance function) are clustered. A) Genes B and C are the closest elements (red square in the matrix). B) A new matrix is recalculated with the elements B and C as a new pseudo-element. The distance of this element is obtained as the average distance (and the method is know as the average linkage; other ways of calculating this distance constitute different “flavours” of the method, often named as complete linkage, single linkage, etc.). C) Genes A and D are found to be closest in the distance matrix now and then are merged and the matrix rebuilt again. D) The process ends up when all the elements have been linked.


D’haeseler, P. (2005) How does gene expression clustering work? Nat. Biotech. 23:1499-1501.
Sneath, P. and Sokal, R. (1973) Numerical Taxonomy. W. H. Freeman, San Francisco.


SOTA is a divisive method developed by our group (Dopazo and Carazo, 1997; Herrero et al., 2001), which has recently become popular and has been included in several packages (such as the TMeV). SOTA starts the classification with a binary topology composed of a root node with two leaves. The self-organizing process splits the data (e.g. experimental gene profiles) into two clusters. After reaching convergence at this level, the network is inspected. If the level of variability in one, or more, terminal nodes is over a given threshold, then, the tree grows by expanding these terminal nodes. Two new descendants are generated from this heterogeneous node that becomes internal and does not receive direct updates from the data in future steps. The growth of the network is directed by the resource value that is defined as the mean value of the distances between a node and the expression profiles associated with it. This makes SOTA almost independent on cluster size, unlike SOM (Kohonen 1997).

SOTA method

In the previous figure SOTA method is represented. A) Schematic representation of the SOTA algorithm. The neurons that compose the network are represented as vectors, whose components correspond to the rows (or columns) in the original data matrix. On the top is the network in its initial state with two neurons connected through a mother neuron. In a cycle we repeat as many epoch as necessary to reach convergence (error below a given threshold). When a cycle ends, the network grows up through the node with the highest variability (which do not necessarily correspond to the more populated node). Then this node becomes mother node and generates two new daughter nodes. Then the process is repeated. B), C), D) and E) represent different consecutive steps of growing of the SOTA algorithm.

SOTA structures grow from the root of the tree toward the leaves, producing a representation of the data from lower to higher resolution. If the growth of the network is not restricted, the outcome would be a binary tree which contains only one profile in each leave. Nevertheless, if a threshold of resource value has been fixed, once all terminal nodes fall below such a threshold, the expansion of the binary topology will be stopped. This allows the generation of a hierarchical cluster structure at the desired level of resolution (see Herrero et al., 2001 for details).


Dopazo, J. and Carazo, J.M. (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol, 44, 226-233.
Herrero, J., Valencia, A. and Dopazo, J. (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17, 126-136.
Kohonen, T. (1997) Self-organizing maps. Springer-Verlag, Berlin.


K-means clustering (Hartigan 1979) is a simple and widely used partitioning method for data analysis. Its helpfulness in discovering groups of coexpressed genes has been demonstrated. The number of clusters k in the data is needed as an input for the algorithm, which then initialises the mean vector for each of the k clusters either by hard assignment (e.g., from the input) or random generation. These initial mean vectors are called the seeds. Next, The iterative procedure of the k-means algorithm, consisting of the following two steps, begins. 1) Using the given mean vectors, the algorithm assigns each genes (or experiments) to the cluster with the closest mean vector. 2) The algorithm recalculates the mean vectors (which are the sample means) for all the clusters.

The iterative procedure converges when all the mean vectors of the clusters remain stationary.A big problem associated with k-means algorithm, shared with SOM (Kohonen 1997), is the arbitrariness of predefine of the number of clusters, since it is difficult to predict the number of clusters in advance. In practice, this makes it necessary to use a trial-and-error approach where a comparison and biological validation of several runs of the algorithm with different parameter settings are necessary (Moreau et al., 2002). Another parameter that will influence the result of k-means clustering is the choice of the seeds. The algorithm suffers from the problem of local-minimum. This means that with different seeds, the algorithm can yield very different result. Preferably, the seeds should be chosen close to the centre of the natural clusters. Of course, this is hard to achieve if no prior knowledge about the clusters is available, which is often the case.

K-means method

In the previous figure dynamics of clustering by k-means is represented. Seeds are introduced and the mean change in successive iterations.


Hartigan, J. and Wong, M. (1979) A k-means clustering algorithm. Applied Statistics, 28, 100-108.
Kohonen, T. (1997) Self-organizing maps. Springer-Verlag, Berlin.
Moreau Y., De Smet F., Thijs G., Marchal K., and De Moor B( 2002) Functional Bioinformatics of microarray data: from expression to regulation. Proceedings of the IEEE, 90(11):1722–1743.


Depending on the feature one is interested on different distance functions can be used. There are two main families of distances:

  • euclidean, that depend on the point to point differences and accounts for absolute differences.
  • correlation that accounts for trends.


The previous figure shows the different results we can obtain using the different types of distances. Using an Euclidean distance, B and C patterns will be clustered with preference to A, on the basis of their lower absolute differences. Using a correlation, the two red patterns will be preferentially clustered because of their similar trends. Usually correlation has more biological meaning.

Effect of the standarization

In many cases standardising the data is a good practise. Standardisation (actually normalization in statistical textbooks, but we are call it standarisation therein just to not mix this concept with the concept of normalisation of the array) is a simple procedure by means of which we transform all the data to the same scale by subtracting the mean and dividing by the variance. When data are standardised, Euclidean distances account for trends. That is, there are equivalent to correlations.

Euclidean (normal)

The straight line distance between two points. See euclidean distance.

Euclidean (square)

Squared euclidean distance.

Correlation coefficient (Spearman)

A non-parametric measure of statistical dependence between two variables. The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable). See Spearman's correlation coefficient.

Pearson correlation coefficient

A measure of the linear dependence between two variables X and Y. Is defined as the covariance of the two variables divided by the product of their standard deviations. See Pearson's correlation coefficient.

upgma_method.jpg - UPGMA method (52.9 kB) Alicia Amadoz, 05/21/2010 07:02 am

sota_method.jpg - SOTA method (31.3 kB) Alicia Amadoz, 05/21/2010 07:24 am

kmeans_method.jpg - K-means method (21.1 kB) Alicia Amadoz, 05/21/2010 07:34 am

distances.jpg - distances (7.4 kB) Alicia Amadoz, 05/21/2010 07:59 am