2. BACKGROUND
2.1 Clustering
Traditional classifications of clustering algorithms (Duda et al., 2001; Everitt, 1993; Hastie et al., 2001; Jain et al., 1999) primarily distinguish between hierarchical, partitioning and density-based methods. Here, a somewhat different categorization is used, based on the clustering criterion (implicitly or explicitly) optimized by each algorithm. This permits a better appreciation of the connections between clustering algorithms and cluster-validation techniques. Capturing the intuitive notion of a cluster by means of any explicit, formal definition is one of the fundamental difficulties of clustering (Estivill-Castro, 2002). There are several valid properties that may be ascribed to a good partitioning, but these are partly in conflict and are generally difficult to express in terms of objective functions. Despite this, existing clustering criteria/algorithms do fit broadly into three fundamental categories:
- Compactness. This concept is generally implemented by keeping the intra-cluster variation small. This category includes algorithms like k-means (MacQueen, 1967), average-link agglomerative clustering (Vorhees, 1985), self-organizing maps (SOMs) (Kohonen, 2001) or model-based clustering approaches (McLachlan and Krishman, 1997). The resulting methods tend to be very effective for spherical or well-separated clusters, but they may fail to detect more complicated cluster structures (Duda et al., 2001; Everitt, 1993; Hastie et al., 2001; Jain et al., 1999).
- Connectedness. This is a more local concept of clustering based on the idea that neighbouring data items should share the same cluster. Algorithms implementing this principle are density-based methods (Ankerst et al., 1999; Ester et al., 1996) and methods such as single-link agglomerative clustering (Vorhees, 1985). They are well-suited for the detection of arbitrarily shaped clusters, but can lack robustness when there is little spatial separation between the clusters.
- Spatial separation. Spatial separation on its own is a criterion that gives little guidance during the clustering process and can easily lead to trivial solutions. It is therefore usually combined with other objectives, most notably measures of compactness or balance of cluster sizes. The resulting clustering objectives can be tackled by general-purpose meta-heuristics such as simulated annealing, tabu search and evolutionary algorithms (Bandyopadhyay and Manlik, 2001; Rayward-Smith et al., 1996).
Each of these categories is illustrated in Figure 2.
2.2 Clustering in post-genomic data analysis
Unsupervised classification has many applications in post-genomics. In particular, clustering plays a crucial role in the analysis of gene-expression data (Eisen, 1998; Golub et al., 1999; Quackenbush, 2001; Slonim, 2002). Clustering can also be applied directly to sequence the data, for example to group genes based on shared cis-regulatory regions (Bilu and Linial, 2002). It serves as a data-mining tool to analyse both proteomics and metabolomics data (Goodacre et al., 1998), and can be applied in the context of protein comparison and structure prediction (Kaplan et al., 2004; Krasnogor and Pelta, 2004). Recently, there have been numerous advances in the development of improved clustering techniques for post-genomic data analysis. Prominent examples include biclustering techniques (Madeira and Oliveira, 2004) and gene shaving (Hastie et al., 2000), which have both been specifically designed to deal with the particular challenges posed by gene expression data. Despite such advances, traditional clustering techniques such as hierarchical clustering algorithms (Eisen, 1998), k-means (Tavazoie et al., 1999), fuzzy c-means (Gasch and Eisen, 2002), finite mixture models (Yeung et al., 2001b) and SOMs (Tamayo et al., 1999) remain the predominant methods in post-genomics—a fact that is arguably more owing to their conceptual simplicity and their wide availability in standard software packages than to their intrinsic merits.
2.3 The need for cluster-validation measures in post-genomic data analysis
While cluster analyses are, potentially, a tool to speed up and semi-automate data processing, the majority of cluster analyses carried out on post-genomic data to date are quite far from this end. This is partly owing to the difficulties of the data tackled. Post-genomic data are typically high-dimensional, contain many more variables than samples, have high levels of noise and may have multiple missing values; these properties pose problems to many traditional clustering methods and makes the cluster analysis very challenging. However, there is hardly any consensus on the best distance function, clustering method or method of feature selection to be used for the different types of post-genomic data. As a consequence, it is common practice among researchers to employ a variety of different clustering techniques to analyse a dataset, and to use visual inspection and prior biological knowledge to select what is considered the most ‘appropriate’ result. Clearly, this process of data analysis is highly subjective, and may be a dangerous endeavour. In particular, researchers may unwittingly overrate clusters that reinforce their own assumptions, and ignore surprising or contradictory results. This is, of course, counterproductive to the unspoken aim of unsupervised classification, which is to identify surprising or unexpected patterns in the data that may then serve for hypothesis generation (Kell and Oliver, 2004). Most importantly, while the use of prior biological knowledge and assumptions may be necessary and important in the final interpretation of a cluster analysis, it is not an acceptable means of replacing an unsupervised validation step, in which the significance of individual clusters in terms of the underlying data distribution is verified.
The fact that a validation step is needed follows from the following two issues that arise when using clustering algorithms:
- Bias of clustering algorithms towards particular cluster properties. Clustering algorithms are biased towards partitions that are in accordance with their own clustering criterion. This is at the bottom of the fundamental discrepancies observable between the solutions produced by different algorithms.
- Non-significance of results in the absence of natural clusters. Unsupervised classification relies on the existence of a distinct structure within the data. However, most clustering algorithms return a clustering even in the absence of actual structure, leaving it to the user to detect the lack of significance of the results returned.
Either of the above may lead to a lack of compliance between a partitioning and the underlying data distribution, a situation which can be detected using computational cluster-validation techniques.