Login

Join for Free!
17146 members
table of contents table of contents

Biology Articles » Bioinformatics » Computational cluster validation in post-genomic data analysis » Figures

Figures
- Computational cluster validation in post-genomic data analysis

..................................................

Fig. 1 The three main steps involved in a cluster analysis. The first of these involves a number of data transformations including feature selection, normalization and the choice of a distance function, to ensure that related data items cluster together in the data space. The second step consists of the selection, parameterization and application of one or several clustering methods. The resulting partitionings are evaluated in the third step, and it is at this stage that cluster-validation techniques are needed. The results of a cluster analysis may be crucially affected by the choices made in the first two steps, and information on the quality of the partitioning can (and should) therefore be used to revise these choices.

Figure 1

..................................................

Fig. 2 Examples of datasets exhibiting compactness (A), connectedness (B) and spatial separation (C), respectively. Clearly, connectedness and spatial separation are related (albeit opposite) concepts. In principle, the cluster structure in the datasets B and C can be identified by a clustering algorithm based on either connectedness or on spatial separation, but not by one based on compactness.

Figure 2

..................................................

Fig. 3 Illustration of the linear and non-linear combination of two objectives. Here, the two objectives are to be minimized. On the left-hand side the solid lines indicate lines of equal measure under c = {alpha} · x + ß · y (linear combination), where {alpha} and ß are the relative weights assigned to objectives x and y. On the right-hand side the solid lines indicate lines of equal measure under c = x · y (non-linear combination). In both cases, the solutions on lines/curves closer to the origin are rated higher (c is smaller). A possible Pareto front, that is, the set of solutions that are Pareto optimal with respect to the two objectives, is shown as a dashed line. It can be seen that in both cases Solution A scores best under c, while all other Pareto optimal solutions are overlooked.

Figure 3

..................................................

Fig. 4 Two-dimensional datasets ‘Long’ and ‘Square’, both generated from Normal Distributions. Square contains four clusters with strong overlap that are difficult to detect for algorithms and measures based on connectivity or spatial separation. Long contains two elongated clusters that are difficult to detect for algorithms and measures based on compactness. Note that these two datasets are simplistic and have been selected for demonstration purposes only. While, for these particular data, the problems on Long could be overcome by normalization, it is realistic to assume that this may not always be possible for real datasets (which usually contain several differently shaped clusters).

Figure 4

..................................................

Fig. 5 Illustration of the biases of external validation measures. Shown are the results for k-means, SOM, SOTA, average link and single link on the Long dataset under (left) the F-measure, and (right) the adjusted F-measure (averages over 21 runs). The original F-measure values indicate that both single and average link clearly outperform k-means, SOM and SOTA for k = 2, by a margin of up to 0.2. However, this conceals the fact that for this cluster number all five algorithms have equally failed to identify the correct cluster structure on Long. While k-means, SOM and SOTA have split both clusters in the middle (minimizing variance), both agglomerative clustering algorithms have isolated outliers in one cluster and merged the bulk of the data in the second cluster (see Fig. 6). Only for k ≥ 5 does single link succeed in separating the two core clusters. However, the poor performance of all five algorithms for k = 2 becomes evident in the plot of the adjusted F-measure. In general, the normalization may not only correct the estimated absolute degree of quality, but may also correct the ordering between the solutions obtained for different numbers of clusters (e.g. average link for k = 2 and k = 10) or for different algorithms (e.g. average link and k-meansfor k = 9).

Figure 5

..................................................

Fig. 6 Clustering results on Long for (from left to right) k-means, average link, single link, SOM and SOTA for (top) k = 2 and (bottom) k = 6.

Figure 6

..................................................

Fig. 7 Illustration of the biases of internal validation measures. A simple null model is used: the random control data have been generated from a uniform random distribution within the bounds of the original data. Shown are the results for k-means, SOM, SOTA, average link and single link on the Long dataset under (top left) the Silhouette Width, (top right) the Dunn Index, (bottom left) variance, and (bottom right) connectivity (averages over 21 runs) on the original data and uniformly random control data. The performance curves obtained for both original and uniformly random control data are plotted to permit visual comparison. Considering the Silhouette Widths only, the plot for the original data seems to indicate that k = 2 yields a good partitioning for all five algorithms, with the agglomerative algorithms being the best performers. Moreover, single link is assessed to perform worse than all other methods for all k ≥ 5, a result that stands in obvious contrast to its true performance as verified by the adjusted F-measure (see Fig. 5). Only a comparison with the values obtained for random control data reveals that, for k = 2 and k = 3, k-means, SOM, SOTA and average link do in fact perform worse than for random data. In comparison with its performance on random control data, single link seems to perform well, but it is not possible to derive the correct number of clusters from the plot (there is no peak at k = 5). Interpretation of the other three measures is given in the text.

Figure 7

..................................................

Fig. 8 Illustration of the solution visualization in two-objective space. Shown are the solutions (averages over 21 runs) for k-means, SOM, SOTA, average link and single link on the Long dataset in a plot of connectivity versus variance (both to be minimized). The set of solutions returned by each clustering algorithm is summarized by an attainment surface (Fonseca and Fleming, 1996), which is the boundary in objective space that separates the region dominated by the attained solutions from the region that is not dominated. Owing to the trends of the two objectives (variance decreases for a higher number of clusters, while connectivity increases), the number of clusters in the solutions generally increases from the top left to the bottom right. The correct number of clusters for a given algorithm is expected to show as a strong ‘knee’ in its attainment front. This is because for the correct number of clusters we expect a relatively large drop in variance at little additional cost in connectivity. In the above plot, a clear ‘knee’ can be observed for single link at k = 5.

Figure 8

..................................................

Fig. 9 Stability-based analysis (averages over 21 runs) for (left) the Long and (right) the Square dataset. On Long, the stability-based method is evidently able to estimate the presence of a good single link clustering solution for k ≥ 5 (the stability plot peaks at k = 6), and the absence of good SOM and average link solutions (no significant peak in the stability plot). Yet, the results obtained for k-means, SOM and SOTA are disconcerting. For k-means, the stability-based analysis pinpoints a highly stable six-cluster solution. Further analysis shows that this solution is stable only due to k-means' tendency to converge to spherically shaped clusters, and is in fact highly sub-optimal (see Fig. 6). For SOM and SOTA, the stability-based analysis pinpoints a stable four-cluster solution for the random control data. For Square, where k-means, SOM, SOTA and average link perform comparably well (data not shown), the stability-based analysis correctly identifies the four-cluster solution for average link, k-means, SOM and SOTA (the stability plots for all four algorithms peak at k = 4). However, a comparison with the control curves reveals that, for k-means, SOM and SOTA, a four cluster solution is recommended with comparable confidence for uniformly random data. This shows that the conspicuous stability peak obtained for k-means, SOM and SOTA on the Square data is mainly an artefact of the square shape of the underlying data manifold.

Figure 9

..................................................

Fig. 10 Adjusted Rand Index, Silhouette Width, Dunn Index and stability (averages over 21 runs) for k-means, SOM, SOTA, average link and single link agglomerative clustering on the Leukemia test set. The evaluation under the adjusted Rand Index (comparing to the known class labels) shows that average link, k-means, SOTA and SOM perform robustly on these data. They identify the three main clusters (AML, B-lineage ALL and T-lineage ALL), and assign most of the samples correctly. Naturally, this is knowledge that would not be available in a real-life cluster analysis, and it is therefore interesting to see whether the results under the internal validation measures would have led to the same conclusion. The performance curves under the Silhouette Width clearly indicate the high quality of the three-cluster solution. This result is best seen when comparing with the results obtained under the null model and ‘removing’ the bias towards large numbers of clusters, which arises due to the small size of the dataset: for large numbers of clusters, the resulting partitionings contain many singleton clusters, which score highly under the Silhouette Width. The stability-based technique is less consistent: for k-means and SOM, the performance peak at k = 3 is well pronounced, but it is much weaker for SOTA and average link. Both the Silhouette Width and the stability-based method indicate the lack of structure in the single link solutions. The application of the Dunn Index is somewhat less successful: it fails to predict the insufficiency of single link, and it mis-estimates the number of clusters for average link.

Figure 10

..................................................

Fig. 11 Solution visualization in two-objective space. Shown are the solutions (averages over 21 runs) for k-means, SOM, SOTA, average link and single link on the Leukemia dataset in a plot of connectivity versus variance. The knee corresponding to the three-cluster solution is clearly pronounced. The visualization also shows the consistency among the k-means, SOM, SOTA and average link solutions for k = 2 and k = 3, which further increases the confidence in the correctness of these partitionings.

Figure 11

..................................................
Source: Bioinformatics 2005 21(15):3201-3212.

rating: 7.25 from 4 votes | updated on: 19 Jan 2007 | views: 1219 |

Rate article:







excellent!bad…