Login

Join for Free!
17726 members
table of contents table of contents

Biology Articles » Bioinformatics » Computational cluster validation in post-genomic data analysis » Measure-specific biases

Measure-specific biases
- Computational cluster validation in post-genomic data analysis

4. MEASURE-SPECIFIC BIASES 

In the following section, some of the major factors affecting certain validation techniques are discussed, and are demonstrated on two exemplary two-dimensional datasets (shown in Fig. 4). For the purpose of this study, several advanced techniques are selected from the categories described above. These are the F-measure (takes values in [0,1], to be maximized), the adjusted Rand Index (takes values in [0,1], to be maximized), variance (to be minimized), connectivity (to me minimized), Silhouette Width (takes values in [–1,1], to be maximized), the Dunn Index (to be maximized) and a stability-based method (takes values in [0,1], to be maximized). Experiments are conducted using five different clustering algorithms namely the partitioning method k-means, SOM, the self-organizing tree algorithm (SOTA) and two agglomerative hierarchical algorithms based on the linkage criteria of single link and average link, respectively. Enlarged colour plots and implementation details for the individual measures and algorithms are provided in the Supplementary Material.

 
4.1 Biases of external measures
External validation techniques suffer from biases with respect to the number of clusters, the distribution of cluster sizes and the distribution of class sizes in a partitioning (Halkidi et al., 2001; Hubert, 1985). For example, the completeness of a cluster trivially obtains the maximum possible value of 1 for a one-cluster partitioning and tends to decrease with an increasing number of clusters. On a different line, the F-measure and the Rand Index tend to be overly optimistic in situations where relatively small clusters have been overlooked.

These unwanted effects can be alleviated through normalization by the results expected for random data. The adjusted measure Ea is obtained as Ea=(EEe)/(EmEe), where Ee is the expected value for random data and Em is the maximum attainable value of the index. Ideally, this adjusted index Ea will then be limited to the interval [0,1]. Most commonly, this procedure has been applied to the Rand Index for which the expected value (and thus the adjusted Rand Index) can be computed by exact statistical methods (Hubert, 1985). However, the use of normalization is generally useful for any external measure. In cases where the expected value cannot be statistically derived, it can be approximated using Monte Carlo simulation. For a given partition, a number of ‘random partitions’ are generated, which agree with the original partition in the number of clusters, the distribution of cluster sizes and the distribution of class sizes. Each of these partitions is evaluated under the validation measure considered, and the average value is taken as an approximation of the expected value Ee. The required random partitionings may simply be obtained by permuting the cluster labels in the original partitioning.

This normalization of external indices is crucial to obtain an objective picture of the real, absolute and relative performance of an algorithm. Figure 5 illustrates the degree to which the values returned by the F-measure can be misleading, if un-normalized. In order to facilitate the interpretation of these performance curves, a selection of the actual clustering results is shown in Figure 6.

  
4.2 Biases of internal measures
Just like external measures, most internal measures suffer from biases with regard to the number of clusters. More importantly, internal measures may additionally exhibit biases with regard to the shape of the underlying data manifold and the structure of a partitioning. Some of these biases can be detected through a comparison with the results expected for random data. Computation of the expected value Ie for an internal measure I can be done by Monte Carlo simulation: a number of random control datasets are generated under an appropriate null model. They are then clustered (using the same clustering algorithm as applied to the original data) and the resulting partitions are evaluated under I. The average value obtained is taken as an approximation of the expected value Ie. Importantly, different possibilities for the choice of the null model exist, and the specific model used may have a crucial impact on the final outcome (Gordon, 1999). The main lines of distinction are between data-independent null models (e.g. a Poisson model or a Unimodal model) and data-influenced null models (e.g. an ellipsoidal model) (Gordon, 1999). The success of this technique strongly depends on the ability of the null model employed to capture the shape of the data manifold. Yet, it may help to detect those biases that result from a change in the number of clusters or the shape of the underlying data distribution.

The biases of internal measures are illustrated in Figure 7, where the application of several internal measures on the Long dataset is shown. The results obtained by different measures are only partially consistent, which is owing to several factors. First, for most internal measures that assess cluster compactness or ratios of inter-cluster and intra-cluster distances, the shape of the data manifold in this dataset introduces a bias towards a vertical split. This bias could be identified through the comparison with the results for uniformly random control data. Second, the classification of outliers in their own clusters can have a significant impact on the final result of a performance measure, in particular, if minimum or maximum pairwise distances are taken into account (this is the case for the Dunn Index). This problem can only to a certain degree be tackled by the elimination of outliers prior to clustering. Third, more fundamentally, k-means, SOM, SOTA and average link strive for spherically shaped compact clusters and are likely to perform reasonably well under the Silhouette Width even without the discovery of any cluster structure. Fourth, the clusters in the dataset are elongated and the correct partitioning therefore does not score highly under the Silhouette Width (or any other measure based on cluster compactness). For this reason the good performance of single link for k ≥ 5 is not manifested in the plot of the Silhouette Width.

While underlining the benefits of the comparison with a null model, the above example also makes clear that such a step cannot entirely remove the biases of a measure with regard to particular cluster structures. Owing to the conflict between the assumptions of the Silhouette Width and the real cluster structure, it is impossible to identify the correct number of clusters for single link in the Silhouette plot. The results under variance and connectivity contain clues as to the best clustering solution, but the results are largely dominated by the measures' biases towards particular algorithms, making it hard to arrive at the correct conclusion. In the plot of variance for single link, the approximation of the correct solution (for k = 5) manifests itself in a small ‘knee’—yet, the objective values obtained by k-means, SOM, SOTA and average link are far better. Simultaneously, single link largely outperforms k-means, SOM, SOTA and average link under connectivity, and its best solution manifests itself in a weak plateau in the performance curve—yet, connectivity is clearly strongly biased towards single link, and may not be trustworthy. Without additional knowledge, it will not be clear as to which algorithm and solution to choose.

The above data demonstrates the difficulty of selecting one internal validation measure that permits the objective quantification of a range of conceptually different algorithms. Both clustering algorithms and internal validation measures are based on certain assumptions about the cluster structure, which results in biases of measures with regard to specific algorithms (owing to shared underlying assumptions). An understanding of these biases is therefore crucial to ensure a valid assessment of clustering results by means of internal measures. In particular, it is essential to comprehend the working principles of the algorithms and the evaluation measures used and to select a combination of measures and algorithms that permits one to draw meaningful conclusions.

4.2.1 Complementary validation measures
Evidently, type-4-validation techniques like the Silhouette Width constitute an attempt to combine measures and thereby reduce their individual biases. However, as outlined in Section 3.2.4, these existing methods are restricted to one fixed (linear or non-linear) combination of the two measures and may therefore still exhibit strong biases towards one or the other measure (as seen in the previous example). A more rigorous approach is the independent use of two or three complementary measures and the subsequent visualization of solutions in two- or three-objective space. In principle, such plots can be generated for any pair of measures, but they are particularly useful for the visualization of the results obtained using conflicting measures such as compactness versus separation, or compactness versus connectivity. Figure 8 demonstrates this approach using the measures of variance and connectivity.

This visualization has several advantages over traditional performance curves. First, it allows one to summarize information regarding the algorithms' performance under both internal validity measures. Second, the set of solutions returned by the different algorithms can be automatically reduced using the concept of Pareto optimality, and all Pareto optimal solutions can be identified. Third, it clearly demonstrates the behaviour of the different algorithms with respect to the two objectives. Figure 8 reveals both single link's tendency to isolate singleton clusters (reflected in the lack of improvement in variance) and k-means' tendency to partition the data into equally sized chunks without consideration of the underlying data distribution (reflected in the quick deterioration in connectivity). Moreover, the best solution—generated by single link for k = 5—is identifiable here.

4.3 Biases of stability-based techniques
Stability-based techniques employ a less stringent definition of clustering quality than do traditional internal validation techniques and, therefore, do not suffer from the same biases towards particular algorithms. However, while being reliable indicators of clustering quality in many cases, stability-based techniques may also be misleading under certain circumstances. This predominantly concerns their application to datasets in which the shape of the data manifold causes a given clustering algorithm to converge reliably to certain suboptimal solutions. Under such conditions a clustering may appear stable under re-sampling/perturbation, while not corresponding to the real structure of the dataset. Figure 9 demonstrates these issues on the Long and Square dataset. Further issues with stability-based techniques have been pointed out in (Breckenridge, 2000; Krieger and Green, 1999). 


rating: 7.25 from 4 votes | updated on: 31 Oct 2006 | views: 1443 |

Rate article:







excellent!bad…