



Guidelines for effective cluster validation
- Computational cluster validation in post-genomic data analysis

5. GUIDELINES FOR EFFECTIVE CLUSTER VALIDATION 

In the previous section, the strengths and weaknesses of different validation techniques have been discussed. Two sample datasets were used to demonstrate that the results returned by individual validation techniques can be biased and misleading under certain circumstances, but also that there exist means of detecting several of these biases. Ultimately, despite their imperfections, validation measures do provide significant amounts of information that cannot be obtained using visual inspection alone. Different and complementary validation tools exist, and the use of a set of such tools can minimize the risk of misinterpreting results, and thereby maximize confidence in the results obtained.

Cluster analysis is a complicated, interactive process, which makes it impossible to give an entirely clear-cut prescription for how to cluster data or how to validate the result. In general, the experimental set-up should depend fundamentally on the primary aim of a study. Cluster validation aimed at evaluating a novel algorithm, or at comparing several algorithms, should be quite different from the cluster validation used during the analysis of a novel biological dataset. This section attempts to give some general guidelines on the conduct of an effective cluster validation in both scenarios.

5.1 Cluster validation for the evaluation/comparison of algorithms
When evaluating algorithms, the choice of datasets is a primary issue. Certainly, several datasets should be used, not just one—especially not only the dataset the algorithm was initially developed on. It is fundamental to appreciate that algorithms make different assumptions about the cluster structures, and are, consequently, more or less suited for particular datasets: no single algorithm can therefore be expected to perform well for all types of data (Gordon, 1999). Thus, the aim of any evaluation study should not be to show that a particular algorithm is the best overall, but to show what the particular strengths and weaknesses of a given algorithm are. For this purpose, it is important to test on benchmarks with interesting known data properties. In this scenario, two types of questions are then of interest, which are both essential to understand fully the outcome of an experiment.

  • How well does the algorithm perform on a given dataset? On benchmarks, this type of question can be objectively answered using external cluster validation. The use of adjusted validity measures is preferable.
  • Why is the algorithm not performing well? What is going wrong? Internal validation techniques can be used to highlight these issues, particularly those of Type 1, Type 2 and Type 3, and their combination in Pareto plots, as these have straightforward interpretations in terms of data properties.
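As an illustration of the first question, the sketch below computes one widely used adjusted external measure, the adjusted Rand index, between a known benchmark labelling and a clustering result. The text does not say which adjusted measure to use, so this choice, along with the function and variable names, is an assumption made for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected pair-counting agreement between two partitions."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions agree trivially

    # Contingency counts: items sharing each (true label, predicted label) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together in both, in the true, and in the
    # predicted partition respectively.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / total_pairs  # agreement by chance
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical benchmark: the cluster names differ but the grouping is identical.
known_classes = ["a", "a", "a", "b", "b", "b"]
clustering = [1, 1, 1, 0, 0, 0]
print(adjusted_rand_index(known_classes, clustering))
```

Because the index is corrected for chance, a random partition scores close to zero regardless of the number of clusters, which is what makes adjusted measures easier to compare across algorithms and datasets.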

5.2 Cluster validation for a novel dataset
When clustering a novel biological dataset, cluster validation plays a very different role. A completely objective validation of cluster quality is usually impossible in such a case, but the use of cluster validation at different steps during the clustering process can help to improve the quality of results, and increase the confidence in the final result. Cluster analysis usually involves a first exploratory step, where the data are visualized by projecting them to two or three dimensions (using methods such as principal components analysis or multi-dimensional scaling) in order to check for clustering tendencies. At this stage, a statistical test of clustering tendency (see Section 3.4) may help to quantify the visual impressions obtained.

No entirely reliable method exists to identify the number of clusters in a dataset, and the choice of the best number of clusters may well depend on the clustering method used. A cluster analysis should therefore always be performed for a (sensible) range of different numbers of clusters. Access to such a sequence of solutions is essential to understand the operation of a clustering algorithm and to identify trends in the data.
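A minimal sketch of such a sweep is given below. It uses a plain k-means implementation with random restarts purely as a stand-in for whichever clustering method is being studied; the helper names, the toy data and the choice of within-cluster sum of squares as the reported quantity are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns (labels, within-cluster sum of squares)."""
    centres = rng.sample(points, k)  # initialise on k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda j: squared_distance(p, centres[j]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    wcss = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, wcss


def sweep_cluster_numbers(points, k_values, n_restarts=10, seed=0):
    """Keep the best of several restarts for every candidate number of clusters."""
    rng = random.Random(seed)
    solutions = {}
    for k in k_values:
        runs = [kmeans(points, k, rng) for _ in range(n_restarts)]
        solutions[k] = min(runs, key=lambda run: run[1])
    return solutions


# Toy data: three noisy groups in two dimensions (purely illustrative).
demo_rng = random.Random(1)
points = [(cx + demo_rng.gauss(0, 0.3), cy + demo_rng.gauss(0, 0.3))
          for cx, cy in [(0, 0), (5, 0), (0, 5)]
          for _ in range(20)]

solutions = sweep_cluster_numbers(points, range(1, 7))
for k, (labels, wcss) in solutions.items():
    print(k, round(wcss, 2))
```

Keeping the whole sequence of solutions, rather than only the one that looks best, is what allows trends across k to be inspected and later compared under several validation measures.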

The core cluster analysis should preferably be conducted using several conceptually distinct clustering algorithms, i.e. algorithms that are not biased towards the same type of clusters. Binary external indices can then be used to quantify analytically the similarity between clustering results (including those with different numbers of clusters). If conceptually different algorithms generate highly similar partitions, this is a good indicator that actual structure has been discovered. On the other hand, coinciding clustering results returned by k-means, partitioning around medoids or SOMs are less significant, as these algorithms share many concepts. If the partitions generated by different algorithms are highly dissimilar, this is often an indication of poor structure in the data, and may point to defects in the pre-processing.

In high-dimensional biological data, the structures in the data often cannot be perceived in the full feature space, and a drastic reduction of variables may be necessary in order to reduce the impact of noise (Shaw et al., 1997). This feature selection should preferably be based on unsupervised methods (e.g. by selecting the variables with the highest variation across the dataset). If the features are selected using the knowledge of the real class labels (e.g. by selecting the variables which are best correlated with the known class structure), a subsequent cluster analysis will trivially yield the desired result (even for random data).
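The unsupervised selection rule mentioned above (keep the variables with the highest variation) can be sketched as follows. The function names and the toy matrix are hypothetical, and plain population variance is only one possible reading of "variation". The point of the sketch is that class labels are never passed in, so the selection cannot leak the known class structure into the subsequent cluster analysis.

```python
from statistics import pvariance


def select_high_variance_features(rows, n_keep):
    """Return the indices of the n_keep columns with the largest variance.

    Only the data matrix is consulted; no class labels are used.
    """
    columns = list(zip(*rows))
    variances = [pvariance(col) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:n_keep])


def project(rows, keep):
    """Restrict every sample to the selected feature indices."""
    return [[row[j] for j in keep] for row in rows]


# Toy matrix: rows are samples, columns are variables (illustrative values).
data = [
    [1.0, 10.0, 0.5],
    [1.1, -10.0, 0.4],
    [0.9, 12.0, 3.5],
    [1.0, -11.0, -2.5],
]
keep = select_high_variance_features(data, n_keep=2)
reduced = project(data, keep)
print(keep)  # [1, 2]: the nearly constant first column is dropped
```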

Internal validation measures should be used in addition to the above to provide feedback on the quality of the data and to check whether a given partitioning is justified in terms of the underlying data distribution. Here, it is important to use measures of the different basic types, Type 1, Type 2 and Type 3, and to check how well the solutions perform under each of them. A good clustering solution tends to perform reasonably well under multiple measures (Handl and Knowles, 2005). If a solution performs well only under one of them, this is likely to be an artefact of the biases of the employed algorithm. Type 4 measures and plots in two-objective space may be a valuable tool in identifying solutions that perform consistently well. Given the noisy nature of biological data, robust measures like the Silhouette Width are generally preferable to noise-sensitive measures such as the Dunn Index.
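To make the last point concrete, the following sketch implements the Silhouette Width and one common formulation of the Dunn Index (smallest between-cluster distance divided by largest cluster diameter) and applies both to a toy partition before and after a single noisy point is added. The data, the function names and the Euclidean distance are assumptions chosen for the example.

```python
from itertools import combinations
from math import dist, inf


def group_by_label(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def silhouette_width(points, labels):
    """Mean silhouette value over all points (Euclidean distance)."""
    clusters = group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def dunn_index(points, labels):
    """Smallest between-cluster distance over largest cluster diameter."""
    clusters = list(group_by_label(points, labels).values())
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    min_separation = min(dist(p, q)
                         for a, b in combinations(clusters, 2)
                         for p in a for q in b)
    max_diameter = max(dist(p, q)
                       for members in clusters
                       for p in members for q in members)
    return min_separation / max_diameter if max_diameter > 0 else inf


# Two compact, well-separated clusters ...
clean_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
clean_labels = [0, 0, 0, 0, 1, 1, 1, 1]
# ... and the same partition with one noisy point assigned to cluster 0.
noisy_points = clean_points + [(5, 0.5)]
noisy_labels = clean_labels + [0]

sil_clean = silhouette_width(clean_points, clean_labels)
sil_noisy = silhouette_width(noisy_points, noisy_labels)
dunn_clean = dunn_index(clean_points, clean_labels)
dunn_noisy = dunn_index(noisy_points, noisy_labels)
print(sil_clean, sil_noisy)
print(dunn_clean, dunn_noisy)
```

In this toy case the single added point changes the Silhouette Width only moderately, because it is an average over all points, whereas the Dunn Index falls sharply, because it is determined by one minimum and one maximum distance.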

Owing to the many sources of noise and the high dimensionality of the data, the above internal validation techniques on their own may often be insufficient in biological data analysis. Frequently, the most conspicuous structures in the data are artefacts of experimental factors. On the one hand, cluster analysis can be a valuable tool in identifying such artefacts. On the other hand, the artefacts will ultimately have to be removed if a researcher is interested in biologically meaningful results. Towards this goal, external unary measures can be applied to assess the degree of preservation of replicate relationships, or of prior biological knowledge. This information can then provide additional feedback on the quality of the data and of previous pre-processing steps. A good final clustering result will ideally combine validity under both internal and external measures, i.e., it will exhibit a distinct underlying cluster structure while being consistent with prior biological knowledge.
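One simple way to check the preservation of replicate relationships is to count how many pairs of replicates end up in the same cluster. The score below is a hypothetical illustration of that idea, not one of the specific unary measures referred to in the text; the function name and the sample annotation are assumptions.

```python
from itertools import combinations


def replicate_preservation(cluster_labels, replicate_groups):
    """Fraction of replicate pairs that are assigned to the same cluster.

    cluster_labels[i] is the cluster of sample i; replicate_groups[i] names
    the biological sample that item i is a replicate of.
    """
    if len(cluster_labels) != len(replicate_groups):
        raise ValueError("one cluster label and one replicate group per item")
    members_by_group = {}
    for index, group in enumerate(replicate_groups):
        members_by_group.setdefault(group, []).append(index)

    together = 0
    total = 0
    for members in members_by_group.values():
        for i, j in combinations(members, 2):
            total += 1
            together += cluster_labels[i] == cluster_labels[j]
    if total == 0:
        raise ValueError("no replicate pairs found")
    return together / total


# Hypothetical annotation: three samples measured in duplicate.
replicate_groups = ["s1", "s1", "s2", "s2", "s3", "s3"]
cluster_labels = [0, 0, 1, 1, 1, 0]
score = replicate_preservation(cluster_labels, replicate_groups)
print(score)  # 2 of the 3 replicate pairs share a cluster
```

A low score on data where replicates are expected to agree suggests that the dominant structure found by the clustering is driven by something other than the biological signal, for example by an experimental artefact.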


updated on: 31 Oct 2006
