Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Rousseeuw, P. J.
A new graphical display is proposed for partitioning techniques. Each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation. This silhouette shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters. The entire clustering is displayed by combining the silhouettes into a single plot, allowing an appreciation of the relative quality of the clusters and an overview of the data configuration. The average silhouette width provides an evaluation of clustering validity, and might be used to select an 'appropriate' number of clusters.

1. The need for graphical displays

There are many algorithms for partitioning a set of objects into k clusters, such as the k-means method [6,9,13] and the k-median approach [20]. The result of such a partitioning technique is a list of clusters with their objects, which is not as visually appealing as the dendrograms of hierarchical methods. It is hoped that the graphical display introduced in Section 2 will contribute to the interpretation of cluster analysis results, as illustrated by the examples of Section 3. In Section 4, some other displays will be described.

Suppose there are n objects to be clustered, which may be persons, flowers, cases, statistical variables, or whatever. Clustering algorithms mainly operate on two frequently used input data structures (see [18, Chapters 1 and 2]). The first is to represent the objects by means of a collection of measurements or attributes, such as height, weight, sex, color, and so on. In Tucker's [19] terminology such an objects-by-attributes matrix is called two-mode, since the row and column entities are different. When the measurements are on an interval scale, one can compute the Euclidean distance d(i, j) between any objects i and j. This leads us to the second data structure, namely a collection of proximities which must be available for all pairs of objects.
This corresponds to a one-mode matrix, since the row and column entities are the same set of objects. We shall consider two types of proximities: dissimilarities (which measure how far away two objects are from each other) and similarities

0377-0427/87/$3.50 © 1987, Elsevier Science Publishers B.V. (North-Holland)
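
The step from the two-mode objects-by-attributes matrix to the one-mode matrix of dissimilarities can be illustrated with a minimal sketch. It assumes all attributes are on an interval scale, so that the Euclidean distance d(i, j) applies. The function name `euclidean_dissimilarities` and the sample measurements are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def euclidean_dissimilarities(data):
    """Build a one-mode n-by-n matrix of Euclidean distances d(i, j)
    from a two-mode objects-by-attributes table (one row per object)."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            # Dissimilarities are symmetric, so fill both triangles.
            d[i][j] = d[j][i] = dist
    return d


# Hypothetical interval-scale measurements: (height in cm, weight in kg).
objects = [(180.0, 80.0), (183.0, 84.0), (160.0, 55.0)]
d = euclidean_dissimilarities(objects)
```

The rows and columns of `d` are indexed by the same set of objects, which is what makes the matrix one-mode; the diagonal is zero and the matrix is symmetric. In practice, attributes measured in different units are often standardized before distances are computed, a step this sketch omits.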
