Vinh, N. X., Epps, J., & Bailey, J. Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?
Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clusterings comparison. We observe that the baseline for such measures, i.e., the average value between random partitions of a data set, does not take on a constant value, and tends to have larger variation when the ratio between the number of data points and the number of clusters is small. This effect is similar in some other non-information theoretic based measures such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose the adjusted version for several popular information theoretic based measures. Some examples are given to demonstrate the need and usefulness of the adjusted measures.
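A minimal sketch of the effect the abstract describes, using scikit-learn's implementations of raw mutual information and the chance-adjusted version (AMI): between independent random partitions, raw MI is systematically positive when the data-points-to-clusters ratio is small, while AMI stays near zero. The cluster counts and trial count below are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.metrics import mutual_info_score, adjusted_mutual_info_score

rng = np.random.default_rng(0)
n_points, n_clusters, n_trials = 50, 10, 200  # small points-to-clusters ratio

mi_vals, ami_vals = [], []
for _ in range(n_trials):
    # Two independent random partitions of the same data set
    u = rng.integers(0, n_clusters, size=n_points)
    v = rng.integers(0, n_clusters, size=n_points)
    mi_vals.append(mutual_info_score(u, v))
    ami_vals.append(adjusted_mutual_info_score(u, v))

# Raw MI between unrelated partitions has a clearly positive baseline;
# the adjusted measure averages close to zero.
print(f"mean MI  = {np.mean(mi_vals):.3f}")
print(f"mean AMI = {np.mean(ami_vals):.3f}")
```

Here `adjusted_mutual_info_score` subtracts the expected mutual information under the hypergeometric model of randomness, which is the correction the paper derives.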