Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clustering comparison. We observe that the baseline for such measures, i.e. the average value between random partitions of a data set, does not take on a constant value, and tends to show larger variation when the ratio between the number of data points and the number of clusters is small. A similar effect appears in some non-information theoretic measures, such as the well-known Rand index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information between a pair of clusterings, and then propose adjusted versions of several popular information theoretic based measures. Examples are given to demonstrate the need for and usefulness of the adjusted measures.
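The chance-correction scheme described above can be sketched in pure Python: compute the mutual information between two labelings, compute its expectation under the hypergeometric model (marginal cluster sizes fixed, objects assigned at random), and normalize. This is a minimal illustration under our own naming, not the paper's reference implementation, and it uses the max-entropy normalization as one of the possible variants:

```python
from collections import Counter
from math import comb, log

def entropy(labels):
    """Shannon entropy (nats) of a clustering given as a label list."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_info(u, v):
    """Mutual information (nats) between two clusterings of the same points."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))  # contingency-table cell counts n_ij
    return sum(nij / n * log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())

def expected_mi(u, v):
    """E[MI] under the hypergeometric model of randomness.

    With both sets of marginal cluster sizes held fixed, each cell count
    n_ij follows a hypergeometric distribution; sum the MI contribution of
    every feasible cell value weighted by its probability.
    """
    n = len(u)
    emi = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            # n_ij = 0 contributes nothing, so start the range at 1.
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = comb(ai, nij) * comb(n - ai, bj - nij) / comb(n, bj)
                emi += p * (nij / n) * log(n * nij / (ai * bj))
    return emi

def adjusted_mi(u, v):
    """Chance-adjusted MI: subtract the baseline, then normalize so that
    identical clusterings score 1 (max-entropy normalization variant)."""
    emi = expected_mi(u, v)
    return (mutual_info(u, v) - emi) / (max(entropy(u), entropy(v)) - emi)
```

With this correction, two identical clusterings score exactly 1, while the raw mutual information between independent random partitions is pulled toward 0 by the subtracted baseline. Note that the triple loop in `expected_mi` makes the exact computation costly when there are many clusters.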