Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clustering comparison. We observe that the baseline for such measures, i.e. the average value between random partitions of a data set, does not take on a constant value, and tends to show larger variation when the ratio between the number of data points and the number of clusters is small. A similar effect appears in some non-information theoretic measures, such as the well-known Rand index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information between a pair of clusterings, and then propose adjusted versions of several popular information theoretic based measures. Examples are given to demonstrate the need for and usefulness of the adjusted measures.
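The chance-correction scheme described above can be sketched in pure Python: compute the mutual information between two labelings, compute its expectation under the hypergeometric model (marginal cluster sizes fixed, objects assigned at random), and normalize. This is a minimal illustration under our own naming, not the paper's reference implementation, and it uses the max-entropy normalization as one of the possible variants:

```python
from collections import Counter
from math import comb, log

def entropy(labels):
    """Shannon entropy (nats) of a clustering given as a label list."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_info(u, v):
    """Mutual information (nats) between two clusterings of the same points."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))  # contingency-table cell counts n_ij
    return sum(nij / n * log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())

def expected_mi(u, v):
    """E[MI] under the hypergeometric model of randomness.

    With both sets of marginal cluster sizes held fixed, each cell count
    n_ij follows a hypergeometric distribution; sum the MI contribution of
    every feasible cell value weighted by its probability.
    """
    n = len(u)
    emi = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            # n_ij = 0 contributes nothing, so start the range at 1.
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = comb(ai, nij) * comb(n - ai, bj - nij) / comb(n, bj)
                emi += p * (nij / n) * log(n * nij / (ai * bj))
    return emi

def adjusted_mi(u, v):
    """Chance-adjusted MI: subtract the baseline, then normalize so that
    identical clusterings score 1 (max-entropy normalization variant)."""
    emi = expected_mi(u, v)
    return (mutual_info(u, v) - emi) / (max(entropy(u), entropy(v)) - emi)
```

With this correction, two identical clusterings score exactly 1, while the raw mutual information between independent random partitions is pulled toward 0 by the subtracted baseline. Note that the triple loop in `expected_mi` makes the exact computation costly when there are many clusters.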