Regularization and global optimization in model-based clustering

Regularization and global optimization in model-based clustering. Sampaio, R., Dias Garcia, J., Poggi, M., & Vidal, T. Technical Report arXiv:2302.02450, 2023.


Due to their conceptual simplicity, k-means algorithm variants have been extensively used for unsupervised cluster analysis. However, one main shortcoming of these algorithms is that they essentially fit a mixture of identical spherical Gaussians to data that vastly deviates from such a distribution. In comparison, general Gaussian Mixture Models (GMMs) can fit richer structures but require estimating a quadratic number of parameters per cluster to represent the covariance matrices. This poses two main issues: (i) the underlying optimization problems are challenging due to their larger number of local minima, and (ii) their solutions can overfit the data. In this work, we design search strategies that circumvent both issues. We develop efficient global optimization algorithms for general GMMs, and we combine these algorithms with regularization strategies that avoid overfitting. Through extensive computational analyses, we observe that global optimization or regularization in isolation does not substantially improve cluster recovery. However, combining these techniques permits a completely new level of performance previously unachieved by k-means algorithm variants, unraveling vastly different cluster structures. These results shed new light on the current status quo between GMM and k-means methods and suggest the more frequent use of general GMMs for data exploration. To facilitate such applications, we provide open-source code as well as Julia packages (UnsupervisedClustering.jl and RegularizedCovarianceMatrices.jl) implementing the proposed techniques.
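The "quadratic number of parameters per cluster" mentioned in the abstract can be illustrated with a small count. The sketch below is not taken from the paper or its Julia packages; the function name `covariance_parameters` and the three structure labels are assumptions made for illustration. It relies only on the standard fact that a symmetric d x d covariance matrix has d(d+1)/2 free entries.

```python
def covariance_parameters(d: int, structure: str) -> int:
    """Count the free parameters of one cluster's covariance matrix in dimension d.

    Hypothetical helper for illustration; the structure labels are assumptions.
    """
    if d < 1:
        raise ValueError("dimension must be a positive integer")
    if structure == "spherical":
        # A single variance shared by all coordinates (the k-means-like case).
        return 1
    if structure == "diagonal":
        # One variance per coordinate, no correlations.
        return d
    if structure == "full":
        # Symmetric d x d matrix: d variances plus d(d-1)/2 covariances.
        return d * (d + 1) // 2
    raise ValueError(f"unknown structure: {structure}")


if __name__ == "__main__":
    for d in (2, 10, 100):
        print(
            d,
            covariance_parameters(d, "spherical"),
            covariance_parameters(d, "diagonal"),
            covariance_parameters(d, "full"),
        )
```

The count for the full structure grows quadratically with the dimension, which is consistent with the abstract's concern that unregularized general GMMs can overfit when a cluster has few points relative to its dimension.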

@techreport{Sampaio2023,
abstract = {Due to their conceptual simplicity, k-means algorithm variants have been extensively used for unsupervised cluster analysis. However, one main shortcoming of these algorithms is that they essentially fit a mixture of identical spherical Gaussians to data that vastly deviates from such a distribution. In comparison, general Gaussian Mixture Models (GMMs) can fit richer structures but require estimating a quadratic number of parameters per cluster to represent the covariance matrices. This poses two main issues: (i) the underlying optimization problems are challenging due to their larger number of local minima, and (ii) their solutions can overfit the data. In this work, we design search strategies that circumvent both issues. We develop efficient global optimization algorithms for general GMMs, and we combine these algorithms with regularization strategies that avoid overfitting. Through extensive computational analyses, we observe that global optimization or regularization in isolation does not substantially improve cluster recovery. However, combining these techniques permits a completely new level of performance previously unachieved by k-means algorithm variants, unraveling vastly different cluster structures. These results shed new light on the current status quo between GMM and k-means methods and suggest the more frequent use of general GMMs for data exploration. To facilitate such applications, we provide open-source code as well as Julia packages (UnsupervisedClustering.jl and RegularizedCovarianceMatrices.jl) implementing the proposed techniques.},
archivePrefix = {arXiv},
arxivId = {2302.02450},
author = {Sampaio, R.A. and {Dias Garcia}, J. and Poggi, M. and Vidal, T.},
eprint = {2302.02450},
institution = {arXiv:2302.02450},
title = {{Regularization and global optimization in model-based clustering}},
url = {https://arxiv.org/pdf/2302.02450.pdf},
year = {2023}
}
