Scalable and efficient learning from crowds with Gaussian processes

Scalable and efficient learning from crowds with Gaussian processes. Morales-Álvarez, P., Ruiz, P., Santos-Rodríguez, R., Molina, R., & Katsaggelos, A. K. Information Fusion, 52:110–127, December 2019.


Over the last few years, multiply-annotated data has become a very popular source of information. Online platforms such as Amazon Mechanical Turk have revolutionized the labelling process needed for any classification task, sharing the effort between a number of annotators (instead of the classical single expert). This crowdsourcing approach has introduced new challenging problems, such as handling disagreements on the annotated samples or combining the unknown expertise of the annotators. Probabilistic methods, such as Gaussian Processes (GP), have proven successful to model this new crowdsourcing scenario. However, GPs do not scale up well with the training set size, which makes them prohibitive for medium-to-large datasets (beyond 10K training instances). This constitutes a serious limitation for current real-world applications. In this work, we introduce two scalable and efficient GP-based crowdsourcing methods that allow for processing previously-prohibitive datasets. The first one is an efficient and fast approximation to GP with squared exponential (SE) kernel. The second allows for learning a more flexible kernel at the expense of a heavier training (but still scalable to large datasets). Since the latter is not a GP-SE approximation, it can be also considered as a whole new scalable and efficient crowdsourcing method, useful for any dataset size. Both methods use Fourier features and variational inference, can predict the class of new samples, and estimate the expertise of the involved annotators. A complete experimentation compares them with state-of-the-art probabilistic approaches in synthetic and real crowdsourcing datasets of different sizes. They stand out as the best performing approach for large scale problems. Moreover, the second method is competitive with the current state-of-the-art for small datasets.
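The abstract states that both methods rely on Fourier features and that the first approximates a GP with an SE kernel. As a rough illustration of that idea only, the sketch below uses the generic random Fourier features construction to approximate an SE kernel. It is not the authors' implementation: the function names, the lengthscale parameterization, and the choice of cosine features with random phases are all assumptions made for this example, and the crowdsourcing model and variational inference are not covered.

```python
import math
import random


def se_kernel(x, y, lengthscale=1.0):
    """Exact SE kernel k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def sample_fourier_params(dim, num_features, lengthscale=1.0, seed=0):
    """Draw frequencies w ~ N(0, I / lengthscale^2) and phases b ~ U(0, 2*pi).

    Hypothetical helper for this sketch; not taken from the paper.
    """
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return freqs, phases


def fourier_features(x, freqs, phases):
    """Map x to z(x), with z_j(x) = sqrt(2 / D) * cos(w_j . x + b_j)."""
    scale = math.sqrt(2.0 / len(freqs))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(freqs, phases)]


def approx_kernel(x, y, freqs, phases):
    """Approximate k(x, y) by the inner product z(x) . z(y)."""
    zx = fourier_features(x, freqs, phases)
    zy = fourier_features(y, freqs, phases)
    return sum(a * b for a, b in zip(zx, zy))


if __name__ == "__main__":
    freqs, phases = sample_fourier_params(dim=2, num_features=5000)
    x, y = [0.3, -0.2], [0.1, 0.4]
    print("exact :", se_kernel(x, y))
    print("approx:", approx_kernel(x, y, freqs, phases))
```

With D features the kernel matrix is replaced by an N x D feature matrix, so the cost of working with it grows linearly in the number of training points N rather than cubically, which is the usual motivation for this kind of approximation.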

@article{Pablo2019,
abstract = {Over the last few years, multiply-annotated data has become a very popular source of information. Online platforms such as Amazon Mechanical Turk have revolutionized the labelling process needed for any classification task, sharing the effort between a number of annotators (instead of the classical single expert). This crowdsourcing approach has introduced new challenging problems, such as handling disagreements on the annotated samples or combining the unknown expertise of the annotators. Probabilistic methods, such as Gaussian Processes (GP), have proven successful to model this new crowdsourcing scenario. However, GPs do not scale up well with the training set size, which makes them prohibitive for medium-to-large datasets (beyond 10K training instances). This constitutes a serious limitation for current real-world applications. In this work, we introduce two scalable and efficient GP-based crowdsourcing methods that allow for processing previously-prohibitive datasets. The first one is an efficient and fast approximation to GP with squared exponential (SE) kernel. The second allows for learning a more flexible kernel at the expense of a heavier training (but still scalable to large datasets). Since the latter is not a GP-SE approximation, it can be also considered as a whole new scalable and efficient crowdsourcing method, useful for any dataset size. Both methods use Fourier features and variational inference, can predict the class of new samples, and estimate the expertise of the involved annotators. A complete experimentation compares them with state-of-the-art probabilistic approaches in synthetic and real crowdsourcing datasets of different sizes. They stand out as the best performing approach for large scale problems. Moreover, the second method is competitive with the current state-of-the-art for small datasets.},
author = {Morales-{\'{A}}lvarez, Pablo and Ruiz, Pablo and Santos-Rodr{\'{i}}guez, Ra{\'{u}}l and Molina, Rafael and Katsaggelos, Aggelos K.},
doi = {10.1016/j.inffus.2018.12.008},
issn = {1566-2535},
journal = {Information Fusion},
keywords = {Bayesian modelling,Classification,Fourier features,Gaussian processes,Scalable crowdsourcing,Variational inference},
month = dec,
pages = {110--127},
title = {{Scalable and efficient learning from crowds with Gaussian processes}},
url = {https://linkinghub.elsevier.com/retrieve/pii/S1566253518304664},
volume = {52},
year = {2019}
}

