Stacked Generalization for Overlapping Asymmetric Datasets. McTeer, M. & Missier, P. In Ordonez, C., Sperlì, G., Masciari, E., & Bellatreche, L., editors, Model and Data Engineering, pages 38–52, Cham, 2025. Springer Nature Switzerland.

In the context of training sets for Machine Learning, we use the term Overlapping Asymmetric Datasets (OADs) to refer to a combination of data shapes where a large number of observations (Vertical data, $\mathcal{V}$) are described using only few features (x), and a small subset of the observations (Horizontal data, $\mathcal{H}$) are described by a larger number of features (x plus some new z). A common example of such a combination is a healthcare dataset where the majority of patients are described using a baseline set of clinical and socio-demographic features, and a handful of those patients have a richer characterisation, having undergone further testing. Given a classification task, a model trained solely on $\mathcal{H}$ will benefit from the many features, but its performance will be limited by a small training set size. In this paper we study the problem of maximising model performance on $\mathcal{H}$ by leveraging the additional information available from $\mathcal{V}$. Our approach is based on the notions of stacked generalization and meta-learning, where the predictions generated by an ensemble of weak classifiers for $\mathcal{V}$ are fed into a second-tier meta-learner, where the z features are also used. We conduct extensive experiments to explore the benefits of this approach over a range of dataset configurations. The results suggest that stacking improves model performance, while using the z features provides only modest improvements. This may have practical implications, as it suggests that in some settings the effort involved in acquiring the additional z features is not always justified.
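The two-tier setup described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed choices, not the authors' implementation: the data are synthetic, the base learners (logistic regression, a shallow decision tree, naive Bayes) and the logistic-regression meta-learner are placeholders, and the base learners are trained on the part of $\mathcal{V}$ disjoint from $\mathcal{H}$, one simple way to keep $\mathcal{H}$'s labels out of the tier-1 models.

# Hedged sketch of stacked generalization for OADs, per the abstract.
# Sizes, feature splits, and learner choices are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic population: 10 features, of which x = first 4, z = last 6.
X_all, y_all = make_classification(n_samples=2000, n_features=10,
                                   n_informative=8, random_state=0)
x_cols, z_cols = slice(0, 4), slice(4, 10)

# H: a small overlapping subset that also has the z features.
h_idx = rng.choice(len(X_all), size=200, replace=False)
v_mask = np.ones(len(X_all), bool)
v_mask[h_idx] = False

X_v, y_v = X_all[v_mask][:, x_cols], y_all[v_mask]       # V sees x only
X_h_x, X_h_z = X_all[h_idx][:, x_cols], X_all[h_idx][:, z_cols]
y_h = y_all[h_idx]

# Tier 1: an ensemble of weak classifiers trained on V (x features only).
base_learners = [LogisticRegression(max_iter=1000),
                 DecisionTreeClassifier(max_depth=3, random_state=0),
                 GaussianNB()]
for clf in base_learners:
    clf.fit(X_v, y_v)

# Their predicted probabilities on H's rows become meta-features,
# concatenated with the z features that exist only for H.
meta_X = np.column_stack(
    [clf.predict_proba(X_h_x)[:, 1] for clf in base_learners] + [X_h_z])

# Tier 2: the meta-learner is trained and evaluated on H alone.
X_tr, X_te, y_tr, y_te = train_test_split(meta_X, y_h, test_size=0.3,
                                          random_state=0)
meta = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"meta-learner accuracy on held-out H: {meta.score(X_te, y_te):.3f}")

Dropping the `[X_h_z]` term from `meta_X` gives the stacking-only baseline, which is the comparison behind the abstract's observation that the z features add only modest gains.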
@InProceedings{10.1007/978-3-031-87719-3_3,
author="McTeer, Matthew
and Missier, Paolo",
editor="Ordonez, Carlos
and Sperl{\`i}, Giancarlo
and Masciari, Elio
and Bellatreche, Ladjel",
title="Stacked Generalization for Overlapping Asymmetric Datasets",
booktitle="Model and Data Engineering",
year="2025",
publisher="Springer Nature Switzerland",
address="Cham",
pages="38--52",
abstract="In the context of training sets for Machine Learning, we use the term Overlapping Asymmetric Datasets (OADs) to refer to a combination of data shapes where a large number of observations (Vertical data, {\$}{\$}{\backslash}mathcal {\{}V{\}}{\$}{\$}V) are described using only few features (x), and a small subset of the observations (Horizontal data, {\$}{\$}{\backslash}mathcal {\{}H{\}}{\$}{\$}H) are described by a larger number of features (x plus some new z). A common example of such a combination is a healthcare dataset where the majority of patients are described using a baseline set of clinical and socio-demographic features, and a handful of those patients have a richer characterisation, having undergone further testing . Given a classification task, a model trained solely on {\$}{\$}{\backslash}mathcal {\{}H{\}}{\$}{\$}Hwill benefit from the many features, but its performance will be limited by a small training set size . In this paper we study the problem of maximising model performance on {\$}{\$}{\backslash}mathcal {\{}H{\}}{\$}{\$}H, by leveraging the additional information available from {\$}{\$}{\backslash}mathcal {\{}V{\}}{\$}{\$}V. Our approach is based on the notions of stacked generalization and meta-learning, where the predictions generated by an ensemble of weak classifiers for {\$}{\$}{\backslash}mathcal {\{}V{\}}{\$}{\$}Vare fed into a second-tier meta-learner, where the z features are also used. We conduct extensive experiments to explore the benefits of this approach over a range of dataset configurations. The results suggest that stacking improves model performance, while using z features only provides modest improvements. This may have practical implications as it suggests that in some settings, the effort involved in acquiring the additional z features is not always justified.",
isbn="978-3-031-87719-3"
}
{"_id":"BJBr5BCoffrNWr9X4","bibbaseid":"mcteer-missier-stackedgeneralizationforoverlappingasymmetricdatasets-2025","author_short":["McTeer, M.","Missier, P."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","author":[{"propositions":[],"lastnames":["McTeer"],"firstnames":["Matthew"],"suffixes":[]},{"propositions":[],"lastnames":["Missier"],"firstnames":["Paolo"],"suffixes":[]}],"editor":[{"propositions":[],"lastnames":["Ordonez"],"firstnames":["Carlos"],"suffixes":[]},{"firstnames":[],"propositions":[],"lastnames":["Sperlì, Giancarlo"],"suffixes":[]},{"propositions":[],"lastnames":["Masciari"],"firstnames":["Elio"],"suffixes":[]},{"propositions":[],"lastnames":["Bellatreche"],"firstnames":["Ladjel"],"suffixes":[]}],"title":"Stacked Generalization for Overlapping Asymmetric Datasets","booktitle":"Model and Data Engineering","year":"2025","publisher":"Springer Nature Switzerland","address":"Cham","pages":"38–52","abstract":"In the context of training sets for Machine Learning, we use the term Overlapping Asymmetric Datasets (OADs) to refer to a combination of data shapes where a large number of observations (Vertical data, $}{$\\mathcal \\V\\$}{$V) are described using only few features (x), and a small subset of the observations (Horizontal data, $}{$\\mathcal \\H\\$}{$H) are described by a larger number of features (x plus some new z). A common example of such a combination is a healthcare dataset where the majority of patients are described using a baseline set of clinical and socio-demographic features, and a handful of those patients have a richer characterisation, having undergone further testing . Given a classification task, a model trained solely on $}{$\\mathcal \\H\\$}{$Hwill benefit from the many features, but its performance will be limited by a small training set size . In this paper we study the problem of maximising model performance on $}{$\\mathcal \\H\\$}{$H, by leveraging the additional information available from $}{$\\mathcal \\V\\$}{$V. Our approach is based on the notions of stacked generalization and meta-learning, where the predictions generated by an ensemble of weak classifiers for $}{$\\mathcal \\V\\$}{$Vare fed into a second-tier meta-learner, where the z features are also used. We conduct extensive experiments to explore the benefits of this approach over a range of dataset configurations. The results suggest that stacking improves model performance, while using z features only provides modest improvements. 
This may have practical implications as it suggests that in some settings, the effort involved in acquiring the additional z features is not always justified.","isbn":"978-3-031-87719-3","bibtex":"@InProceedings{10.1007/978-3-031-87719-3_3,\nauthor=\"McTeer, Matthew\nand Missier, Paolo\",\neditor=\"Ordonez, Carlos\nand Sperl{\\`i}, Giancarlo\nand Masciari, Elio\nand Bellatreche, Ladjel\",\ntitle=\"Stacked Generalization for Overlapping Asymmetric Datasets\",\nbooktitle=\"Model and Data Engineering\",\nyear=\"2025\",\npublisher=\"Springer Nature Switzerland\",\naddress=\"Cham\",\npages=\"38--52\",\nabstract=\"In the context of training sets for Machine Learning, we use the term Overlapping Asymmetric Datasets (OADs) to refer to a combination of data shapes where a large number of observations (Vertical data, {\\$}{\\$}{\\backslash}mathcal {\\{}V{\\}}{\\$}{\\$}V) are described using only few features (x), and a small subset of the observations (Horizontal data, {\\$}{\\$}{\\backslash}mathcal {\\{}H{\\}}{\\$}{\\$}H) are described by a larger number of features (x plus some new z). A common example of such a combination is a healthcare dataset where the majority of patients are described using a baseline set of clinical and socio-demographic features, and a handful of those patients have a richer characterisation, having undergone further testing . Given a classification task, a model trained solely on {\\$}{\\$}{\\backslash}mathcal {\\{}H{\\}}{\\$}{\\$}Hwill benefit from the many features, but its performance will be limited by a small training set size . In this paper we study the problem of maximising model performance on {\\$}{\\$}{\\backslash}mathcal {\\{}H{\\}}{\\$}{\\$}H, by leveraging the additional information available from {\\$}{\\$}{\\backslash}mathcal {\\{}V{\\}}{\\$}{\\$}V. Our approach is based on the notions of stacked generalization and meta-learning, where the predictions generated by an ensemble of weak classifiers for {\\$}{\\$}{\\backslash}mathcal {\\{}V{\\}}{\\$}{\\$}Vare fed into a second-tier meta-learner, where the z features are also used. We conduct extensive experiments to explore the benefits of this approach over a range of dataset configurations. The results suggest that stacking improves model performance, while using z features only provides modest improvements. This may have practical implications as it suggests that in some settings, the effort involved in acquiring the additional z features is not always justified.\",\nisbn=\"978-3-031-87719-3\"\n}\n\n\n\n","author_short":["McTeer, M.","Missier, P."],"editor_short":["Ordonez, C.","Sperlì, Giancarlo","Masciari, E.","Bellatreche, L."],"key":"10.1007/978-3-031-87719-3_3","id":"10.1007/978-3-031-87719-3_3","bibbaseid":"mcteer-missier-stackedgeneralizationforoverlappingasymmetricdatasets-2025","role":"author","urls":{},"metadata":{"authorlinks":{}}},"bibtype":"inproceedings","biburl":"https://bibbase.org/f/MTSG9SdhWPisKNpZX/MyPublications-bibbase.bib","dataSources":["ze2X9uz8Dcv2oGipf","afppXLgSuddAzAL9e"],"keywords":[],"search_terms":["stacked","generalization","overlapping","asymmetric","datasets","mcteer","missier"],"title":"Stacked Generalization for Overlapping Asymmetric Datasets","year":2025}