Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora. Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., & Cotterell, R. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1–6, Singapore, 2023. Association for Computational Linguistics.
Paper: https://aclanthology.org/2023.conll-babylm.1
DOI: 10.18653/v1/2023.conll-babylm.1

Abstract: Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.
@inproceedings{warstadt_findings_2023,
address = {Singapore},
title = {Findings of the {BabyLM} {Challenge}: {Sample}-{Efficient} {Pretraining} on {Developmentally} {Plausible} {Corpora}},
shorttitle = {Findings of the {BabyLM} {Challenge}},
url = {https://aclanthology.org/2023.conll-babylm.1},
doi = {10.18653/v1/2023.conll-babylm.1},
abstract = {Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.},
language = {en},
urldate = {2024-01-16},
booktitle = {Proceedings of the {BabyLM} {Challenge} at the 27th {Conference} on {Computational} {Natural} {Language} {Learning}},
publisher = {Association for Computational Linguistics},
author = {Warstadt, Alex and Mueller, Aaron and Choshen, Leshem and Wilcox, Ethan and Zhuang, Chengxu and Ciro, Juan and Mosquera, Rafael and Paranjabe, Bhargavi and Williams, Adina and Linzen, Tal and Cotterell, Ryan},
year = {2023},
pages = {1--6},
}
{"_id":"dGCj2bbf43MP5G8z9","bibbaseid":"warstadt-mueller-choshen-wilcox-zhuang-ciro-mosquera-paranjabe-etal-findingsofthebabylmchallengesampleefficientpretrainingondevelopmentallyplausiblecorpora-2023","author_short":["Warstadt, A.","Mueller, A.","Choshen, L.","Wilcox, E.","Zhuang, C.","Ciro, J.","Mosquera, R.","Paranjabe, B.","Williams, A.","Linzen, T.","Cotterell, R."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","address":"Singapore","title":"Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora","shorttitle":"Findings of the BabyLM Challenge","url":"https://aclanthology.org/2023.conll-babylm.1","doi":"10.18653/v1/2023.conll-babylm.1","abstract":"Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. 
Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.","language":"en","urldate":"2024-01-16","booktitle":"Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning","publisher":"Association for Computational Linguistics","author":[{"propositions":[],"lastnames":["Warstadt"],"firstnames":["Alex"],"suffixes":[]},{"propositions":[],"lastnames":["Mueller"],"firstnames":["Aaron"],"suffixes":[]},{"propositions":[],"lastnames":["Choshen"],"firstnames":["Leshem"],"suffixes":[]},{"propositions":[],"lastnames":["Wilcox"],"firstnames":["Ethan"],"suffixes":[]},{"propositions":[],"lastnames":["Zhuang"],"firstnames":["Chengxu"],"suffixes":[]},{"propositions":[],"lastnames":["Ciro"],"firstnames":["Juan"],"suffixes":[]},{"propositions":[],"lastnames":["Mosquera"],"firstnames":["Rafael"],"suffixes":[]},{"propositions":[],"lastnames":["Paranjabe"],"firstnames":["Bhargavi"],"suffixes":[]},{"propositions":[],"lastnames":["Williams"],"firstnames":["Adina"],"suffixes":[]},{"propositions":[],"lastnames":["Linzen"],"firstnames":["Tal"],"suffixes":[]},{"propositions":[],"lastnames":["Cotterell"],"firstnames":["Ryan"],"suffixes":[]}],"year":"2023","pages":"1–6","bibtex":"@inproceedings{warstadt_findings_2023,\n\taddress = {Singapore},\n\ttitle = {Findings of the {BabyLM} {Challenge}: {Sample}-{Efficient} {Pretraining} on {Developmentally} {Plausible} {Corpora}},\n\tshorttitle = {Findings of the {BabyLM} {Challenge}},\n\turl = {https://aclanthology.org/2023.conll-babylm.1},\n\tdoi = {10.18653/v1/2023.conll-babylm.1},\n\tabstract = {Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. 
Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.},\n\tlanguage = {en},\n\turldate = {2024-01-16},\n\tbooktitle = {Proceedings of the {BabyLM} {Challenge} at the 27th {Conference} on {Computational} {Natural} {Language} {Learning}},\n\tpublisher = {Association for Computational Linguistics},\n\tauthor = {Warstadt, Alex and Mueller, Aaron and Choshen, Leshem and Wilcox, Ethan and Zhuang, Chengxu and Ciro, Juan and Mosquera, Rafael and Paranjabe, Bhargavi and Williams, Adina and Linzen, Tal and Cotterell, Ryan},\n\tyear = {2023},\n\tpages = {1--6},\n}\n\n\n\n","author_short":["Warstadt, A.","Mueller, A.","Choshen, L.","Wilcox, E.","Zhuang, C.","Ciro, J.","Mosquera, R.","Paranjabe, B.","Williams, A.","Linzen, T.","Cotterell, R."],"key":"warstadt_findings_2023","id":"warstadt_findings_2023","bibbaseid":"warstadt-mueller-choshen-wilcox-zhuang-ciro-mosquera-paranjabe-etal-findingsofthebabylmchallengesampleefficientpretrainingondevelopmentallyplausiblecorpora-2023","role":"author","urls":{"Paper":"https://aclanthology.org/2023.conll-babylm.1"},"metadata":{"authorlinks":{}}},"bibtype":"inproceedings","biburl":"https://bibbase.org/zotero-group/schulzkx/5158478","dataSources":["JFDnASMkoQCjjGL8E"],"keywords":[],"search_terms":["findings","babylm","challenge","sample","efficient","pretraining","developmentally","plausible","corpora","warstadt","mueller","choshen","wilcox","zhuang","ciro","mosquera","paranjabe","williams","linzen","cotterell"],"title":"Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora","year":2023}