Trained on 100 million words and still in shape: BERT meets British National Corpus. Samuel, D., Kutuzov, A., Øvrelid, L., & Velldal, E. March 2023. arXiv:2303.09859 [cs]. 0 citations (Semantic Scholar/arXiv) [2023-05-05].
Abstract: While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
@misc{samuel_trained_2023,
title = {Trained on 100 million words and still in shape: {BERT} meets {British} {National} {Corpus}},
shorttitle = {Trained on 100 million words and still in shape},
url = {http://arxiv.org/abs/2303.09859},
abstract = {While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.},
language = {en},
urldate = {2023-05-05},
publisher = {arXiv},
author = {Samuel, David and Kutuzov, Andrey and Øvrelid, Lilja and Velldal, Erik},
month = mar,
year = {2023},
note = {0 citations (Semantic Scholar/arXiv) [2023-05-05]
arXiv:2303.09859 [cs]},
keywords = {Computer Science - Computation and Language},
}
{"_id":"gBWtd3ZHM43pyuNrN","bibbaseid":"samuel-kutuzov-vrelid-velldal-trainedon100millionwordsandstillinshapebertmeetsbritishnationalcorpus-2023","author_short":["Samuel, D.","Kutuzov, A.","Øvrelid, L.","Velldal, E."],"bibdata":{"bibtype":"misc","type":"misc","title":"Trained on 100 million words and still in shape: BERT meets British National Corpus","shorttitle":"Trained on 100 million words and still in shape","url":"http://arxiv.org/abs/2303.09859","abstract":"While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, wellbalanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.","language":"en","urldate":"2023-05-05","publisher":"arXiv","author":[{"propositions":[],"lastnames":["Samuel"],"firstnames":["David"],"suffixes":[]},{"propositions":[],"lastnames":["Kutuzov"],"firstnames":["Andrey"],"suffixes":[]},{"propositions":[],"lastnames":["Øvrelid"],"firstnames":["Lilja"],"suffixes":[]},{"propositions":[],"lastnames":["Velldal"],"firstnames":["Erik"],"suffixes":[]}],"month":"March","year":"2023","note":"0 citations (Semantic Scholar/arXiv) [2023-05-05] arXiv:2303.09859 [cs]","keywords":"Computer Science - Computation and Language","bibtex":"@misc{samuel_trained_2023,\n\ttitle = {Trained on 100 million words and still in shape: {BERT} meets {British} {National} {Corpus}},\n\tshorttitle = {Trained on 100 million words and still in shape},\n\turl = {http://arxiv.org/abs/2303.09859},\n\tabstract = {While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, wellbalanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. 
We propose an optimized LM architecture called LTG-BERT.},\n\tlanguage = {en},\n\turldate = {2023-05-05},\n\tpublisher = {arXiv},\n\tauthor = {Samuel, David and Kutuzov, Andrey and Øvrelid, Lilja and Velldal, Erik},\n\tmonth = mar,\n\tyear = {2023},\n\tnote = {0 citations (Semantic Scholar/arXiv) [2023-05-05]\narXiv:2303.09859 [cs]},\n\tkeywords = {Computer Science - Computation and Language},\n}\n\n","author_short":["Samuel, D.","Kutuzov, A.","Øvrelid, L.","Velldal, E."],"key":"samuel_trained_2023","id":"samuel_trained_2023","bibbaseid":"samuel-kutuzov-vrelid-velldal-trainedon100millionwordsandstillinshapebertmeetsbritishnationalcorpus-2023","role":"author","urls":{"Paper":"http://arxiv.org/abs/2303.09859"},"keyword":["Computer Science - Computation and Language"],"metadata":{"authorlinks":{}},"html":""},"bibtype":"misc","biburl":"https://bibbase.org/zotero/ifromm","dataSources":["N4kJAiLiJ7kxfNsoh"],"keywords":["computer science - computation and language"],"search_terms":["trained","100","million","words","still","shape","bert","meets","british","national","corpus","samuel","kutuzov","øvrelid","velldal"],"title":"Trained on 100 million words and still in shape: BERT meets British National Corpus","year":2023}