Trained on 100 million words and still in shape: BERT meets British National Corpus. Samuel, D., Kutuzov, A., Øvrelid, L., & Velldal, E. March 2023. arXiv:2303.09859 [cs]. 0 citations (Semantic Scholar/arXiv) [2023-05-05]
While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
@misc{samuel_trained_2023,
	title = {Trained on 100 million words and still in shape: {BERT} meets {British} {National} {Corpus}},
	shorttitle = {Trained on 100 million words and still in shape},
	url = {http://arxiv.org/abs/2303.09859},
	abstract = {While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.},
	language = {en},
	urldate = {2023-05-05},
	publisher = {arXiv},
	author = {Samuel, David and Kutuzov, Andrey and Øvrelid, Lilja and Velldal, Erik},
	month = mar,
	year = {2023},
	note = {0 citations (Semantic Scholar/arXiv) [2023-05-05]; arXiv:2303.09859 [cs]},
	keywords = {Computer Science - Computation and Language},
}
