The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. Laurençon, H., Saulnier, L., Wang, T., Akiki, C., Moral, A. V. d., Scao, T. L., Werra, L. V., Mou, C., Ponferrada, E. G., Nguyen, H., Frohberg, J., Šaško, M., Lhoest, Q., McMillan-Major, A., Dupont, G., Biderman, S., Rogers, A., Allal, L. B., Toni, F. D., Pistilli, G., Nguyen, O., Nikpoor, S., Masoud, M., Colombo, P., Rosa, J. d. l., Villegas, P., Thrush, T., Longpre, S., Nagel, S., Weber, L., Muñoz, M. R., Zhu, J., Strien, D. V., Alyafeai, Z., Almubarak, K., Chien, V. M., Gonzalez-Dios, I., Soroa, A., Lo, K., Dey, M., Suarez, P. O., Gokaslan, A., Bose, S., Adelani, D. I., Phan, L., Tran, H., Yu, I., Pai, S., Chim, J., Lepercq, V., Ilic, S., Mitchell, M., Luccioni, S., & Jernite, Y. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), Datasets and Benchmarks Track, October 2022.
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
@inproceedings{laurencon_bigscience_2022,
	title = {The {BigScience} {ROOTS} {Corpus}: {A} 1.6{TB} {Composite} {Multilingual} {Dataset}},
	booktitle = {Advances in {Neural} {Information} {Processing} {Systems} 35 ({NeurIPS} 2022), {Datasets} and {Benchmarks} {Track}},
	shorttitle = {The {BigScience} {ROOTS} {Corpus}},
	url = {https://openreview.net/forum?id=UoEw6KigkUn},
	abstract = {As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.},
	language = {en},
	urldate = {2023-03-17},
	author = {Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas and Akiki, Christopher and Moral, Albert Villanova del and Scao, Teven Le and Werra, Leandro Von and Mou, Chenghao and Ponferrada, Eduardo González and Nguyen, Huu and Frohberg, Jörg and Šaško, Mario and Lhoest, Quentin and McMillan-Major, Angelina and Dupont, Gérard and Biderman, Stella and Rogers, Anna and Allal, Loubna Ben and Toni, Francesco De and Pistilli, Giada and Nguyen, Olivier and Nikpoor, Somaieh and Masoud, Maraim and Colombo, Pierre and Rosa, Javier de la and Villegas, Paulo and Thrush, Tristan and Longpre, Shayne and Nagel, Sebastian and Weber, Leon and Muñoz, Manuel Romero and Zhu, Jian and Strien, Daniel Van and Alyafeai, Zaid and Almubarak, Khalid and Chien, Vu Minh and Gonzalez-Dios, Itziar and Soroa, Aitor and Lo, Kyle and Dey, Manan and Suarez, Pedro Ortiz and Gokaslan, Aaron and Bose, Shamik and Adelani, David Ifeoluwa and Phan, Long and Tran, Hieu and Yu, Ian and Pai, Suhas and Chim, Jenny and Lepercq, Violette and Ilic, Suzana and Mitchell, Margaret and Luccioni, Sasha and Jernite, Yacine},
	month = oct,
	year = {2022},
}

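A large initial subset of ROOTS is distributed through the Hugging Face Hub under the bigscience-data organization. As a minimal sketch (not taken from the paper), a released component might be streamed for inspection with the `datasets` library; the repository ID and the "text" field below are illustrative assumptions, and access to the released subsets may be gated behind the Hub's terms-of-use agreement.

from datasets import load_dataset

# Minimal sketch, not from the paper: stream one released ROOTS
# component from the Hugging Face Hub. The repository ID is
# illustrative; released subsets live under the bigscience-data
# organization and may require accepting access terms first.
ds = load_dataset(
    "bigscience-data/roots_en_wikipedia",  # hypothetical component ID
    split="train",
    streaming=True,  # iterate without downloading the full component
)

# Preview the first three documents (assumes a "text" field).
for i, example in enumerate(ds):
    print(example["text"][:200].replace("\n", " "))
    if i == 2:
        break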