The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon,Lucile Saulnier,Thomas Wang,Christopher Akiki,Albert Villanova del Moral,Teven Le Scao,Leandro Von Werra,Chenghao Mou,Eduardo González Ponferrada,Huu Nguyen,Jörg Frohberg,Mario Šaško,Quentin Lhoest,Angelina McMillan-Major,Gerard Dupont,Stella Biderman,Anna Rogers,Loubna Ben allal,Francesco De Toni,Giada Pistilli,Olivier Nguyen,Somaieh Nikpoor,Maraim Masoud,Pierre Colombo,Javier de la Rosa,Paulo Villegas,Tristan Thrush,Shayne Longpre,Sebastian Nagel,Leon Weber,Manuel Muñoz,Jian Zhu,Daniel Van Strien,Zaid Alyafeai,Khalid Almubarak,Minh Chien Vu,Itziar Gonzalez-Dios,Aitor Soroa,Kyle Lo,Manan Dey,Pedro Ortiz Suarez,Aaron Gokaslan,Shamik Bose,David Adelani,Long Phan,Hieu Tran,Ian Yu,Suhas Pai,Jenny Chim,Violette Lepercq,Suzana Ilic,Margaret Mitchell,Sasha Alexandra Luccioni,Yacine Jernite
2023-03-07
Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
Computation and Language,Artificial Intelligence