IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad B, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra
DOI: https://doi.org/10.18653/v1/2024.acl-long.843
2024-11-29
Abstract: Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generate non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the scarcity of resources for Indian languages in large language models (LLMs). Despite significant progress in building English LLMs, the development of models for other languages has been hindered by a lack of tailored resources. Specifically, although the languages of the Indian subcontinent are spoken by more than 1.4 billion people, they are scarcely represented in the training datasets and tokenizers of current open-source LLMs, so the cultural backgrounds and nuances of these languages are neglected. To bridge this gap, the authors introduce a comprehensive resource suite named IndicLLMSuite, which provides pre-training and fine-tuning datasets for 22 Indian languages. The main contributions of the paper are:

1. **SANGRAHA**: Pre-training data containing 251 billion tokens, covering content extracted from selected URLs, existing multilingual corpora, and large-scale translations.
2. **SETU**: A Spark-based distributed pipeline designed to extract content from websites, PDFs, and videos, with built-in cleaning, filtering, toxicity removal, and deduplication stages.
3. **INDIC ALIGN - INSTRUCT**: A dataset containing 74.8 million instruction-response pairs, collected through four methods: aggregating existing instruction fine-tuning datasets, translating English datasets into 14 Indian languages, using open-source LLMs to create contextual conversations from Indian Wikipedia articles, and establishing a crowdsourcing platform named Anudesh to collect instructions.
4. **INDIC ALIGN - TOXIC**: 123,000 pairs of toxic prompts and non-toxic responses for the safety alignment of Indian-language LLMs.

Through these resources, the authors hope not only to promote the research and development of Indian-language LLMs but also to provide an open-source blueprint for other languages.
All code, tools, and datasets will be publicly released to facilitate community cooperation in jointly training high-quality Indian-language LLMs.
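
The cleaning, filtering, flagging, and deduplication stages attributed to the SETU pipeline can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: the function names, the five-word length threshold, the placeholder term list, and the use of exact SHA-256 matching for deduplication are all assumptions made purely for illustration.

```python
import hashlib
import re
import unicodedata

# Hypothetical placeholder list; a real pipeline would need curated
# per-language resources rather than a single hard-coded term.
FLAGGED_TERMS = {"badword"}


def clean(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Filter out very short documents (the threshold is an assumption)."""
    return len(text.split()) >= min_words


def is_flagged(text: str) -> bool:
    """Flag documents containing any term from the placeholder list."""
    return bool(set(text.lower().split()) & FLAGGED_TERMS)


def run_pipeline(docs):
    """Clean, filter, flag, and exact-deduplicate an iterable of raw strings."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean(raw)
        if not keep(text) or is_flagged(text):
            continue
        # Exact deduplication on the cleaned text via a content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Calling `run_pipeline` on a list of raw strings returns the cleaned, unflagged, unique documents in their original order. Exact hashing only catches identical text after cleaning; catching near-duplicates would need a fuzzy method, which this sketch does not attempt.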