Abstract: Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.

What problem does this paper attempt to address?

This paper addresses the scarcity of resources for Indian languages in large language models (LLMs). Despite significant progress in building English LLMs, the development of models for other languages has been hindered by a lack of tailored resources. Although the languages of the Indian subcontinent are spoken by more than 1.4 billion people, they are barely represented in the training datasets and tokenizers of current open-source LLMs, so the cultural backgrounds and nuances of these languages are neglected. To bridge this gap, the authors introduce a comprehensive resource suite named IndicLLMSuite, which provides pre-training and fine-tuning datasets for 22 Indian languages. The main contributions of the paper are:

1. **SANGRAHA**: Pre-training data containing 251 billion tokens, covering content extracted from selected URLs, existing multilingual corpora, and large-scale translations.
2. **SETU**: A Spark-based distributed pipeline designed to extract content from websites, PDFs, and videos, with built-in cleaning, filtering, toxicity removal, and deduplication stages.
3. **INDIC ALIGN - INSTRUCT**: A dataset containing 74.8 million instruction-response pairs, collected through four methods: aggregating existing instruction fine-tuning datasets, translating English datasets into 14 Indian languages, using open-source LLMs to create contextual conversations from Indian Wikipedia articles, and establishing a crowdsourcing platform named Anudesh to collect instructions.
4. **INDIC ALIGN - TOXIC**: 123,000 pairs of toxic prompts and non-toxic responses for the safety alignment of Indian-language LLMs.

Through these resources, the authors hope not only to promote the research and development of Indian-language LLMs but also to provide an open-source blueprint for other languages.
All code, tools, and datasets will be publicly released to facilitate community cooperation in jointly training high-quality Indian-language LLMs.
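The cleaning, filtering, and deduplication stages mentioned for the data pipeline can be illustrated with a small sketch. This is not the authors' Spark-based implementation: it is a single-machine Python approximation, and the function names (`clean`, `keep`, `deduplicate`, `run_pipeline`), the minimum-length threshold, the word blocklist, and the use of exact hash-based deduplication are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode (NFC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5, blocklist: frozenset = frozenset()) -> bool:
    """Hypothetical filter: drop very short documents and documents
    containing any blocklisted word (a crude stand-in for toxicity flagging)."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    return not any(word in blocklist for word in words)


def deduplicate(docs):
    """Exact deduplication: keep the first document seen for each content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def run_pipeline(raw_docs, blocklist=frozenset()):
    """Apply the stages in order: clean, filter, then deduplicate."""
    cleaned = [clean(doc) for doc in raw_docs]
    filtered = [doc for doc in cleaned if keep(doc, blocklist=blocklist)]
    return deduplicate(filtered)


if __name__ == "__main__":
    raw = [
        "This  is   a small   test sentence here.",
        "This is a small test sentence here.",
        "too short",
        "this text has a badword in it",
    ]
    print(run_pipeline(raw, blocklist=frozenset({"badword"})))
```

Cleaning runs before deduplication so that documents differing only in whitespace or Unicode normalization form collapse to the same hash. A production pipeline would likely add near-duplicate detection and language-aware filters, which this sketch omits.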

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages
