Abstract: The recent proliferation of Large Conversational Language Models has highlighted the economic significance of widespread access to this type of AI technology in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in widely spoken languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, to the detriment of their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. First, we present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset can be used to fine-tune language models so that they adhere more accurately to the instructions they are given. Second, as a demonstration of the dataset's utility, we fine-tuned LLaMA-7B on it, following the Alpaca format, so that the resulting model, Cabuxa-7B, comprehends and responds in Galician, a language not originally supported by the base model. This work contributes to the research on multilingual models tailored for low-resource settings, a crucial endeavor in ensuring the inclusion of all linguistic communities in the development of Large Language Models. Another noteworthy aspect of this research is the exploration of how knowledge of a closely related language, in this case Portuguese, can assist in generating coherent text when training resources are scarce. Both the Galician Alpaca dataset and Cabuxa-7B are publicly accessible on our Hugging Face Hub, and we have made the source code available to facilitate replication of this experiment and encourage further advancements for underrepresented languages.
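To make the "Alpaca format" mentioned in the abstract concrete, the following is a minimal sketch of how one instruction/demonstration record is typically turned into a training string. The field names (`instruction`, `input`, `output`) and the English template wording follow the original Stanford Alpaca release; whether the Galician adaptation keeps the English template or translates it is not stated in the abstract, so that is an assumption here. The example record and the helper names `build_prompt` and `build_training_text` are hypothetical and only illustrate the shape of the data.

```python
# Sketch of Alpaca-style prompt construction (assumed, not the authors' code).
# Template wording follows the original Stanford Alpaca release; the Galician
# adaptation may use a translated template instead.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Return the prompt part of a record, choosing the template by whether
    the optional 'input' field is non-empty."""
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


def build_training_text(record: dict) -> str:
    """Return prompt + demonstration, the text a model would be fine-tuned on."""
    return build_prompt(record) + record["output"]


# Hypothetical record, invented for illustration (not taken from the dataset).
example = {
    "instruction": "Traduce ao inglés a seguinte frase.",
    "input": "Bos días.",
    "output": "Good morning.",
}

if __name__ == "__main__":
    print(build_training_text(example))
```

During fine-tuning, the model sees the full string and learns to continue the text after `### Response:`; at inference time only the output of `build_prompt` is supplied.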

The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

Open Generative Large Language Models for Galician

Conversations in Galician: a Large Language Model for an Underrepresented Language

A Resource for Computational Experiments on Mapudungun

Responsible Multilingual Large Language Models: A Survey of Development, Applications, and Societal Impact

SambaLingo: Teaching Large Language Models New Languages

Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences

Improving Bilingual Capabilities of Language Models to Support Diverse Linguistic Practices in Education

NLP Progress in Indigenous Latin American Languages

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models

Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars

Dólares or Dollars? Unraveling the Bilingual Prowess of Financial LLMs Between Spanish and English

Socially Responsible Data for Large Multilingual Language Models

The Massively Multilingual Natural Language Understanding 2022 (MMNLU-22) Workshop and Competition

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

MLS: A Large-Scale Multilingual Dataset for Speech Research

MarIA: Spanish Language Models

Latxa: An Open Language Model and Evaluation Suite for Basque

DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures