Abstract: The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technology in the current information age. Nevertheless, prevailing models have primarily been trained on corpora of documents written in widely used languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, to the detriment of their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset can be used to fine-tune language models so that they adhere more accurately to the instructions they are given. Additionally, as a demonstration of the dataset's utility, we fine-tuned LLaMA-7B, following the Alpaca format, to comprehend and respond in Galician, a language not originally supported by the model. This work contributes to the research on multilingual models tailored for low-resource settings, a crucial endeavor in ensuring the inclusion of all linguistic communities in the development of Large Language Models. Another noteworthy aspect of this research is the exploration of how knowledge of a closely related language, in this case Portuguese, can assist in generating coherent text when training resources are scarce. Both the Galician Alpaca dataset and the fine-tuned model, Cabuxa-7B, are publicly accessible on our Hugging Face Hub, and we have made the source code available to facilitate replication of this experiment and encourage further advancements for underrepresented languages.
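
As a rough illustration of what "following the Alpaca format" involves, the sketch below turns one instruction record into a training string. It assumes the `instruction` / `input` / `output` fields and the English prompt template of the original Stanford Alpaca release; whether the authors kept that template verbatim or translated it is not stated in the abstract. The example record and the helper names (`build_prompt`, `build_training_text`) are hypothetical and are not taken from the paper's source code.

```python
from typing import TypedDict


class AlpacaRecord(TypedDict):
    """One instruction/demonstration pair (field names as in Stanford Alpaca)."""
    instruction: str
    input: str   # optional context; empty string when the task needs none
    output: str  # the demonstration the model should learn to produce


# Prompt templates from the original Alpaca release (assumed unchanged here).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: AlpacaRecord) -> str:
    """Render the prompt part, picking the template by whether input is present."""
    if record["input"].strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


def build_training_text(record: AlpacaRecord) -> str:
    """Prompt followed by the demonstration, i.e. one fine-tuning example."""
    return build_prompt(record) + record["output"]


# Hypothetical Galician record, for illustration only.
example: AlpacaRecord = {
    "instruction": "Traduce ao galego a seguinte frase.",
    "input": "Good morning.",
    "output": "Bos días.",
}

if __name__ == "__main__":
    print(build_training_text(example))
```

At inference time only `build_prompt` would be used, and the model's continuation after `### Response:` is taken as its answer.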

ChatSubs: A dataset of dialogues in Spanish, Catalan, Basque and Galician extracted from movie subtitles for developing advanced conversational models

OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts

IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling

MultiSubs: A Large-scale Multimodal and Multilingual Dataset

XDailyDialog: A Multilingual Parallel Dialogue Corpus

SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan

Audio Dialogues: Dialogues dataset for audio and music understanding

Fine-grained Emotion and Intent Learning in Movie Dialogues

SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation

Dialogs Re-enacted Across Languages

DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents

Conversational Feedback in Scripted versus Spontaneous Dialogues: A Comparative Analysis

Searching for Snippets of Open-Domain Dialogue in Task-Oriented Dialogue Datasets

Conversations in Galician: a Large Language Model for an Underrepresented Language

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

A Resource for Computational Experiments on Mapudungun

MedDialog: Two Large-scale Medical Dialogue Datasets

Interview: A Large-Scale Open-Source Corpus of Media Dialog

MedDialog: A Large-scale Medical Dialogue Dataset