Abstract: In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, only a handful of works have explored ICL for low-resource languages, and most of them focus on relatively high-resource languages such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes that few-shot in-context information is significant for enhancing the low-resource understanding quality of LLMs: semantically relevant information closes the language gap in the target language and aligns the semantics between the targeted low-resource language and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages. Our code is publicly released at https://github.com/SamuelCahyawijaya/in-context-alignment

GlotLID: Language Identification for Low-Resource Languages

GlotScript: A Resource and Tool for Low Resource Writing System Identification

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Short Text Language Identification for Under Resourced Languages

PinLID: a dataset for Pinglish language identification based on code-mixing sentence on unstructured resources

MaskLID: Code-Switching Language Identification through Iterative Masking

Do Large Language Models Speak All Languages Equally? A Comparative Study in Low-Resource Settings

LIDE: Language Identification from Text Documents

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Geographically-Informed Language Identification

Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

Low-Resource Language Identification From Speech Using Transfer Learning

LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

From N-grams to Pre-trained Multilingual Models For Language Identification

Language Variety Identification with True Labels

LLMs Are Few-Shot In-Context Low-Resource Language Learners

Better to Ask in English: Evaluation of Large Language Models on English, Low-resource and Cross-Lingual Settings

A New Massive Multilingual Dataset for High-Performance Language Technologies

Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset

Language Identification for Austronesian Languages