Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios

Aditya Joshi, Diptesh Kanojia, Heather Lent, Hour Kaing, Haiyue Song
2024-09-19
Abstract: Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in 'lower-resource' scenarios such as dialects/sociolects (national or social varieties of a language), Creoles (languages arising from linguistic contact between multiple languages) and other low-resource languages. This introductory tutorial will identify common challenges, approaches, and themes in natural language processing (NLP) research for confronting and overcoming the obstacles inherent to data-poor contexts. By connecting past ideas to the present field, this tutorial aims to ignite collaboration and cross-pollination between researchers working in these scenarios. Our notion of 'lower-resource' broadly denotes the outstanding lack of data required for model training, and may be applied to scenarios apart from the three covered in the tutorial.
Computation and Language, Artificial Intelligence