Ukrainian Texts Classification: Exploration of Cross-lingual Knowledge Transfer Approaches

Daryna Dementieva, Valeriia Khylenko, Georg Groh
2024-04-02
Abstract: Despite the large number of labeled datasets available for NLP text classification, data availability remains persistently imbalanced across languages. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To our knowledge, there is a severe lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing the "recipe" for the optimal setups.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses data scarcity in Ukrainian natural language processing (NLP) text classification tasks. Although numerous annotated datasets exist in the NLP field, data availability remains significantly imbalanced across languages. Ukrainian in particular lacks sufficient text classification corpora, so it stands to benefit from improved cross-lingual knowledge transfer methods.

To tackle this challenge, the authors explore several cross-lingual knowledge transfer methods that avoid manual data curation, including large multilingual encoders, translation systems, large language models (LLMs), and language adapters. These methods are applied to three text classification tasks: toxicity classification, formality classification, and natural language inference (NLI), with the goal of identifying the optimal setup for each.

### Main Contributions:

1. **Designed the first text classification system for Ukrainian**: covering the tasks of toxicity classification, formality classification, and NLI.
2. **Explored four cross-lingual knowledge transfer methods**: Backtranslation, LLM Prompting, Training Corpus Translation, and Adapter Training.
3. **Tested the effectiveness of these methods on synthetic translation datasets and natural test sets**, providing an in-depth analysis of how well each method works.

Through this research, the authors hope to fill a gap in the existing literature and support the development of Ukrainian NLP technology.
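
To make two of the listed transfer methods more concrete, here is a minimal Python sketch. It rests on assumptions the summary does not spell out: that "Backtranslation" means translating the Ukrainian input into English at inference time and applying an English classifier, and that "Training Corpus Translation" means translating labeled English training data into Ukrainian before fine-tuning. Every function name, the toy dictionary "translator", and the keyword "classifier" are hypothetical stand-ins for illustration, not the authors' models or code.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases: any real MT system or classifier could be plugged in.
Translator = Callable[[str], str]
Classifier = Callable[[str], str]


def classify_via_backtranslation(
    text_uk: str, uk_to_en: Translator, english_classifier: Classifier
) -> str:
    """Assumed 'Backtranslation' setup: translate the input to English,
    then reuse a classifier trained on English data."""
    return english_classifier(uk_to_en(text_uk))


def translate_training_corpus(
    corpus_en: List[Tuple[str, str]], en_to_uk: Translator
) -> List[Tuple[str, str]]:
    """Assumed 'Training Corpus Translation' setup: translate each English
    training text into Ukrainian and keep its label unchanged."""
    return [(en_to_uk(text), label) for text, label in corpus_en]


# --- Toy stand-ins (illustrative only, not real models) ---
TOY_UK_EN = {"дякую": "thank you", "дурень": "idiot"}
TOY_EN_UK = {en: uk for uk, en in TOY_UK_EN.items()}


def toy_uk_to_en(text: str) -> str:
    # Dictionary lookup standing in for a translation system.
    return TOY_UK_EN.get(text.lower(), text)


def toy_en_to_uk(text: str) -> str:
    return TOY_EN_UK.get(text.lower(), text)


def toy_english_toxicity_classifier(text: str) -> str:
    # Keyword rule standing in for a fine-tuned English toxicity model.
    return "toxic" if "idiot" in text.lower() else "neutral"


if __name__ == "__main__":
    label = classify_via_backtranslation(
        "дурень", toy_uk_to_en, toy_english_toxicity_classifier
    )
    print(label)

    uk_corpus = translate_training_corpus(
        [("thank you", "neutral"), ("idiot", "toxic")], toy_en_to_uk
    )
    print(uk_corpus)
```

Under these assumptions, the two setups differ in where translation happens: the first translates every input at inference time and needs no Ukrainian training data, while the second translates the corpus once and yields a synthetic Ukrainian training set for a multilingual encoder.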