Universal Cross-Lingual Text Classification

Riya Savant, Anushka Shelke, Sakshi Todmal, Sanskruti Kanphade, Ananya Joshi, Raviraj Joshi
DOI: https://doi.org/10.1109/I2CT61223.2024.10543381
2024-06-17
Abstract: Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Creating supervised labeled datasets for low-resource languages poses a considerable challenge. Unlocking the language potential of low-resource languages requires robust datasets with supervised labels. However, such datasets are scarce, and the label space is often limited. In our pursuit to address this gap, we aim to optimize existing labels/datasets in different languages. This research proposes a novel perspective on Universal Cross-Lingual Text Classification, leveraging a unified model across languages. Our approach involves blending supervised data from different languages during training to create a universal model. The supervised data for a target classification task might come from different languages covering different labels. The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages. We propose the usage of a strong multilingual SBERT as our base model, making our novel training strategy feasible. This strategy contributes to the adaptability and effectiveness of the model in cross-lingual language transfer scenarios, where it can categorize text in languages not encountered during training. Thus, the paper delves into the intricacies of cross-lingual text classification, with a particular focus on its application for low-resource languages, exploring methodologies and implications for the development of a robust and adaptable universal cross-lingual model.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper mainly addresses the challenges of low-resource languages in natural language processing (NLP) tasks, particularly the issue of scarce supervised data in text classification tasks. Specifically, the goal of the paper is to optimize existing labels and datasets in different languages through a new approach to achieve broader label coverage and improve support for low-resource languages. The core contribution of the paper is the proposal of a universal cross-lingual text classification method aimed at training a unified model capable of handling multiple languages and labels. This method includes the following key points:

1. **Problem Background**: The main issue faced by low-resource languages is the lack of annotated corpora, dictionaries, and grammatical resources, which limits the amount of data and types of labels available for training. Although existing multilingual models can address cross-lingual tasks to some extent, they are still constrained by the label space of the single language used during training.
2. **Solution**: To overcome these limitations, the paper proposes a new strategy of mixing supervised data from different languages for training. The goal of this approach is to create a "universal" model capable of handling all languages and all labels, thereby enhancing the model's adaptability and effectiveness.
3. **Methodology**: The paper employs a powerful multilingual Sentence-BERT as the base model and validates its performance in cross-lingual scenarios through experiments. The study also compares the performance of different models, including LaBSE and LASER, and ultimately selects IndicSBERT as the best candidate model.
4. **Experimental Results**: Through a series of experiments, including cross-lingual text classification and universal cross-lingual text classification, the paper demonstrates that the proposed model can effectively improve the ability to classify text in unseen languages, particularly excelling in terms of label coverage.

In summary, this paper provides an effective solution to the challenges of low-resource languages in the field of natural language processing by proposing a novel universal cross-lingual text classification method.
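
The data-blending idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`blend_datasets`, `fit_centroids`, `predict`), the example sentences, and the nearest-centroid classifier are all assumptions made for illustration, and `toy_encode` is a hand-written stand-in for a multilingual sentence encoder such as the multilingual SBERT the paper proposes. The sketch only shows how pooling labeled data from languages with different label sets yields a classifier over the union of labels, which can then be applied to a language that contributed no training data.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)
Encoder = Callable[[str], List[float]]


def blend_datasets(per_language: Dict[str, List[Example]]) -> Tuple[List[Example], List[str]]:
    """Pool labeled examples from every language; the label set is the union."""
    pooled = [ex for examples in per_language.values() for ex in examples]
    labels = sorted({label for _, label in pooled})
    return pooled, labels


def fit_centroids(examples: List[Example], encode: Encoder) -> Dict[str, List[float]]:
    """Average the embeddings per label (nearest-centroid is an assumed stand-in classifier)."""
    grouped: Dict[str, List[List[float]]] = defaultdict(list)
    for text, label in examples:
        grouped[label].append(encode(text))
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(text: str, centroids: Dict[str, List[float]], encode: Encoder) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    vec = encode(text)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )


# Hypothetical toy encoder: maps a few words from three languages onto shared
# "concept" dimensions. A real setup would call a multilingual SBERT model here.
TOY_LEXICON = {
    "cricket": 0, "क्रिकेट": 0, "கிரிக்கெட்": 0,
    "election": 1, "चुनाव": 1, "தேர்தல்": 1,
}


def toy_encode(text: str) -> List[float]:
    vec = [0.0, 0.0]
    for token in text.lower().split():
        if token in TOY_LEXICON:
            vec[TOY_LEXICON[token]] += 1.0
    return vec


# Hindi supplies only the "sports" label; English supplies only "politics".
data = {
    "hi": [("क्रिकेट मैच आज", "sports")],
    "en": [("election results announced", "politics")],
}
pooled, labels = blend_datasets(data)
centroids = fit_centroids(pooled, toy_encode)
```

With this toy setup, `labels` covers both classes even though no single language does, and `predict("தேர்தல் முடிவுகள்", centroids, toy_encode)` classifies a Tamil sentence although Tamil contributed no training examples. How well this transfers in practice depends entirely on the quality of the real multilingual encoder substituted for `toy_encode`.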