Abstract: Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Creating supervised labeled datasets for low-resource languages poses a considerable challenge. Unlocking the potential of low-resource languages requires robust datasets with supervised labels; however, such datasets are scarce, and the label space is often limited. To address this gap, we aim to make better use of existing labels and datasets in different languages. This research proposes a novel perspective on Universal Cross-Lingual Text Classification, leveraging a unified model across languages. Our approach blends supervised data from different languages during training to create a universal model. The supervised data for a target classification task might come from different languages covering different labels. The primary goal is to enhance label and language coverage, aiming for a label set that represents the union of labels from the various languages. We propose using a strong multilingual SBERT as our base model, which makes our novel training strategy feasible. This strategy contributes to the adaptability and effectiveness of the model in cross-lingual transfer scenarios, where it can categorize text in languages not encountered during training. Thus, the paper delves into the intricacies of cross-lingual text classification, with a particular focus on its application to low-resource languages, exploring methodologies and implications for the development of a robust and adaptable universal cross-lingual model.

What problem does this paper attempt to address?

The paper mainly addresses the challenges of low-resource languages in natural language processing (NLP) tasks, particularly the issue of scarce supervised data in text classification tasks. Specifically, the goal of the paper is to optimize existing labels and datasets in different languages through a new approach to achieve broader label coverage and improve support for low-resource languages. The core contribution of the paper is the proposal of a universal cross-lingual text classification method aimed at training a unified model capable of handling multiple languages and labels. This method includes the following key points:

1. **Problem Background**: The main issue faced by low-resource languages is the lack of annotated corpora, dictionaries, and grammatical resources, which limits the amount of data and types of labels available for training. Although existing multilingual models can address cross-lingual tasks to some extent, they are still constrained by the label space of the single language used during training.
2. **Solution**: To overcome these limitations, the paper proposes a new strategy of mixing supervised data from different languages for training. The goal of this approach is to create a "universal" model capable of handling all languages and all labels, thereby enhancing the model's adaptability and effectiveness.
3. **Methodology**: The paper employs a powerful multilingual Sentence-BERT as the base model and validates its performance in cross-lingual scenarios through experiments. The study also compares the performance of different models, including LaBSE and LASER, and ultimately selects IndicSBERT as the best candidate model.
4. **Experimental Results**: Through a series of experiments, including cross-lingual text classification and universal cross-lingual text classification, the paper demonstrates that the proposed model can effectively improve the ability to classify text in unseen languages, particularly excelling in terms of label coverage.

In summary, this paper provides an effective solution to the challenges of low-resource languages in the field of natural language processing by proposing a novel universal cross-lingual text classification method.
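The data-blending idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the language codes, texts, labels, and the helper names `blend` and `label_coverage` are all hypothetical. It only shows how per-language datasets that each cover a subset of labels could be pooled so that the training label set becomes the union of labels across languages.

```python
from collections import defaultdict

# Hypothetical toy data: each language contributes examples for only a
# subset of the labels. Language codes, texts, and labels are illustrative.
DATASETS = {
    "hi": [("hindi sports text", "sports"), ("hindi business text", "business")],
    "mr": [("marathi entertainment text", "entertainment"), ("marathi sports text", "sports")],
    "bn": [("bengali technology text", "technology")],
}


def blend(datasets):
    """Merge per-language datasets into one training pool.

    Returns (examples, label_to_id), where each example is a
    (text, label_id, language) triple and label_to_id covers the union
    of labels seen in any language.
    """
    labels = sorted({label for rows in datasets.values() for _, label in rows})
    label_to_id = {label: i for i, label in enumerate(labels)}
    examples = [
        (text, label_to_id[label], lang)
        for lang, rows in datasets.items()
        for text, label in rows
    ]
    return examples, label_to_id


def label_coverage(datasets):
    """Map each label to the sorted list of languages that supply it."""
    coverage = defaultdict(set)
    for lang, rows in datasets.items():
        for _, label in rows:
            coverage[label].add(lang)
    return {label: sorted(langs) for label, langs in coverage.items()}


examples, label_to_id = blend(DATASETS)
coverage = label_coverage(DATASETS)
```

In a full pipeline, the pooled texts would then be encoded with a multilingual sentence encoder and passed to a single classifier over the union label set. That step is left out here because it depends on the choice of model and library.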

Universal Cross-Lingual Text Classification
