Understanding the effects of language-specific class imbalance in multilingual fine-tuning

Vincent Jung, Lonneke van der Plas
2024-02-20
Abstract: We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighing approach to imbalance by calculating class weights separately for each language and show that this helps mitigate those detrimental effects. These results create awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and the way in which the model learns to rely on the separation of languages to perform the task.
Computation and Language
What problem does this paper attempt to address?
The paper studies language-specific class imbalance in multilingual fine-tuning and proposes a way to mitigate it.

### Research Background and Problem

In multilingual classification tasks, the label distribution often differs from one language to another. This imbalance can lead to decreased model performance, more pronounced language separation in the latent space, and models relying on non-informative features for prediction. Although existing research has addressed data imbalance in monolingual settings, the impact of language-specific class imbalance has not been fully studied.

### Main Findings

1. **Performance Degradation**: Fine-tuning on a dataset with this imbalance leads to worse performance.
2. **Latent Space Separation**: Imbalanced datasets lead to a more pronounced separation of languages in the latent space.
3. **Reliance on Non-informative Features**: SHAP value analysis reveals that, on imbalanced datasets, models tend to rely on non-informative features for prediction and partially turn into language identifiers.

### Solution

The authors modify the traditional class weighting method so that a weight is calculated for each language and label combination (per-language class weighing). Experimental results show that this method improves model performance and reduces language separation in the latent space.

### Conclusion

By studying language-specific class imbalance in multilingual datasets, the authors reveal its impact on model performance and feature reliance, and propose a simple and effective mitigation. The practical lesson is to examine the joint distribution of labels and languages in a dataset, rather than only the marginal label distribution.
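
The per-language class weighting idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `per_language_class_weights`, the toy data, and the weighting formula (the common "balanced" heuristic, number of samples divided by number of classes times class count, applied within each language) are all assumptions made for illustration. The toy dataset is balanced overall but imbalanced within each language, so global class weights would not correct anything.

```python
from collections import Counter


def per_language_class_weights(languages, labels):
    """Return {(language, label): weight} using inverse class frequency
    computed separately inside each language (assumed formula, see lead-in)."""
    pair_counts = Counter(zip(languages, labels))
    lang_counts = Counter(languages)
    classes = sorted(set(labels))
    weights = {}
    for lang, n_lang in lang_counts.items():
        for cls in classes:
            n_pair = pair_counts.get((lang, cls), 0)
            if n_pair == 0:
                continue  # no examples for this pair, so no weight is needed
            weights[(lang, cls)] = n_lang / (len(classes) * n_pair)
    return weights


# Hypothetical toy data: labels are balanced overall (four 0s, four 1s),
# but English is mostly label 1 and French is mostly label 0.
languages = ["en", "en", "en", "en", "fr", "fr", "fr", "fr"]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

# Traditional (global) class weights see a balanced dataset: every weight is 1.0.
label_counts = Counter(labels)
global_weights = {
    cls: len(labels) / (len(label_counts) * n) for cls, n in label_counts.items()
}

# Per-language weights upweight the rare label inside each language.
weights = per_language_class_weights(languages, labels)

# One weight per training example, e.g. to scale a per-example loss.
sample_weights = [weights[(lang, y)] for lang, y in zip(languages, labels)]

print(global_weights)
print(weights)
```

In this example the minority pairs `("en", 0)` and `("fr", 1)` receive weight 2.0 and the majority pairs receive 2/3, while the global weights stay at 1.0 for both labels. The per-example weights could then multiply an unreduced loss during fine-tuning; how the weights enter the loss is likewise an assumption here.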