Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection

Gretel Liz De la Peña Sarracén,Paolo Rosso,Robert Litschko,Goran Glavaš,Simone Paolo Ponzetto
2023-11-04
Abstract: Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques based on vicinal risk minimization and propose MIXAG, a novel data augmentation method which interpolates pairs of instances based on the angle of their representations. Our experiments involve seven languages typologically distinct from English and three different domains. The results reveal that the data augmentation strategies can enhance few-shot cross-lingual abusive language detection. Specifically, we observe that, consistently across all target languages, MIXAG improves significantly in multidomain and multilingual environments. Finally, we show through an error analysis how domain adaptation can favour the class of abusive texts (reducing false negatives) but, at the same time, reduces the precision of the abusive language detection model.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses abusive language detection in few-shot cross-lingual transfer. Specifically, the researchers focus on how to improve abusive language detection in target languages with limited resources. Although cross-lingual transfer learning from high-resource languages to medium- and low-resource languages has shown encouraging results, the scarcity of target-language resources remains a challenge. To this end, the paper explores data augmentation and continual pre-training to improve cross-lingual abusive language detection.

### Main contributions

1. **Dataset expansion**: The researchers rely on a multi-domain, multilingual abusive language detection dataset and extend it to Spanish through human translation.
2. **Improvement of few-shot cross-lingual transfer at the data level**: Vicinal Risk Minimization (VRM) is used to generate synthetic samples, enriching the information available to the model fine-tuned on the target language. Three VRM-based techniques are used in the study: SSMBA, MIXUP and MIXAG.
3. **Unsupervised language adaptation**: The study simulates a completely unsupervised setting by removing the label information of the target language, explores how to cope with the missing information in zero-shot transfer, and performs domain adaptation through Masked Language Modeling (MLM).

### Research questions

1. **RQ1**: What is the role of VRM-based techniques in few-shot cross-lingual abusive language detection?
2. **RQ2**: How do different languages affect few-shot cross-lingual abusive language detection?
3. **RQ3**: How do VRM-based techniques perform in the domain specialization of cross-lingual abusive language detection models?

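
To make the interpolation idea behind the VRM-based techniques more concrete, here is a minimal Python sketch of standard mixup: a convex combination of two instances and their labels with a weight drawn from a Beta distribution. It also includes a helper that measures the angle between an original vector and the synthetic one. This is not the paper's MIXAG formulation, which this summary does not specify; the function names, the `alpha=0.4` value, and the toy two-dimensional vectors are assumptions chosen for illustration.

```python
import math
import random


def mixup(x_i, x_j, y_i, y_j, alpha=0.4, rng=random):
    """Interpolate two (vector, one-hot label) pairs with a Beta-sampled weight."""
    lam = rng.betavariate(alpha, alpha)  # interpolation weight in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


# Toy usage: two hypothetical sentence representations with one-hot labels.
rng = random.Random(0)
x_a, y_a = [1.0, 0.0], [1.0, 0.0]  # assumed "abusive" example
x_b, y_b = [0.0, 1.0], [0.0, 1.0]  # assumed "non-abusive" example
x_new, y_new, lam = mixup(x_a, x_b, y_a, y_b, rng=rng)
print("weight:", lam)
print("angle to first original (rad):", angle_between(x_a, x_new))
```

In practice the vectors would be model representations rather than hand-written lists. The angle helper only illustrates how the geometric relation between an original and a synthetic instance can be inspected.
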
### Experimental results

- **Effect of VRM techniques**: VRM-based techniques improve few-shot cross-lingual transfer in most cases. In particular, multilingual MIXAG performs best in multi-domain and multilingual environments.
- **Language influence**: All languages except German benefit from few-shot cross-lingual transfer. SSMBA performs well across languages but poorly in the TRAC domain.
- **Multilingual strategy**: Multilingual MIXAG significantly outperforms the other variants, indicating that controlling the angle between the original text and the newly synthesized text is important for multilingual data.
- **Unsupervised language adaptation**: Masked language modeling on unlabeled data can improve the model's adaptation to abusive terms in the target language in zero-shot transfer.

### Conclusion

By introducing VRM-based data augmentation techniques, the paper significantly improves few-shot cross-lingual abusive language detection, especially in multilingual and multi-domain environments. The study also explores how domain adaptation can improve the model when only unlabeled data is available. These results are particularly relevant for languages with limited resources.
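
As an illustration of the masked language modeling objective used for unsupervised adaptation, the following Python sketch corrupts a token sequence with the masking scheme popularized by BERT: select about 15% of tokens, then replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged. The summary does not state the masking settings used in the paper, so these ratios, the `[MASK]` string, and the `mask_tokens` helper are assumptions.

```python
import random

MASK = "[MASK]"  # assumed mask token string


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (corrupted tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None elsewhere,
    so a model would be trained to predict only the selected positions.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)  # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)  # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


# Toy usage on a hypothetical unlabeled target-language sentence.
rng = random.Random(1)
sentence = ["este", "es", "un", "texto", "de", "ejemplo"]
vocab = sorted(set(sentence))
print(mask_tokens(sentence, vocab, rng=rng))
```

Because no class labels are involved, this objective can be applied to unlabeled target-language text, which is what makes it usable in the zero-shot setting described above.
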