Abstract: Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques based on vicinal risk minimization and propose MIXAG, a novel data augmentation method which interpolates pairs of instances based on the angle of their representations. Our experiments involve seven languages typologically distinct from English and three different domains. The results reveal that the data augmentation strategies can enhance few-shot cross-lingual abusive language detection. Specifically, we observe that, consistently in all target languages, MIXAG improves significantly in multidomain and multilingual environments. Finally, we show through an error analysis how the domain adaptation can favour the class of abusive texts (reducing false negatives) but, at the same time, reduces the precision of the abusive language detection model.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses abusive language detection in few-shot cross-lingual settings. Specifically, the researchers focus on improving detection performance in target languages with limited resources. Although cross-lingual transfer learning from high-resource to medium- and low-resource languages has shown encouraging results, the scarcity of target-language resources remains a challenge. To this end, the paper explores data augmentation and continual pre-training to improve cross-lingual abusive language detection.

### Main contributions

1. **Dataset expansion**: The researchers rely on a multi-domain, multilingual abusive language detection dataset and extend it to Spanish through human translation.
2. **Improvement of few-shot cross-lingual transfer learning at the data level**: Vicinal Risk Minimization (VRM) is used to generate synthetic samples and increase the amount of information available when fine-tuning the model on the target language. Three VRM-based techniques are used in the study: SSMBA, MIXUP and MIXAG.
3. **Unsupervised language adaptation**: The study simulates a completely unsupervised setting by removing the label information of the target language, explores strategies for coping with the lack of information in zero-shot transfer, and performs domain adaptation through Masked Language Modeling (MLM).

### Research questions

1. **RQ1**: What is the role of VRM-based techniques in few-shot cross-lingual abusive language detection?
2. **RQ2**: What are the effects of different languages on few-shot cross-lingual abusive language detection?
3. **RQ3**: How do VRM-based techniques perform in the domain specialization of cross-lingual abusive language detection models?

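
To make the VRM-based augmentation named in the contributions more concrete, here is a minimal Python sketch of the standard mixup interpolation, x_new = λ·x_i + (1 − λ)·x_j, with λ drawn from a Beta(α, α) distribution. It is not the paper's implementation: the function names, the toy vectors and the value α = 0.4 are assumptions made for illustration. The summary does not give the MIXAG formula, so it is not reproduced here; the `angle_between` helper only measures the angle between an original and a synthetic representation, which is the quantity MIXAG is described as being based on.

```python
import math
import random


def sample_lambda(alpha, rng):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)


def mixup(x_i, x_j, y_i, y_j, lam):
    """Linearly interpolate two representations and their one-hot labels."""
    x_new = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_new = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_new, y_new


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


# Toy usage: two 2-d "representations" with opposite one-hot labels (hypothetical data).
rng = random.Random(0)
lam = sample_lambda(0.4, rng)  # alpha = 0.4 is an assumed hyperparameter
x_new, y_new = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam)
print("lambda:", lam)
print("synthetic instance:", x_new, "soft label:", y_new)
print("angle to original (rad):", angle_between([1.0, 0.0], x_new))
```

With plain mixup, the angle between the original and the synthetic instance is a by-product of λ rather than something chosen directly.
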
### Experimental results

- **Effect of VRM techniques**: Experimental results show that VRM-based techniques improve few-shot cross-lingual transfer learning in most cases. In particular, multilingual MIXAG performs best in multi-domain and multilingual environments.
- **Language influence**: All languages except German benefit from few-shot cross-lingual transfer learning. SSMBA performs well in all languages but performs poorly in the TRAC domain.
- **Multilingual strategy**: Multilingual MIXAG significantly outperforms the other variants, indicating that controlling the angle between the original text and the newly synthesized text is important for multilingual data.
- **Unsupervised language adaptation**: Masked language modeling on unlabeled data can improve the model's adaptation to target-language abusive terms in zero-shot transfer.

### Conclusion

By introducing VRM-based data augmentation techniques, this paper significantly improves few-shot cross-lingual abusive language detection, especially in multilingual and multi-domain environments. The study also explores how domain adaptation can improve the model when only unlabeled data is available. These results are particularly relevant for languages with limited resources.
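
The masked language modeling step used for unsupervised adaptation can be illustrated with a simplified masking routine. This is a sketch under stated assumptions, not the paper's training code: the `[MASK]` token string, the 15% masking rate and the function name are illustrative choices, and the routine omits the random-replacement and keep-unchanged variants used in BERT-style pre-training.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol; the real one depends on the tokenizer


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each token with MASK_TOKEN with probability mask_prob.

    Returns (masked_tokens, labels). labels holds the original token at
    masked positions and None elsewhere, so a loss would be computed only
    on the positions the model has to reconstruct.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Toy usage on a whitespace-tokenized, unlabeled target-language sentence.
sentence = "este es un ejemplo de texto sin etiquetas".split()
masked, labels = mask_tokens(sentence, mask_prob=0.15, rng=random.Random(0))
print(masked)
print(labels)
```

Because the objective only needs raw text, it can be applied to target-language data without any abusive/non-abusive labels, which matches the zero-shot setting described above.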

Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection

Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning

How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have

Transfer Language Selection for Zero-Shot Cross-Lingual Abusive Language Detection

Investigating cross-lingual training for offensive language detection

XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering

Cross-Language Aphasia Detection using Optimal Transport Domain Adaptation

SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation

Cross-Domain Few-Shot Classification Via Adversarial Task Augmentation

Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection

Enhancing abusive language detection: A domain-adapted approach leveraging BERT pre-training tasks

Data Augmentations for Improved (Large) Language Model Generalization

Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding

Cross-lingual Text-independent Speaker Verification using Unsupervised Adversarial Discriminative Domain Adaptation

Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing

Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language Models

Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation

Cross-lingual offensive speech identification with transfer learning for low-resource languages

Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens

Exploring data augmentation in bias mitigation against non-native-accented speech