Abstract: Modular deep learning has been proposed for the efficient adaption of pre-trained models to new tasks, domains and languages. In particular, combining language adapters with task adapters has shown potential where no supervised data exists for a language. In this paper, we explore the role of language adapters in zero-shot cross-lingual transfer for natural language understanding (NLU) benchmarks. We study the effect of including a target-language adapter in detailed ablation studies with two multilingual models and three multilingual datasets. Our results show that the effect of target-language adapters is highly inconsistent across tasks, languages and models. Retaining the source-language adapter instead often leads to an equivalent, and sometimes to a better, performance. Removing the language adapter after training has only a weak negative effect, indicating that the language adapters do not have a strong impact on the predictions.
What problem does this paper attempt to address?
This paper asks whether target-language adapters play a consistent and effective role in zero-shot cross-lingual transfer for natural language understanding (NLU) tasks. Through detailed ablation studies, the authors compare three settings in the absence of supervised target-language data: using a target-language adapter, retaining the source-language adapter, and using no language adapter at all. The paper focuses on three research questions:
1. **Is the positive effect of target-language adapters consistent across different languages, models, and tasks?**
The authors compare the target-language adapter against the other settings (retaining the source-language adapter, or using only the task adapter) to evaluate how robust its benefit is.
2. **How much does the model depend on language adapters?**
The authors remove the language adapter used during training, without replacing it, and measure the resulting performance drop to gauge how strongly the model depends on language adapters.
3. **Does the amount of source-language and target-language pre-training data in the base model affect the effect of target-language adapters?**
The authors analyze how well the source and target languages are represented in the base model's pre-training corpus and explore how this affects the benefit of target-language adapters.
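The three evaluation settings compared in these questions can be sketched in a few lines. This is a toy illustration, not the authors' code: the frozen base model and the adapters are stand-in functions on short vectors, the language codes `en` and `qu` are arbitrary examples, and all names (`make_adapter`, `forward`, `ablation_settings`) are hypothetical. It also collapses the per-layer adapter stack into a single step, which is a simplifying assumption.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_adapter(shift: float) -> Layer:
    """Toy stand-in for a trained adapter: adds a constant shift to each value."""
    return lambda h: [x + shift for x in h]


def frozen_base(h: Vector) -> Vector:
    """Toy stand-in for the frozen multilingual base model."""
    return [2.0 * x for x in h]


# Hypothetical adapters; the shift values are arbitrary placeholders.
LANGUAGE_ADAPTERS: Dict[str, Layer] = {
    "en": make_adapter(0.5),    # source-language adapter
    "qu": make_adapter(-0.25),  # target-language adapter
}
task_adapter: Layer = make_adapter(1.0)  # trained on source-language task data


def forward(h: Vector, language: Optional[str]) -> Vector:
    """Base model, then an optional language adapter, then the task adapter."""
    h = frozen_base(h)
    if language is not None:
        h = LANGUAGE_ADAPTERS[language](h)
    return task_adapter(h)


def ablation_settings(source: str, target: str) -> Dict[str, Optional[str]]:
    """Map each evaluation setting to the language adapter active at test time."""
    return {
        "target_adapter": target,     # swap in the target-language adapter
        "source_adapter": source,     # keep the adapter used during training
        "no_language_adapter": None,  # remove it without replacement
    }


settings = ablation_settings(source="en", target="qu")
outputs = {name: forward([1.0], lang) for name, lang in settings.items()}
```

In a real experiment, each setting would be scored on target-language test data with the same task adapter, so that any difference between settings comes from the language adapter alone.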
### Main Findings
1. **The effect of target-language adapters is inconsistent**
- With **XLM-R**, target-language adapters improve performance by an average of 2.4% across all combinations of task, source language, and target language.
- With **mBERT**, by contrast, they reduce performance by an average of 2.1%.
- On **XCOPA**, target-language adapters are crucial for skill transfer, especially with **XLM-R**, and they also help to some extent with **mBERT**.
- On the other two datasets (**PAWS-X** and **XNLI**), the results are more mixed. Even where target-language adapters have an advantage, retaining the source-language adapter does not significantly hurt performance.
2. **The model has a low dependence on language adapters**
- **XLM-R** degrades only slightly, by 1.6%, after the language adapters are removed.
- **mBERT** is more sensitive, with performance degradations of 2.9% and 5.0% respectively.
- This indicates that language adapters contribute little and that the model relies mainly on the multilingual ability of the frozen base model.
3. **The influence of pre-training resources is not obvious**
- Transfer from high-resource to low-resource languages is inconsistent, in contrast to observations on the named-entity recognition task.
- Low-resource languages improve with **XLM-R** on **XNLI**, but the improvement is not evident in other model-task combinations.
4. **Differences between different datasets**
- **XCOPA** depends more strongly on target-language adapters, while on **PAWS-X** and **XNLI** good cross-lingual transfer is possible without language adapters, through the multilingual ability of the pre-trained model alone.
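The averages quoted in these findings aggregate over many combinations of task, source language, and target language. A minimal sketch of that aggregation follows, assuming the reported figures are mean differences in score between two settings (the summary does not say whether they are absolute or relative). The function name `mean_delta` and every score below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, Tuple

# (task, source language, target language)
Key = Tuple[str, str, str]


def mean_delta(baseline: Dict[Key, float], variant: Dict[Key, float]) -> float:
    """Mean of (variant - baseline) over the combinations present in both."""
    shared = sorted(set(baseline) & set(variant))
    if not shared:
        raise ValueError("no shared (task, source, target) combinations")
    return mean(variant[key] - baseline[key] for key in shared)


# Placeholder scores, invented purely to exercise the function.
source_adapter_scores = {
    ("task_a", "src", "tgt_1"): 60.0,
    ("task_a", "src", "tgt_2"): 70.0,
}
target_adapter_scores = {
    ("task_a", "src", "tgt_1"): 62.0,
    ("task_a", "src", "tgt_2"): 69.0,
}

delta = mean_delta(source_adapter_scores, target_adapter_scores)
```

An average of this kind can hide opposite-signed effects on individual combinations, as in the placeholder data here, which is why the per-dataset breakdown matters alongside the overall figures.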
### Conclusion
The paper concludes that the role of target-language adapters in zero-shot cross-lingual transfer is not as clear and consistent as expected. Although target-language adapters can significantly improve performance in some cases, in most cases retaining the source-language adapter, or using no language adapter at all, achieves similar or even better results. The modular role of language adapters therefore varies across model, task, and language combinations, and future research needs to explore their value in specific scenarios.