Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, Maosong Sun
2024-03-22
Abstract: Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in non-English languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can well generalize to other languages in a (quasi)-zero-shot manner, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice of MPM, we build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM.git.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks how to train competitive large multimodal models for non-English languages that lack multimodal data. It proposes MPM, a training paradigm that uses a strong multilingual large language model (LLM) as a pivot to transfer multimodal capabilities from English to other languages in a zero-shot or quasi-zero-shot manner. The approach reduces dependence on native-language multimodal data, and the resulting models can match or even surpass models trained on native-language image-text data.

### Specific problem description:

1. **Limitations of multimodal learning**: Most successful multimodal models, for both image-to-text and text-to-image generation, target English, and multimodal capabilities in other languages lag far behind.
2. **Insufficient data resources**: Non-English languages lack large-scale, high-quality image-text pairs, which slows multimodal research in these languages.
3. **Cross-language transfer**: The open question is how to use the abundant English multimodal data, with a multilingual model as intermediary, to transfer multimodal capabilities to non-English languages.

### Solutions:

- **MPM training paradigm**: Training proceeds in two stages, multilingual alignment and multimodal alignment, with a pretrained multilingual LLM (such as CPM-Bee) serving as the pivot that carries multimodal capabilities from English to the target language.
- **Practical applications**: Taking Chinese as an example, the authors build the VisCPM series of models: VisCPM-Chat for image-to-text generation and VisCPM-Paint for text-to-image generation. These models perform strongly on Chinese multimodal tasks and even outperform models trained on native Chinese multimodal data.

### Main contributions:

1. **Proposing MPM**: An effective training paradigm designed for non-English languages that lack multimodal resources.
2. **Developing VisCPM**: A series of large Chinese multimodal models that achieve state-of-the-art (open-source) performance on Chinese multimodal tasks.
3. **Open-sourcing code and model weights**: The code and model weights are released, along with experimental details, for use by other researchers.
4. **Multilingual verification**: The authors verify that the approach generalizes to multiple languages and develop a multilingual multimodal dialogue model supporting six languages.

Together, these results address the shortage of data for training non-English multimodal models and demonstrate the potential of cross-language multimodal learning.
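
The pivoting idea behind the two-stage paradigm can be illustrated with a deliberately tiny toy, sketched below. It is not the VisCPM implementation: the concept ids, the 2-D "image features", and the function names `train_visual_prototypes` and `caption_in_chinese` are all hypothetical, and a nearest-prototype classifier stands in for real model training. The only point it makes is that if English and Chinese share one internal representation (stage 1), visual supervision given in English alone (stage 2) is enough to produce Chinese output.

```python
from math import dist

# Stage 1 stand-in (multilingual alignment): assume the multilingual LLM maps
# English and Chinese words for the same concept to one shared concept id.
# These tables are hypothetical toy data, not part of any real model.
EN_TO_CONCEPT = {"cat": 0, "dog": 1, "car": 2}
CONCEPT_TO_ZH = {0: "猫", 1: "狗", 2: "汽车"}

# Stage 2 training data (multimodal alignment): toy 2-D image features
# paired with English captions only. No Chinese image-text pair is used.
ENGLISH_PAIRS = [
    ((0.9, 0.1), "cat"),
    ((1.1, -0.1), "cat"),
    ((0.0, 1.0), "dog"),
    ((0.2, 0.8), "dog"),
    ((-1.0, -1.0), "car"),
    ((-0.8, -1.2), "car"),
]


def train_visual_prototypes(pairs):
    """Average the image features seen for each concept (English supervision only)."""
    sums, counts = {}, {}
    for feat, caption in pairs:
        cid = EN_TO_CONCEPT[caption]
        acc = sums.setdefault(cid, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += x
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: tuple(v / counts[cid] for v in acc) for cid, acc in sums.items()}


def caption_in_chinese(feat, prototypes):
    """Map an image to its nearest concept, then decode that concept in Chinese."""
    cid = min(prototypes, key=lambda c: dist(feat, prototypes[c]))
    return CONCEPT_TO_ZH[cid]


prototypes = train_visual_prototypes(ENGLISH_PAIRS)
print(caption_in_chinese((1.0, 0.05), prototypes))  # prints 猫
```

In this toy, the Chinese vocabulary never appears during the "visual" training step; it becomes reachable only through the shared concept ids, which is the role the summary attributes to the multilingual LLM.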