Abstract: Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in non-English languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can well generalize to other languages in a (quasi)-zero-shot manner, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice of MPM, we build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM.git.

What problem does this paper attempt to address?

This paper addresses the problem of how to train competitive large multimodal models in non-English languages that lack multimodal data. It proposes MPM, a training paradigm named for the observation that Multilingual language models can Pivot zero-shot Multimodal learning across languages. MPM uses a strong multilingual large language model (LLM) as the pivot, so that multimodal capabilities learned from English data transfer to other languages in a zero-shot or quasi-zero-shot manner. This reduces the dependence on native-language multimodal data, and the resulting models can match or even exceed models trained on image-text data in the native language.

### Specific problem description:

1. **Limitations of multimodal learning**: Most successful multimodal models, for both image-to-text and text-to-image generation, are limited to English, and multimodal capabilities in other languages lag far behind.
2. **Insufficient data resources**: Non-English languages lack large-scale, high-quality image-text pairs, which slows multimodal research in these languages.
3. **Cross-language transfer**: The open question is how to use the abundant English multimodal data and, with a multilingual model as the pivot, transfer multimodal capabilities to non-English languages.

### Solutions:

- **MPM training paradigm**: Training proceeds in two stages (multilingual alignment and multimodal alignment). A pretrained multilingual LLM (such as CPM-Bee) serves as the pivot that carries multimodal capabilities from English to the target language.
- **Practical applications**: Taking Chinese as an example, the authors build the VisCPM series of models: VisCPM-Chat for image-to-text generation and VisCPM-Paint for text-to-image generation. These models perform strongly on Chinese multimodal tasks and even outperform models trained on native Chinese multimodal data.

### Main contributions:

1. **Proposing MPM**: An effective training paradigm designed for non-English languages that lack multimodal resources.
2. **Developing VisCPM**: A series of large Chinese multimodal models that achieve state-of-the-art (open-source) performance on Chinese multimodal tasks.
3. **Open-sourcing code and model weights**: The release is accompanied by detailed experimental settings for reference by other researchers.
4. **Multilingual verification**: The paper verifies that the approach generalizes to multiple languages and develops a multilingual multimodal dialogue model supporting six languages.

In this way, the paper tackles the shortage of data for training multimodal models in non-English languages and demonstrates the potential of cross-language multimodal learning.
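
The pivot idea can be illustrated with a toy sketch. This is not the authors' implementation: the real MPM uses a multilingual LLM and a visual module, whereas the sketch below substitutes a hand-written embedding table (`TEXT_EMBED`) for the frozen multilingual text space and a small linear projection for the visual module. All vectors, names, and hyperparameters are hypothetical. A projection is fitted on English-only image-text pairs, then queried with a Chinese vocabulary it never saw during training.

```python
# Hypothetical frozen multilingual text space. It stands in for the result of
# multilingual alignment: English and Chinese words for one concept lie close.
TEXT_EMBED = {
    "cat": [1.0, 0.0, 0.0], "猫": [0.9, 0.1, 0.0],
    "dog": [0.0, 1.0, 0.0], "狗": [0.1, 0.9, 0.0],
    "car": [0.0, 0.0, 1.0], "车": [0.0, 0.1, 0.9],
}

# Hypothetical 4-dimensional "image features", one per concept.
IMAGE_FEATS = {
    "cat": [1.0, 0.2, 0.0, 0.5],
    "dog": [0.1, 1.0, 0.3, 0.0],
    "car": [0.0, 0.1, 1.0, 0.8],
}


def project(W, x):
    """Map an image feature vector into the text space with matrix W."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def train_english_only(epochs=500, lr=0.1):
    """Fit W by per-sample gradient descent on squared error.

    Only English words are used as targets; the text space stays frozen.
    """
    W = [[0.0] * 4 for _ in range(3)]
    pairs = [(IMAGE_FEATS[c], TEXT_EMBED[c]) for c in ("cat", "dog", "car")]
    for _ in range(epochs):
        for x, target in pairs:
            pred = project(W, x)
            for i in range(3):
                err = pred[i] - target[i]
                for j in range(4):
                    W[i][j] -= lr * 2.0 * err * x[j]
    return W


def nearest_word(vec, vocab):
    """Return the word in vocab whose embedding is closest to vec."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(vec, TEXT_EMBED[word]))
    return min(vocab, key=sq_dist)


W = train_english_only()

# Zero-shot query: Chinese words were never used as training targets.
CHINESE_VOCAB = ["猫", "狗", "车"]
zero_shot = {
    concept: nearest_word(project(W, feats), CHINESE_VOCAB)
    for concept, feats in IMAGE_FEATS.items()
}
print(zero_shot)
```

Because the Chinese words sit next to their English counterparts in the shared text space, a projection trained only against English targets also lands near the right Chinese word. The sketch shows only this geometric intuition; it says nothing about the scale, architecture, or training details of VisCPM.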

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

Multimodal Pretraining from Monolingual to Multilingual

Multimodal Large Language Models: A Survey

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Unified Generative and Discriminative Training for Multi-modal Large Language Models

A Survey on Multimodal Large Language Models

On the Hidden Mystery of OCR in Large Multimodal Models

Efficient Multimodal Learning from Data-centric Perspective

Efficient Multimodal Large Language Models: A Survey

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training

Cross-Modal Consistency in Multimodal Large Language Models

A Survey of Multimodal Large Language Model from A Data-centric Perspective

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

MAMO: Fine-Grained Vision-Language Representations Learning with Masked Multimodal Modeling

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Few-shot Learning with Multilingual Language Models

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

CPM-2: Large-scale Cost-effective Pre-trained Language Models