Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning

Junyu Lu, Dixiang Zhang, Xiaojun Wu, Xinyu Gao, Ruyi Gan, Jiaxing Zhang, Yan Song, Pingjian Zhang
DOI: https://doi.org/10.48550/arXiv.2310.08166
2023-11-01
Abstract: Recent advancements enlarge the capabilities of large language models (LLMs) in zero-shot image-to-text generation and understanding by integrating multi-modal inputs. However, such success is typically limited to English scenarios due to the lack of large-scale and high-quality non-English multi-modal resources, making it extremely difficult to establish competitive counterparts in other languages. In this paper, we introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs) designed to incorporate visual semantics into LLMs for multi-modal dialogue. Composed of Ziya-Visual-Base and Ziya-Visual-Chat, our models adopt the Querying Transformer from BLIP-2 and further explore optimization schemes such as instruction tuning, multi-stage training, and a low-rank adaptation module for visual-language alignment. In addition, we draw on the understanding ability of GPT-4 in multi-modal scenarios, translating our gathered English image-text datasets into Chinese and generating instruction-response pairs through in-context learning. The experimental results demonstrate that, compared to existing LVLMs, Ziya-Visual achieves competitive performance across a wide range of English-only tasks including zero-shot image-text retrieval, image captioning, and visual question answering. The evaluation leaderboard assessed by GPT-4 also indicates that our models possess satisfactory image-text understanding and generation capabilities in Chinese multi-modal dialogue scenarios. Code, demo and models are available at https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1.
Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is the poor performance of current multimodal language models in non-English settings. Large vision-language models (LVLMs) have achieved remarkable success in English scenarios, but the lack of large-scale, high-quality non-English multimodal resources makes it difficult for them to reach the same level of performance in other languages. To address this, the paper introduces the Ziya-Visual series, a set of bilingual large-scale vision-language models that integrate visual semantics into large language models (LLMs) to support multimodal conversation. The Ziya-Visual models tackle the problem in the following ways:

1. **Bilingual ability**: Ziya-Visual supports English and specifically strengthens support for Chinese, enabling effective conversation and understanding in Chinese multimodal scenarios.
2. **Multi-task instruction tuning**: To improve multimodal understanding and generation, Ziya-Visual combines instruction tuning, multi-stage training, and a low-rank adaptation module. These methods help to better align visual and linguistic representations.
3. **Dataset construction**: To support training, the researchers constructed a bilingual multimodal in-context (BMMIC) dataset containing more than 5 million image-text pairs. GPT-4 is used to automatically translate English data and generate Chinese visual-language question-answer pairs, thereby enriching non-English multimodal resources.
4. **Model architecture**: Ziya-Visual is built on the pre-trained Ziya-LLaMA-13B language model, with a Vision Transformer as the visual encoder and a Q-Former as the querying transformer. The Q-Former compresses image features and aligns them with text.
5. **Experimental verification**: In experiments on multiple benchmarks, the paper shows that Ziya-Visual achieves performance comparable to, or even better than, the best existing monolingual LVLMs on tasks such as zero-shot image-text retrieval, image captioning, and visual question answering, and that it performs especially well on Chinese multimodal tasks.

In conclusion, by constructing bilingual multimodal models and datasets, this paper aims to overcome the limitations of existing LVLMs in non-English scenarios and to promote the application and development of multimodal technology in more languages.
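
To make the architecture point concrete, the following is a minimal sketch of how a querying transformer can compress a variable number of image patch features into a fixed number of visual tokens for a language model. It is not the authors' implementation: it reduces the idea to a single cross-attention head in NumPy, and every name and size (`compress_image_features`, `num_queries`, `d_llm`, the random weights) is a hypothetical, illustrative choice rather than a value taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_image_features(patch_feats, queries, w_q, w_k, w_v, w_proj):
    """Single-head cross-attention: learned queries attend over image patches.

    patch_feats: (num_patches, d_vis) output of a visual encoder
    queries:     (num_queries, d_vis) learned query embeddings
    Returns visual tokens of shape (num_queries, d_llm) and the attention map.
    """
    q = queries @ w_q                       # (num_queries, d_attn)
    k = patch_feats @ w_k                   # (num_patches, d_attn)
    v = patch_feats @ w_v                   # (num_patches, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)         # each query's weights over patches
    pooled = attn @ v                       # (num_queries, d_attn)
    visual_tokens = pooled @ w_proj         # project into the LLM embedding space
    return visual_tokens, attn


# Toy sizes, chosen only for illustration (assumptions, not the paper's values).
rng = np.random.default_rng(0)
num_patches, d_vis = 257, 64
num_queries, d_attn, d_llm = 32, 48, 128

patch_feats = rng.normal(size=(num_patches, d_vis))
queries = rng.normal(size=(num_queries, d_vis))
w_q = rng.normal(size=(d_vis, d_attn)) * 0.1
w_k = rng.normal(size=(d_vis, d_attn)) * 0.1
w_v = rng.normal(size=(d_vis, d_attn)) * 0.1
w_proj = rng.normal(size=(d_attn, d_llm)) * 0.1

visual_tokens, attn = compress_image_features(
    patch_feats, queries, w_q, w_k, w_v, w_proj
)
print(visual_tokens.shape, attn.shape)
```

The point of the design is that the number of tokens handed to the language model depends only on the number of learned queries, not on the number of image patches, so the visual prefix stays short regardless of image resolution. A real Q-Former stacks several such layers together with self-attention and feed-forward blocks.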