Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models

Jesse Atuhurra, Iqra Ali, Tatsuya Hiraoka, Hidetaka Kamigaito, Tomoya Iwakura, Taro Watanabe

2024-03-29

Abstract: Large language models (LLMs) have increased interest in vision language models (VLMs), which process image-text pairs as input. Studies investigating the visual understanding ability of VLMs have been proposed, but such studies are still preliminary because existing datasets do not permit a comprehensive evaluation of the fine-grained visual linguistic abilities of VLMs across multiple languages. To further explore the strengths of VLMs, such as GPT-4V (OpenAI, 2023), we developed new datasets for the systematic and qualitative analysis of VLMs. Our contribution is four-fold: 1) we introduced nine vision-and-language (VL) tasks (including object recognition, image-text matching, and more) and constructed multilingual visual-text datasets in four languages (English, Japanese, Swahili, and Urdu) by using templates containing questions and prompting GPT-4V to generate the answers and the rationales; 2) we introduced a new VL task named unrelatedness; 3) we introduced rationales to enable human understanding of the VLM reasoning process; and 4) we employed human evaluation to measure the suitability of the proposed datasets for VL tasks. We show that VLMs can be fine-tuned on our datasets. Our work is the first to conduct such analyses in Swahili and Urdu. It also introduces rationales in VL analysis, which played a vital role in the evaluation.

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the evaluation of vision language models (VLMs) in multilingual settings, in particular the lack of comprehensive evaluation for languages other than English. The authors note that most current datasets focus on English and that the descriptive information in their image-text pairs is not rich enough, which limits thorough study of VLMs' visual understanding across languages and domains. To address these issues, the research team made the following contributions:

1. **Constructing a Multilingual Visual-Text Dataset**: They introduced nine vision-and-language tasks and constructed multilingual visual-text datasets in four languages: English, Japanese, Swahili, and Urdu. The datasets were created by using templates containing questions and having GPT-4V generate the answers and the reasoning behind them (i.e., "rationales"), yielding rich image-text pairs.
2. **Proposing a New Vision-Language Task**: In addition to traditional tasks, they introduced a new task called "unrelatedness" to measure a VLM's ability to identify the parts of a text that are unrelated to the image.
3. **Including Reasoning Process Explanations (Rationales)**: To help humans understand the VLM's reasoning process, the dataset includes the explanations (rationales) the model generated for its answers.
4. **Human Evaluation**: They recruited native speakers of each language to evaluate whether the proposed datasets are suitable for vision-language tasks.

In summary, the main goal of this paper is to reveal the fine-grained vision-language capabilities of VLMs in multilingual settings, which it pursues by constructing new multilingual datasets and introducing new tasks and evaluation methods. It is also the first work to conduct such analysis in Swahili and Urdu.
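
The template-based construction described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the template wording, the `Record` fields, the `build_prompt` and `make_record` helpers, and the `fake_vlm` stand-in are all hypothetical, and only the three task names and four languages come from the text. A real setup would replace `fake_vlm` with a call to an actual VLM.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Languages named in the abstract.
LANGUAGES = ["English", "Japanese", "Swahili", "Urdu"]

# Hypothetical question templates. Only the task names come from the text;
# the wording of each template is an assumption made for illustration.
TEMPLATES = {
    "object recognition": "What objects are visible in the image?",
    "image-text matching": "Does the following text match the image? Text: {text}",
    "unrelatedness": "Which part of the following text is unrelated to the image? Text: {text}",
}


@dataclass
class Record:
    """One dataset entry: a templated question plus a generated answer and rationale."""
    image_id: str
    task: str
    language: str
    question: str
    answer: str
    rationale: str


def build_prompt(task: str, language: str, text: str = "") -> Tuple[str, str]:
    """Fill the task template and wrap it in an instruction for the model."""
    question = TEMPLATES[task].format(text=text)
    prompt = (
        f"{question}\n"
        f"Respond in {language}. Give the answer first, then the rationale behind it."
    )
    return question, prompt


def make_record(
    image_id: str,
    task: str,
    language: str,
    text: str,
    vlm: Callable[[str, str], Tuple[str, str]],
) -> Record:
    """Query the (stand-in) model and store question, answer, and rationale together."""
    question, prompt = build_prompt(task, language, text)
    answer, rationale = vlm(image_id, prompt)
    return Record(image_id, task, language, question, answer, rationale)


def fake_vlm(image_id: str, prompt: str) -> Tuple[str, str]:
    """Placeholder for a real VLM call; returns a fixed answer and rationale."""
    return ("a dog", "The image shows a four-legged animal wearing a collar.")


# One record per (task, language) pair for a single hypothetical image.
records = [
    make_record("img-001", task, language, "A dog on a beach", fake_vlm)
    for task in TEMPLATES
    for language in LANGUAGES
]
```

Keeping the rationale as its own field next to the answer mirrors the role the summary gives it: a human evaluator can check the explanation separately from the answer.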
