Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models

Jesse Atuhurra, Iqra Ali, Tatsuya Hiraoka, Hidetaka Kamigaito, Tomoya Iwakura, Taro Watanabe
2024-03-29
Abstract: Large language models (LLMs) have increased interest in vision language models (VLMs), which process image-text pairs as input. Studies investigating the visual understanding ability of VLMs have been proposed, but such studies are still preliminary because existing datasets do not permit a comprehensive evaluation of the fine-grained visual linguistic abilities of VLMs across multiple languages. To further explore the strengths of VLMs, such as GPT-4V (OpenAI, 2023), we developed new datasets for the systematic and qualitative analysis of VLMs. Our contribution is four-fold: 1) we introduced nine vision-and-language (VL) tasks (including object recognition, image-text matching, and more) and constructed multilingual visual-text datasets in four languages: English, Japanese, Swahili, and Urdu through utilizing templates containing questions and prompting GPT-4V to generate the answers and the rationales, 2) introduced a new VL task named unrelatedness, 3) introduced rationales to enable human understanding of the VLM reasoning process, and 4) employed human evaluation to measure the suitability of proposed datasets for VL tasks. We show that VLMs can be fine-tuned on our datasets. Our work is the first to conduct such analyses in Swahili and Urdu. Also, it introduces rationales in VL analysis, which played a vital role in the evaluation.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty of evaluating vision language models (VLMs) in multilingual settings, in particular the lack of comprehensive evaluation for languages other than English. The authors note that most existing datasets focus on English and that the text paired with each image is not descriptive enough, which limits thorough study of VLMs' visual understanding across languages and domains. To address these issues, the authors make the following contributions:

1. **Constructing multilingual visual-text datasets**: They introduce nine vision-and-language tasks and construct visual-text datasets in four languages: English, Japanese, Swahili, and Urdu. The datasets are built from templates containing the questions, with GPT-4V prompted to generate the answers and the reasoning behind them (the "rationales"), yielding rich image-text pairs.
2. **Proposing a new vision-and-language task**: In addition to established tasks, they introduce a new task called "unrelatedness", which measures a VLM's ability to identify the parts of a text that are unrelated to the image.
3. **Including rationales**: To help humans understand the VLM's reasoning process, the datasets include the explanations (rationales) the model generates for its answers.
4. **Human evaluation**: They recruited native speakers of each language to judge whether the proposed datasets are suitable for vision-and-language tasks.

In summary, the paper's main goal is to reveal the fine-grained visual-linguistic abilities of VLMs in multilingual settings, which it pursues by constructing new multilingual datasets and introducing new tasks and evaluation methods. It is also the first work to conduct such analyses in Swahili and Urdu.
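The construction described above (question templates, with a model supplying the answer and the rationale for each task and language) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the template wording, the `Record` fields, the `build_records` helper, and the `fake_model` stub are hypothetical and do not reproduce the authors' templates, data schema, or GPT-4V prompts. Only two of the nine tasks are shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The four languages named in the paper.
LANGUAGES = ["English", "Japanese", "Swahili", "Urdu"]

# Hypothetical question templates keyed by task name (illustrative wording only).
TEMPLATES: Dict[str, str] = {
    "object_recognition": "What objects appear in this image? Answer in {language}.",
    "unrelatedness": (
        "Which part of the following text is unrelated to the image? "
        "Answer in {language}. Text: {text}"
    ),
}


@dataclass
class Record:
    """One dataset entry: a templated question plus a generated answer and rationale."""
    image_id: str
    task: str
    language: str
    question: str
    answer: str
    rationale: str


# A model is any callable that maps (image_id, question) to (answer, rationale).
ModelFn = Callable[[str, str], Tuple[str, str]]


def build_records(image_id: str, text: str, ask_model: ModelFn) -> List[Record]:
    """Fill every template for every language and collect the model's output."""
    records: List[Record] = []
    for task, template in TEMPLATES.items():
        for language in LANGUAGES:
            # str.format ignores keyword arguments a template does not use.
            question = template.format(language=language, text=text)
            answer, rationale = ask_model(image_id, question)
            records.append(
                Record(image_id, task, language, question, answer, rationale)
            )
    return records


def fake_model(image_id: str, question: str) -> Tuple[str, str]:
    """Stand-in for a real VLM call; returns placeholder strings only."""
    return ("placeholder answer", "placeholder rationale")


records = build_records("img-001", "A dog runs on a beach.", fake_model)
```

In a real pipeline the stub would be replaced by a call to a vision language model that receives the image itself; the sketch only shows how a fixed set of templates expands into one record per task and language, each carrying a rationale alongside its answer.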