Abstract: Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images such as charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a significant drop in performance in cross-lingual scenarios, even for models with multilingual capabilities. A mutual information analysis suggests that this performance gap stems from cross-lingual questions failing to adequately activate relevant visual information. To mitigate this issue, we propose MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information), which builds a visual-text cross-lingual alignment by maximizing the mutual information between the model's outputs and the visual information. This is achieved by distilling knowledge from the monolingual setting to the cross-lingual setting through KL divergence minimization, with the monolingual output logits serving as the teacher. Experimental results on XT-VQA demonstrate that MVCL-MI effectively reduces the visual-text cross-lingual performance disparity while preserving the inherent capabilities of LVLMs, suggesting a practical direction for improving LVLMs. Code is available at: https://github.com/Stardust-y/XTVQA.git
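
The distillation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the KL(teacher || student) direction, and the averaging over token positions are all assumptions made for illustration. The teacher logits stand for the model's outputs when the question is in the same language as the image text, and the student logits for its outputs on the cross-lingual question.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over token positions.

    teacher_logits: per-position logits from the monolingual setting
                    (assumed to be treated as fixed targets).
    student_logits: per-position logits from the cross-lingual setting.
    Both are lists of equal-length lists of floats (hypothetical layout).
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i)
        total += sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_logp, s_logp))
    return total / len(teacher_logits)


# Toy example: one token position, vocabulary of three tokens.
teacher = [[2.0, 0.5, -1.0]]   # monolingual question (teacher)
student = [[0.1, 0.2, 0.3]]    # cross-lingual question (student)
loss = kl_distillation_loss(teacher, student)
```

The loss is zero when the cross-lingual output distribution matches the monolingual one and grows as the two diverge, so minimizing it pulls the cross-lingual predictions toward the monolingual ones.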

Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective
