Abstract: Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images from charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a significant drop in performance in cross-lingual scenarios, even for models with multilingual capabilities. A mutual information analysis suggests that this performance gap stems from cross-lingual questions failing to adequately activate relevant visual information. To mitigate this issue, we propose MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information), which builds a visual-text cross-lingual alignment by maximizing the mutual information between the model's outputs and the visual information. This is achieved by distilling knowledge from the monolingual setting to the cross-lingual setting through KL divergence minimization, with the monolingual output logits serving as the teacher. Experimental results on XT-VQA demonstrate that MVCL-MI effectively reduces the visual-text cross-lingual performance disparity while preserving the inherent capabilities of LVLMs, suggesting a practical direction for improving LVLMs. Code is available at: https://github.com/Stardust-y/XTVQA.git
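
The distillation step described in the abstract can be illustrated with a minimal sketch. The abstract only states that the KL divergence is minimized with the monolingual output logits as the teacher; the direction KL(teacher || student), the `temperature` parameter, and the function names below are assumptions made for illustration, not the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one output position.

    Assumption: teacher_logits come from the monolingual prompt and
    student_logits from the cross-lingual prompt on the same image.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (monolingual)
    log_q = log_softmax(student_logits, temperature)  # student (cross-lingual)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical logits over a three-token vocabulary.
monolingual = [2.0, 0.5, -1.0]
cross_lingual = [0.1, 0.2, 0.3]
loss = kl_distillation_loss(monolingual, cross_lingual)
```

The loss is zero when the two sets of logits induce the same distribution and positive otherwise, so minimizing it pulls the cross-lingual output distribution toward the monolingual one. In a real training setup the loss would be averaged over answer tokens and backpropagated through the student pass only.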

Probing Visual Language Priors in VLMs

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Right this way: Can VLMs Guide Us to See More to Answer Questions?

Revisiting the Role of Language Priors in Vision-Language Models

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Good Questions Help Zero-Shot Image Reasoning

Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective

Are VLMs Really Blind

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering

Visually-Augmented Language Modeling

Rethinking VLMs and LLMs for Image Classification

Overcoming language priors with self-contrastive learning for visual question answering

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts