Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Heejeong Nam, Jinwoo Ahn
2024-11-21
Abstract: The ability to perform complex reasoning across multimodal inputs is essential for models to effectively interact with humans in real-world scenarios. Advancements in vision-language models have significantly improved performance on tasks that require processing explicit and direct textual inputs, such as Visual Question Answering (VQA) and Visual Grounding (VG). However, less attention has been given to improving models' ability to comprehend nuanced and ambiguous forms of communication. This presents a critical challenge, as human language in real-world interactions often conveys hidden intentions that rely on context for accurate interpretation. To address this gap, we propose VAGUE, a multimodal benchmark comprising 3.9K indirect human utterances paired with corresponding scenes. Additionally, we contribute a model-based pipeline for generating prompt-solution pairs from input images. Our work aims to delve deeper into the ability of models to understand indirect communication and seeks to contribute to the development of models capable of more refined and human-like interactions. Extensive evaluation on multiple VLMs reveals that mainstream models still struggle with indirect communication when required to perform complex linguistic and visual reasoning. We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses the limited ability of current vision-language models to handle indirect and ambiguous expressions. Existing models perform well on explicit, direct text inputs, but they still fall short in understanding the nuances and implicit intentions of human communication. That ability is crucial for models to interact naturally and intuitively with humans in the real world. To this end, the authors propose VAGUE, a multimodal benchmark dataset containing 3,900 indirect human expressions and their corresponding scenes. It is designed to evaluate how well models interpret ambiguous text expressions in a multimodal context and to promote the development of models that better understand indirect communication.

The main contributions of the paper include:

1. **Constructing the VAGUE benchmark dataset**: The dataset is built specifically to evaluate models' ability to interpret indirect text expressions in a multimodal context.
2. **Proposing a data generation pipeline**: The pipeline generates direct and indirect expressions, together with their corresponding solutions and wrong options, which are used to construct multiple-choice test questions.
3. **Evaluating multiple multimodal models**: Eight multimodal models of different sizes are evaluated on the proposed benchmark task, and ablation experiments with six different task prompts are carried out.

Through these contributions, the paper aims to fill the gap in existing research on ambiguous and indirect expressions and to advance models' ability to understand and interpret complex human communication.
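
To make the multiple-choice setup described above more concrete, the following is a minimal sketch of how one such item could be represented and scored. It is not the authors' pipeline or data format: the class `MultipleChoiceItem`, the functions `build_item` and `accuracy`, all field names, and the sample utterance are hypothetical and chosen only for illustration. It also assumes that the wrong options differ from the solution.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MultipleChoiceItem:
    """One hypothetical benchmark item; field names are illustrative only."""
    image_path: str          # scene paired with the utterance
    indirect_utterance: str  # ambiguous expression the model must interpret
    options: List[str]       # the solution mixed with the wrong options
    answer_index: int        # position of the solution inside `options`


def build_item(image_path: str, indirect_utterance: str, solution: str,
               wrong_options: Sequence[str], rng: random.Random) -> MultipleChoiceItem:
    """Shuffle the solution among the wrong options and record where it lands."""
    options = [solution] + list(wrong_options)
    rng.shuffle(options)
    return MultipleChoiceItem(image_path, indirect_utterance,
                              options, options.index(solution))


def accuracy(items: Sequence[MultipleChoiceItem],
             predict: Callable[[MultipleChoiceItem], int]) -> float:
    """Fraction of items for which `predict` returns the correct option index."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)


# Invented example, not taken from the VAGUE dataset.
rng = random.Random(0)
item = build_item(
    image_path="scene_001.jpg",
    indirect_utterance="It is getting a bit chilly in here.",
    solution="Close the open window.",
    wrong_options=["Turn on the television.", "Open the fridge.", "Water the plant."],
    rng=rng,
)


def oracle(it: MultipleChoiceItem) -> int:
    """Stand-in predictor that always answers correctly; a real model call goes here."""
    return it.answer_index


print(accuracy([item], oracle))
```

In practice, `predict` would wrap a call to a vision-language model that receives the image, the utterance, and the options, and returns the index of the option it selects.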