Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data

Nathan Vaska, Victoria Helus

2023-06-02

Abstract: The impressive advances and applications of large language and joint language-and-visual understanding models have led to an increased need for methods of probing their potential reasoning capabilities. However, the difficulty of gathering naturally occurring data for complex multi-modal reasoning tasks bottlenecks the evaluation of AI methods on tasks that are not already covered by an academic dataset. In this work, we leverage recent advances in high-resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks. We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset on a challenging task that is not well covered by existing datasets. We benchmark the performance of a state-of-the-art visual question answering (VQA) model against data generated with this method, and demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

The paper addresses the problem of evaluating how well large language models and joint language-and-vision models perform on multi-modal reasoning tasks. Specifically, it focuses on generating synthetic evaluation datasets when naturally occurring data for a complex multi-modal reasoning task is unavailable. Existing academic datasets such as VQAv2 and LSMDC do not cover every reasoning task, and context-dependent anomaly detection in particular is poorly covered. The paper therefore proposes using text-to-image generation to create synthetic datasets for evaluating multi-modal models on data-scarce tasks. With this method, the paper generates a context-dependent image anomaly dataset that is 100 times larger than the most similar existing dataset, using only publicly available data and minimal computational resources, and without human supervision. The experiments show that a state-of-the-art visual question answering (VQA) model performs significantly worse on this synthetic dataset than on standard VQA tasks, indicating a substantial gap in existing models' ability to detect context-dependent anomalies. The paper also proposes a similarity-based anomaly detection method as a benchmark, which further supports the validity and difficulty of the generated dataset.
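
The summary mentions a similarity-based anomaly detection method without giving its details. As a minimal sketch of the general idea only, assume that an object and its surrounding context have each been embedded as vectors by some encoder (the summary does not say which). One could then flag the object as anomalous when its cosine similarity to the context falls below a threshold. The function names, the toy vectors, and the threshold of 0.5 are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def is_context_anomaly(
    object_embedding: Sequence[float],
    context_embedding: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, not from the paper
) -> bool:
    """Flag the object as anomalous if it is dissimilar to its context."""
    return cosine_similarity(object_embedding, context_embedding) < threshold


# Toy embeddings standing in for real encoder outputs (illustrative only).
kitchen_context = [0.9, 0.1, 0.0]
toaster = [1.0, 0.2, 0.0]    # points in nearly the same direction as the context
surfboard = [0.0, 0.1, 1.0]  # points in a very different direction

toaster_flag = is_context_anomaly(toaster, kitchen_context)
surfboard_flag = is_context_anomaly(surfboard, kitchen_context)
```

In this toy setup, the toaster embedding is close to the kitchen context and is not flagged, while the surfboard embedding is nearly orthogonal to it and is flagged. In practice the threshold would need to be tuned on held-out data.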

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Visual Reasoning with Natural Language

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

Towards a Unified Multimodal Reasoning Framework

Multi-modal Situated Reasoning in 3D Scenes

NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense

SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

Probing Commonsense Reasoning Capability of Text-to-Image Generative Models Via Non-visual Description

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Convincing Rationales for Visual Question Answering Reasoning

RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

Learning Hierarchical Reasoning for Text-Based Visual Question Answering