Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

Brendan Park, Madeline Janecek, Naser Ezzati-Jivan, Yifeng Li, Ali Emami
2024-06-04
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates the models' ability in pronoun disambiguation from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally surpassing random guessing. Further error analysis identifies important areas for future research aimed at advancing text-to-image models in their ability to interpret and interact with the complex visual world.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper asks how to evaluate, and ultimately improve, the ability of text-to-image models to resolve ambiguous pronouns in a multimodal setting. Large language models (LLMs) have achieved remarkable success on text-only tasks such as the Winograd Schema Challenge (WSC), but tasks that require understanding text and images together remain difficult. To address this gap, the authors introduce WinoVis, a dataset designed specifically to test pronoun disambiguation by text-to-image models in a multimodal context. The main contributions of the paper are:

1. **Multimodal Dataset Adapted from WSC (WinoVis)**: A dataset of 500 scenes for benchmarking the pronoun-disambiguation ability of text-to-image models in a visual context.
2. **New Evaluation Framework for Multimodal Disambiguation**: Metrics and methods designed to separate a model's performance on pronoun disambiguation from other visual-processing challenges.
3. **In-depth Analysis of the Commonsense Reasoning Ability of Stable Diffusion**: Experiments show that even Stable Diffusion 2.0, the most recent version evaluated, remains far from human-level performance on pronoun disambiguation.

Through these contributions, the paper aims to advance pronoun disambiguation in the multimodal field and to point out directions for future research.
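To make the idea of heatmap-based evaluation more concrete, the following is a minimal sketch of how a disambiguation decision could be read off attribution heatmaps and scored. It is an illustration under stated assumptions, not the paper's implementation: the heatmaps are taken to be small 2D grids of values in [0, 1] (such as DAAM might produce after normalization), the overlap measure (intersection over union of thresholded maps), the 0.5 threshold, and all function names are hypothetical, and the exact metric definitions used for WinoVis are not given in this summary.

```python
from typing import List, Sequence

Heatmap = Sequence[Sequence[float]]


def binarize(heatmap: Heatmap, threshold: float = 0.5) -> List[List[bool]]:
    """Turn a heatmap into a boolean mask (threshold is an assumed value)."""
    return [[value >= threshold for value in row] for row in heatmap]


def iou(mask_a: List[List[bool]], mask_b: List[List[bool]]) -> float:
    """Intersection over union of two same-shaped boolean masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += int(a and b)
            union += int(a or b)
    return intersection / union if union else 0.0


def predict_referent(pronoun_map: Heatmap,
                     candidate_maps: Sequence[Heatmap],
                     threshold: float = 0.5) -> int:
    """Return the index of the candidate whose mask overlaps the pronoun's mask most."""
    pronoun_mask = binarize(pronoun_map, threshold)
    scores = [iou(pronoun_mask, binarize(c, threshold)) for c in candidate_maps]
    return max(range(len(scores)), key=scores.__getitem__)


def fraction_correct(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Share of scenes where the predicted referent matches the labeled one."""
    if not predictions:
        return 0.0
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(predictions)


# Toy 2x2 example: the pronoun's attention falls on the left column,
# which is where candidate 0 is depicted.
pronoun = [[0.9, 0.1], [0.8, 0.0]]
candidate_0 = [[0.7, 0.0], [0.9, 0.1]]
candidate_1 = [[0.0, 0.9], [0.0, 0.8]]
choice = predict_referent(pronoun, [candidate_0, candidate_1])
```

With two candidate referents per scene, a model that picks at random would be right about half the time under a scoring rule like this one, which is the baseline against which the reported 56.7% should be read.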