Abstract: GuessWhich is a visual dialogue game in which a Questioner Bot (QBot) and an Answer Bot (ABot) interact in an image-guessing setting. QBot's objective is to locate a concealed image solely by posing a series of visually related questions to ABot. However, modeling visually related reasoning in QBot's decision-making process is a significant challenge. Current approaches either lack visual information or rely on a single real image sampled at each round as decoding context, and both are inadequate for visual reasoning. To address this limitation, we propose an approach that performs visually related reasoning through a mental model of the undisclosed image. Within this framework, QBot learns to represent mental imagery, enabling robust visual reasoning by tracking the dialogue state. The dialogue state comprises a collection of representations of mental imagery, as well as representations of the entities involved in the conversation. At each round, QBot uses the dialogue state to perform visually related reasoning, construct an internal representation, and generate a relevant question; upon receiving an answer, it updates both the dialogue state and the internal representation. Experimental results on the VisDial datasets (v0.5, v0.9, and v1.0) demonstrate the effectiveness of the proposed model, which achieves new state-of-the-art performance across all metrics and datasets. Code and datasets from our experiments are freely available at https://github.com/xubuvd/GuessWhich.

What problem does this paper attempt to address?

The paper primarily addresses the challenges faced by the Questioner Bot (QBot) in the visual dialogue game "GuessWhich." In this game, QBot must guess a hidden image by asking the Answer Bot (ABot) a series of image-related questions. Modeling this visual reasoning process effectively is a significant challenge for QBot. Current methods have two main issues:

1. Non-visual methods, which lack visual information or rely solely on textual information, cannot fully utilize visual cues for reasoning.
2. Real-image methods, which sample a single real image from a large number of candidate images as decoding context, are not only unnatural but also introduce significant sampling bias, leading to an unstable reasoning process.

To address these issues, the paper proposes a GuessWhich game model based on multimodal dialogue state tracking (DST). The core of this method is to construct QBot's mental representation of the undisclosed image and to perform visual reasoning by tracking the dialogue state. Specifically, the DST model includes the following key components:

- **Recursive Self-Referential Equation (R-SRE)**: Captures visually related interactions within the dialogue state and across different modalities.
- **Visual Reasoning based on Dialogue State (VRDS)**: Conducts three-step reasoning, from text to text, then to image, and back to text, to generate internal representations.
- **Question Decoder (QDer)**: Uses the generated internal representations to produce new questions.
- **QBot Encoder (QEnc)**: Uses the pre-trained vision-language model ViLBERT to process input data.
- **State Tracking (STrack)**: Updates the dialogue state by adding new entities or updating existing ones.

Experimental results show that the proposed DST model achieves state-of-the-art performance on the VisDial v0.5, v0.9, and v1.0 datasets. It performs particularly well on the image-guessing task, where it significantly outperforms previous baseline models. Case studies and ablation experiments further support the effectiveness and importance of each module.
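
The state-tracking idea summarized above (add a new entity or update an existing one, then guess from the accumulated mental imagery) can be illustrated with a toy sketch. This is not the paper's architecture: the names `DialogueState`, `track_state`, `mental_image`, and `rank_candidates` are hypothetical, the vectors are hand-written stand-ins for learned features, and the averaging update and dot-product ranking are assumptions chosen only to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Toy dialogue state (hypothetical structure, not the paper's)."""
    # entity name -> attribute words mentioned so far (textual side)
    entities: dict = field(default_factory=dict)
    # entity name -> toy "mental imagery" vector (visual side)
    imagery: dict = field(default_factory=dict)


def track_state(state, entity, attribute, visual_vec):
    """Add a new entity, or update an existing one, after an answer."""
    if entity not in state.entities:
        state.entities[entity] = [attribute]
        state.imagery[entity] = list(visual_vec)
    else:
        state.entities[entity].append(attribute)
        # Assumption: blend old and new imagery by simple averaging.
        old = state.imagery[entity]
        state.imagery[entity] = [(o + v) / 2 for o, v in zip(old, visual_vec)]
    return state


def mental_image(state, dim):
    """Pool all entity imagery vectors into one mean vector."""
    vecs = list(state.imagery.values())
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def rank_candidates(state, candidates):
    """Rank candidate images by dot product with the pooled mental image."""
    dim = len(next(iter(candidates.values())))
    query = mental_image(state, dim)
    scores = {
        name: sum(q * c for q, c in zip(query, feat))
        for name, feat in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Three simulated rounds: each answer adds or updates an entity.
state = DialogueState()
track_state(state, "dog", "brown", [1.0, 0.0])
track_state(state, "dog", "running", [0.0, 1.0])
track_state(state, "frisbee", "red", [1.0, 0.0])

candidates = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0]}
ranking = rank_candidates(state, candidates)
print(ranking)  # ['img_a', 'img_b']
```

In this sketch the guess depends only on the tracked state, never on a sampled real image, which mirrors the motivation given above for reasoning over a mental representation.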

Multi-Modal Dialogue State Tracking for Playing GuessWhich Game
