Abstract: GuessWhich is a visual dialogue game in which a Questioner Bot (QBot) and an Answer Bot (ABot) interact in an image-guessing setting. QBot's objective is to locate a concealed image solely by posing a series of visually related questions to ABot. However, modeling visually related reasoning in QBot's decision-making process remains a significant challenge. Current approaches either lack visual information or rely on a single real image sampled at each round as decoding context, both of which are inadequate for visual reasoning. To address this limitation, we propose an approach that performs visually related reasoning through a mental model of the undisclosed image. Within this framework, QBot learns to represent mental imagery, enabling robust visual reasoning by tracking the dialogue state. The dialogue state comprises a collection of representations of mental imagery, as well as representations of the entities involved in the conversation. At each round, QBot uses the dialogue state to perform visually related reasoning: it constructs an internal representation, generates relevant questions, and updates both the dialogue state and the internal representation upon receiving an answer. Experimental results on the VisDial datasets (v0.5, v0.9, and v1.0) demonstrate the effectiveness of the proposed model, which achieves new state-of-the-art performance across all metrics and datasets. Code and datasets from our experiments are freely available at https://github.com/xubuvd/GuessWhich.

Exploring Contextual-Aware Representation and Linguistic-Diverse Expression for Visual Dialog

SKANet - Structured Knowledge-Aware Network for Visual Dialog

Context Gating with Multi-Level Ranking Learning for Visual Dialog

HVLM: Exploring Human-Like Visual Cognition and Language-Memory Network for Visual Dialog

New Datasets and Models for Contextual Reasoning in Visual Dialog

Iterative Context-Aware Graph Inference for Visual Dialog

Hybrid Graph Reasoning with Dynamic Interaction for Visual Dialog

Context-Aware Graph Inference with Knowledge Distillation for Visual Dialog

Multi-Modal Dialogue State Tracking for Playing GuessWhich Game

Modeling Explicit Concerning States for Reinforcement Learning in Visual Dialogue

Heterogeneous Knowledge Network for Visual Dialog

Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning

You should know more: Learning external knowledge for visual dialog

Recurrent Attention Network with Reinforced Generator for Visual Dialog

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

Learning to Ground Visual Objects for Visual Dialog

Video Dialog Via Multi-Grained Convolutional Self-Attention Context Multi-Modal Networks

DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog

Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset