Multi-Modal Dialogue State Tracking for Playing GuessWhich Game

Wei Pang, Ruixue Duan, Jinfu Yang, Ning Li
DOI: https://doi.org/10.1007/978-981-99-8850-1_45
2024-08-16
Abstract: GuessWhich is an engaging visual dialogue game that involves interaction between a Questioner Bot (QBot) and an Answer Bot (ABot) in the context of image-guessing. In this game, QBot's objective is to locate a concealed image solely through a series of visually related questions posed to ABot. However, effectively modeling visually related reasoning in QBot's decision-making process poses a significant challenge. Current approaches either lack visual information or rely on a single real image sampled at each round as decoding context, both of which are inadequate for visual reasoning. To address this limitation, we propose a novel approach that focuses on visually related reasoning through the use of a mental model of the undisclosed image. Within this framework, QBot learns to represent mental imagery, enabling robust visual reasoning by tracking the dialogue state. The dialogue state comprises a collection of representations of mental imagery, as well as representations of the entities involved in the conversation. At each round, QBot engages in visually related reasoning using the dialogue state to construct an internal representation, generate relevant questions, and update both the dialogue state and internal representation upon receiving an answer. Our experimental results on the VisDial datasets (v0.5, 0.9, and 1.0) demonstrate the effectiveness of our proposed model, as it achieves new state-of-the-art performance across all metrics and datasets, surpassing previous state-of-the-art models. Codes and datasets from our experiments are freely available at https://github.com/xubuvd/GuessWhich.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the challenges faced by the Questioner Bot (QBot) in the visual dialogue game "GuessWhich." In this game, QBot must identify a hidden image by asking the Answer Bot (ABot) a series of image-related questions. Effectively modeling this visual reasoning process is a significant challenge for QBot. Current methods have two main issues:
1. Methods that lack visual information and rely solely on text (non-visual methods) cannot use visual cues for reasoning.
2. Methods that sample a single real image from a large pool of candidate images as decoding context (real-image methods) are unnatural and introduce significant sampling bias, which makes the reasoning process unstable.

To address these issues, the paper proposes a GuessWhich model based on multimodal dialogue state tracking (DST). The core of this method is to construct QBot's mental representation of the undisclosed image and to perform visual reasoning by tracking the dialogue state. Specifically, the DST model includes the following key components:
- **Recursive Self-Referential Equation (R-SRE)**: Captures visually related interactions within the dialogue state and across modalities.
- **Visual Reasoning based on Dialogue State (VRDS)**: Performs three-step reasoning, from text to text, then to image, and back to text, to produce internal representations.
- **Question Decoder (QDer)**: Generates new questions from these internal representations.
- **QBot Encoder (QEnc)**: Uses the pre-trained vision-language model ViLBERT to encode the input data.
- **State Tracking (STrack)**: Updates the dialogue state by adding new entities or updating existing ones.

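
The state-tracking step described above (adding new entities or updating existing ones) can be sketched in Python. This is only a toy illustration under stated assumptions: `DialogueState`, `track`, `blend`, and the weight `alpha` are hypothetical names that do not come from the paper, vectors are plain lists of floats, and the linear interpolation merely stands in for whatever learned update the actual model uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DialogueState:
    """Toy dialogue state (hypothetical): per-entity text and mental-imagery vectors."""
    entities: Dict[str, List[float]] = field(default_factory=dict)
    imagery: Dict[str, List[float]] = field(default_factory=dict)


def blend(old: List[float], new: List[float], alpha: float) -> List[float]:
    """Linear interpolation between two vectors (an assumed stand-in for a learned update)."""
    return [(1 - alpha) * o + alpha * n for o, n in zip(old, new)]


def track(state: DialogueState, entity: str,
          text_vec: List[float], image_vec: List[float],
          alpha: float = 0.5) -> DialogueState:
    """Add a new entity to the state, or update it if it is already tracked."""
    if entity in state.entities:
        # Existing entity: merge the new evidence into both representations.
        state.entities[entity] = blend(state.entities[entity], text_vec, alpha)
        state.imagery[entity] = blend(state.imagery[entity], image_vec, alpha)
    else:
        # New entity: store copies of its initial representations.
        state.entities[entity] = list(text_vec)
        state.imagery[entity] = list(image_vec)
    return state


# Two rounds mentioning the same entity: the first adds it, the second updates it.
state = DialogueState()
track(state, "dog", [1.0, 0.0], [0.0, 2.0])
track(state, "dog", [0.0, 1.0], [2.0, 0.0])
```

Keeping the text and imagery representations in parallel dictionaries mirrors the summary's description of a state holding both entity and mental-imagery representations. The real model operates on learned embeddings, not hand-written vectors.
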
Experimental results show that the proposed DST model achieves state-of-the-art performance on the VisDial v0.5, v0.9, and v1.0 datasets, particularly excelling in the image guessing task, significantly outperforming previous baseline models. Additionally, case studies and ablation experiments further validate the effectiveness and importance of each module.