Abstract: With the rapid advancement of image captioning and visual question answering at the single-round level, the question of how to generate multi-round dialogue about visual content has not yet been well explored. Existing visual dialogue methods encode the image directly into a fixed feature vector, which is concatenated with the question and history embeddings to predict the response. Some recent methods tackle the co-reference resolution problem using a co-attention mechanism to cross-refer relevant elements from the image, the history, and the target question. However, it remains challenging to reason about visual relationships, since the fine-grained object-level information is omitted before co-attentive reasoning. In this paper, we propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation. Specifically, a hierarchical graph convolutional network (HierGCN) is proposed that retains the object nodes and neighbour relationships locally, and then refines the object-object connections globally to obtain the final graph embeddings. A graph attention mechanism is further incorporated to dynamically attend to this graph-structured representation at the response reasoning stage. Extensive experiments demonstrate that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships. The model achieves superior performance over state-of-the-art methods on the Visual Dialog dataset, increasing MRR from 0.6222 to 0.6447, and recall@1 from 48.48% to 51.22%.
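
The two metrics quoted in the abstract, MRR and recall@1, are standard retrieval measures computed from the rank that a model assigns to the ground-truth answer among a list of candidates. The following is a minimal sketch of their usual definitions, not the authors' evaluation code; the function names and the example ranks are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Mean of 1/rank, where each rank is the 1-based position
    of the ground-truth answer in the model's sorted candidate list."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of questions whose ground-truth answer is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five questions (illustrative only, not from the paper).
example_ranks = [1, 3, 2, 1, 10]
mrr = mean_reciprocal_rank(example_ranks)
r_at_1 = recall_at_k(example_ranks, k=1)
r_at_5 = recall_at_k(example_ranks, k=5)
```

Under these definitions, a higher MRR means the correct answer tends to sit nearer the top of the ranking, while recall@1 counts only the cases where it is ranked first.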

Infer unseen from seen: Relation regularized zero-shot visual dialog

Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models

Multi-view semantic understanding for visual dialog

Logic-guided Semantic Representation Learning for Zero-Shot Relation Classification

New Datasets and Models for Contextual Reasoning in Visual Dialog

Good Questions Help Zero-Shot Image Reasoning

Relation-Aware Multi-hop Reasoning for Visual Dialog

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

ORD: Object Relationship Discovery for Visual Dialogue Generation

Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models

Cross-modal Relational Reasoning Network for Visual Question Answering

Prior Visual Relationship Reasoning For Visual Question Answering

ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue

Visual coreference resolution in visual dialog using neural module networks

Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations

Multi-Modal Fusion with Multi-Level Attention for Visual Dialog

Unified Multimodal Model with Unlikelihood Training for Visual Dialog

R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering