Abstract: Despite rapid advances in image captioning and visual question answering at the single-round level, the question of how to generate multi-round dialogue about visual content has not yet been well explored. Existing visual dialogue methods encode the image into a fixed feature vector directly, which is concatenated with the question and history embeddings to predict the response. Some recent methods tackle the co-reference resolution problem using a co-attention mechanism to cross-refer relevant elements from the image, history, and the target question. However, it remains challenging to reason about visual relationships, since the fine-grained object-level information is omitted before co-attentive reasoning. In this paper, we propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation. Specifically, a hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally to obtain the final graph embeddings. A graph attention mechanism is further incorporated to dynamically attend to this graph-structured representation at the response reasoning stage. Extensive experiments demonstrate that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships. The model achieves superior performance over the state-of-the-art methods on the Visual Dialog dataset, increasing MRR from 0.6222 to 0.6447, and recall@1 from 48.48% to 51.22%.
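
The two metrics reported above, mean reciprocal rank (MRR) and recall@1, are both computed from the rank that a model assigns to the ground-truth answer among the candidate responses. The sketch below uses the standard definitions of these metrics; the function names and the example ranks are hypothetical illustrations, not code or results from the paper.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank, where each rank is the 1-based position
    of the ground-truth answer in the model's sorted candidate list."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of questions whose ground-truth answer is ranked
    within the top k candidates."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for three dialogue questions (illustrative only).
example_ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(example_ranks)
r_at_1 = recall_at_k(example_ranks, 1)
```

Under these definitions, a higher MRR means the correct answer tends to sit nearer the top of the ranked list, while recall@1 counts only the cases where it is ranked first.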

Language-enhanced object reasoning networks for video moment retrieval with text query

ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Object-aware Video-language Pre-training for Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022

ReGR: Relation-aware graph reasoning framework for video question answering

Semantic Modulation Based Residual Network for Temporal Language Queries Grounding in Video

Attentive Moment Retrieval in Videos

Reasoning-Enhanced Object-Centric Learning for Videos

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization

Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval

Siamese Alignment Network for Weakly Supervised Video Moment Retrieval

Natural Language Video Localization with Learnable Moment Proposals

DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video

Weakly-Supervised Video Moment Retrieval via Regularized Two-Branch Proposal Networks with Erasing Mechanism

Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

Video Moment Retrieval with Noisy Labels

ORD: Object Relationship Discovery for Visual Dialogue Generation