Abstract: Video captioning is an important and challenging task in computer vision and natural language processing that aims to automatically describe video content in natural language sentences. Accurate captioning depends on a comprehensive understanding of the video: a model must capture not only the global content and salient objects, but also the spatio-temporal relations among objects, including their temporal trajectories and spatial relationships. It is therefore important to capture object relationships both within and across frames. In this paper, we propose an object-aware spatio-temporal graph (OSTG) approach for video captioning. It constructs spatio-temporal graphs to depict objects and their relations: the temporal graphs represent objects' inter-frame dynamics, and the spatial graphs represent objects' intra-frame interactions. The main novelties and advantages are: (1) Bidirectional temporal alignment: a bidirectional temporal graph is constructed both along and against the temporal order to align objects across frames, providing complementary clues for capturing the inter-frame trajectory of each salient object. (2) Graph-based spatial relation learning: a spatial relation graph is constructed among the objects in each frame from their relative spatial locations and semantic correlations, and is exploited to learn relation features that encode intra-frame relationships among salient objects. (3) Object-aware feature aggregation: trainable VLAD (vector of locally aggregated descriptors) models perform object-aware aggregation of objects' local features, learning discriminative aggregated representations for better video captioning. A hierarchical attention mechanism is also developed to distinguish the contributions of different object instances. Experiments on two widely used datasets, MSR-VTT and MSVD, demonstrate that the proposed approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
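
To make the feature aggregation step in point (3) more concrete, the following is a minimal sketch of soft-assignment VLAD aggregation over a set of local object features. It is an illustration under stated assumptions, not the paper's implementation: the function name `soft_vlad`, the sharpness parameter `alpha`, and the fixed cluster centers are all hypothetical. In a trainable VLAD model the centers and assignment parameters would be learned end to end, whereas here they are plain inputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vlad(features, centers, alpha=1.0):
    """Aggregate N local D-dim features into one K*D descriptor.

    features: list of N feature vectors (lists of D floats)
    centers:  list of K cluster centers (assumed fixed here; learned
              in a trainable VLAD layer)
    alpha:    hypothetical sharpness of the soft assignment
    """
    k, d = len(centers), len(centers[0])
    vlad = [[0.0] * d for _ in range(k)]
    for x in features:
        # Soft assignment: centers closer to x receive larger weights.
        scores = [
            -alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            for c in centers
        ]
        weights = softmax(scores)
        # Accumulate weighted residuals (x - center) per cluster.
        for j, (w, c) in enumerate(zip(weights, centers)):
            for t in range(d):
                vlad[j][t] += w * (x[t] - c[t])
    # Intra-normalize each cluster row, then flatten and L2-normalize.
    flat = []
    for row in vlad:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        flat.extend(v / norm for v in row)
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


# Toy usage: three 2-D "object features" and two cluster centers.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]
descriptor = soft_vlad(features, centers)
```

The descriptor has a fixed length of K*D regardless of how many object features are supplied, which is what makes this kind of aggregation convenient when the number of detected objects varies from frame to frame.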

Optimizing multi-graph learning: towards a unified video annotation scheme.

Multi-Modality Transfer Based on Multi-Graph Optimization for Domain Adaptive Video Concept Annotation

Exploring Multi-Modality Structure for Cross Domain Adaptation in Video Concept Annotation

A Generic Framework for Video Annotation Via Semi-Supervised Learning.

Graph-Based Semi-Supervised Learning with Multi-Label

Classification-Then-Grounding: Reformulating Video Scene Graphs As Temporal Bipartite Graphs

Multi-View Graph Embedding Learning for Image Co-Segmentation and Co-Localization

Towards Multi-Semantic Image Annotation with Graph Regularized Exclusive Group Lasso

Exploiting Semantic And Visual Context For Effective Video Annotation

Sequence Multi-Labeling: A Unified Video Annotation Scheme with Spatial and Temporal Context

Central Attention with Multi-Graphs for Image Annotation

Discriminative Latent Semantic Graph for Video Captioning

Ensemble Multi-Instance Multi-Label Learning Approach for Video Annotation Task

Online Multi-Label Active Annotation

CMGNet: Collaborative multi-modal graph network for video captioning

Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation.

Object Relational Graph with Teacher-Recommended Learning for Video Captioning

Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

MGSGA: Multi-grained and Semantic-Guided Alignment for Text-Video Retrieval

Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding

Adversarial Reinforcement Learning With Object-Scene Relational Graph for Video Captioning