Abstract: Video captioning is an important and challenging task in computer vision and natural language processing that aims to automatically describe video content in natural language sentences. Comprehensive understanding of the video is key to accurate captioning: a model must not only capture the global content and salient objects, but also understand the objects' spatio-temporal relations, including their temporal trajectories and spatial relationships. It is therefore important to capture object relationships both within and across frames. In this paper, we propose an object-aware spatio-temporal graph (OSTG) approach for video captioning. It constructs spatio-temporal graphs to depict objects and their relations, where the temporal graphs represent objects' inter-frame dynamics and the spatial graphs represent objects' intra-frame interactions. The main novelties and advantages are: (1) Bidirectional temporal alignment: A bidirectional temporal graph is constructed along and against the temporal order to align objects across frames in both directions, which provides complementary clues for capturing the inter-frame temporal trajectory of each salient object. (2) Graph-based spatial relation learning: A spatial relation graph is constructed among the objects in each frame from their relative spatial locations and semantic correlations, and is exploited to learn relation features that encode the intra-frame relationships of salient objects. (3) Object-aware feature aggregation: Trainable VLAD (vector of locally aggregated descriptors) models perform object-aware aggregation of objects' local features, learning discriminative aggregated representations for better video captioning. A hierarchical attention mechanism is also developed to distinguish the contributions of different object instances. Experiments on two widely used datasets, MSR-VTT and MSVD, demonstrate that the proposed approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
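
To make the object-aware feature aggregation step in (3) more concrete, the following is a minimal sketch of a soft-assignment VLAD forward pass in plain Python. The abstract does not give the exact formulation, so the NetVLAD-style softmax assignment, the function name `soft_vlad`, and the sharpness parameter `alpha` are assumptions made for illustration only. In a trainable model the cluster centers would be learned by backpropagation; here they are fixed inputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vlad(features, centers, alpha=1.0):
    """Aggregate N local D-dim features into one K*D descriptor.

    features: list of N feature vectors (e.g. one per object region).
    centers:  list of K cluster centers (fixed here; learned in a
              trainable VLAD layer).
    alpha:    hypothetical sharpness of the soft assignment.
    """
    k, d = len(centers), len(centers[0])
    vlad = [[0.0] * d for _ in range(k)]
    for x in features:
        # Soft assignment: centers closer to x receive larger weights.
        scores = [
            -alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            for c in centers
        ]
        weights = softmax(scores)
        # Accumulate the weighted residual (x - c) for every center.
        for j, (w, c) in enumerate(zip(weights, centers)):
            for t in range(d):
                vlad[j][t] += w * (x[t] - c[t])
    # Flatten to a single vector and L2-normalize it.
    flat = [v for row in vlad for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat
```

Calling `soft_vlad(region_features, centers)` on the local features of one object returns a fixed-length vector regardless of how many regions were passed in, which is the property that makes VLAD-style aggregation convenient before a caption decoder.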

Motion-Aware Video Paragraph Captioning Via Exploring Object-Centered Internal Knowledge

Exploring Object-Centered External Knowledge for Fine-Grained Video Paragraph Captioning

Video Paragraph Captioning As a Text Summarization Task

Motion Guided Region Message Passing for Video Captioning

Sparse Frame Grouping Network with Action Centered for Untrimmed Video Paragraph Captioning

O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning

Towards Knowledge-aware Video Captioning via Transitive Visual Relationship Detection

Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks

Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation

Learning topic emotion and logical semantic for video paragraph captioning

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Video Captioning Via Relation-Aware Graph Learning

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Video Captioning with Transferred Semantic Attributes

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Motion Guided Spatial Attention for Video Captioning

Video Captioning With Attention-Based LSTM and Semantic Consistency

Adaptively Attending to Visual Attributes and Linguistic Knowledge for Captioning

OSVidCap: A Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario

Object-Aware Aggregation with Bidirectional Temporal Graph for Video Captioning