Abstract: Video captioning is the task of generating descriptive sentences for a video by learning its semantic information. It has a wide range of applications, including video retrieval, automatic subtitle generation, and assistance for blind users. Visual semantic information plays a decisive role in video captioning. However, traditional methods model video features only coarsely and fail to exploit both local and global features to understand temporal and spatial relationships. To address these issues, we propose a video captioning model based on the Transformer and the graph convolutional network (GCN), called the "Relation-Enhanced Spatial–Temporal Hierarchical Transformer" (RESTHT). It is a spatial–temporal hierarchical framework that jointly models local and global features in both time and space. For temporal modeling, our model learns the direct interactions between diverse video features and sentence features in the temporal sequence via the pre-trained GPT2, and the global feature construction encourages it to capture essential and relevant information. For spatial modeling, we use self-attention and GCNs to jointly learn spatial relationships from the appearance and motion perspectives. Through spatial–temporal modeling, our method can comprehend the global time–space relationships of complex events in videos and capture the interactions between different objects, generating more accurate descriptions that apply to general video captioning tasks. We conducted experiments on two widely used datasets. On the MSVD dataset in particular, our model improves the CIDEr score by 6.1 over the baseline and exceeds existing methods by 13. The results verify that our model can fully model temporal and spatial relationships and outperforms other related models.
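
The spatial modeling idea mentioned in the abstract (self-attention combined with a GCN over appearance and motion features) can be illustrated with a minimal sketch. This is not the RESTHT implementation: the function names, the use of attention weights as the graph adjacency, the single-layer design, and the concatenation of the two streams are all assumptions made for illustration only.

```python
import math

def matmul(a, b):
    # Plain matrix product of two nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    # Numerically stable softmax applied to each row.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_adjacency(feats):
    # Assumption: scaled dot-product self-attention weights between
    # object features serve as a soft adjacency matrix.
    d = len(feats[0])
    transposed = [list(col) for col in zip(*feats)]
    scores = matmul(feats, transposed)
    return softmax_rows([[v / math.sqrt(d) for v in row] for row in scores])

def gcn_layer(feats, adj, weight):
    # One graph-convolution step: aggregate neighbours, project, apply ReLU.
    h = matmul(matmul(adj, feats), weight)
    return [[max(v, 0.0) for v in row] for row in h]

def spatial_relation_features(appearance, motion, w_app, w_mot):
    # Hypothetical fusion: run each stream through its own graph layer
    # and concatenate the per-object outputs.
    h_app = gcn_layer(appearance, attention_adjacency(appearance), w_app)
    h_mot = gcn_layer(motion, attention_adjacency(motion), w_mot)
    return [a + m for a, m in zip(h_app, h_mot)]

# Toy input: three objects with 2-dimensional appearance and motion features.
appearance = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motion = [[0.5, 0.2], [0.1, 0.9], [0.3, 0.3]]
identity = [[1.0, 0.0], [0.0, 1.0]]
fused = spatial_relation_features(appearance, motion, identity, identity)
```

Here each object ends up with a 4-dimensional vector (two appearance-stream values followed by two motion-stream values) that mixes in information from the other objects according to the attention weights.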

RESTHT: relation-enhanced spatial–temporal hierarchical transformer for video captioning
