RESTHT: relation-enhanced spatial–temporal hierarchical transformer for video captioning

Zheng, Lihuan
DOI: https://doi.org/10.1007/s00371-024-03350-1
IF: 2.835
2024-04-18
The Visual Computer
Abstract:The video captioning task is generating description sentences by learning semantic information. It has a wide range of applications in areas such as video retrieval, automatic generation of subtitles and blind assistance. Visual semantic information plays a decisive role in video captioning. However, traditional methods are relatively rough for video feature modeling, failing to harness local and global features to understand temporal and spatial relationships. In this paper, we propose a video captioning model based on the Transformer and GCN network called "Relation-Enhanced Spatial–Temporal Hierarchical Transformer" (RESTHT). To address the above issues, we present a spatial–temporal hierarchical network framework to jointly model local and global features in terms of both time and space. For temporal modeling, our model learns the direct interactions between diverse video features and sentence features in the temporal sequence via the pre-trained GPT2, and the global feature construction encourages it to capture essential and relevant information. For spatial modeling, we use self-attention and GCN networks to learn the spatial relationship from appearance and motion perspectives jointly. Through spatial–temporal modeling, our method can comprehend the global time–space relationships of complex events in videos and catch the interaction between different objects to generate more accurate descriptions applicable to universal video captioning tasks. We conducted experiments on two widely used datasets, and especially in the MSVD dataset, our model improves the score of CIDEr by 6.1 compared to the baseline and excels present methods by 13. The results verify that our model can fully model the temporal and spatial relationship and outperforms other related models.
computer science, software engineering
What problem does this paper attempt to address?