Multi-level video captioning method based on semantic space

Xiao Yao,Yuanlin Zeng,Min Gu,Ruxi Yuan,Jie Li,Junyi Ge
DOI: https://doi.org/10.1007/s11042-024-18372-z
IF: 2.577
2024-02-08
Multimedia Tools and Applications
Abstract:Video captioning is designed to generate natural language descriptions based on video content. Traditional methods extract visual features and interactive relationship features between objects, but the problem of video feature isolation and semantic hierarchy is ignored. This paper proposes a Multi-Level Video Captioning Method based on semantic space (S-MLM) to solve the above problems. S-MLM extracts different levels of visual elements and visual relationships, and the visual information of different levels is aggregated layer by layer to complete the generation of low-level to high-level visual features. The multi-level structure semantic graph is constructed from the semantic point of view. It does not rely on external knowledge bases, and uses its own information as guidance to enhance feature representation and improve semantic understanding. We conduct experiments on MSVD and MSR-VTT datasets, and the experimental results show that the performance of video captioning is further improved.
computer science, information systems, theory & methods,engineering, electrical & electronic, software engineering
What problem does this paper attempt to address?