A Dataset with Multi-Modal Information and Multi-Granularity Descriptions for Video Captioning

Mingrui Xiao, Zijian Zeng, Yue Zheng, Shu Yang, Yali Li, Shengjin Wang
DOI: https://doi.org/10.1109/icme57554.2024.10688196
2024-01-01
Abstract: Video captioning aims to generate natural language descriptions automatically from videos. While datasets like MSVD and MSR-VTT have driven research in recent years, they predominantly focus on visual features and describe simple actions, ignoring audio, text, and other modalities. This is a limitation, because multi-modal information plays an important role in generating accurate captions. In this study, we introduce a dataset, News-11k, which includes over 150,000 captions with multi-modal information from more than 11,000 selected news video clips. We annotate captions at three levels of granularity: coarse-grained, medium-grained, and fine-grained. Due to the characteristics of news videos, generating accurate captions on our dataset requires multi-modal understanding. Therefore, we propose a baseline model for multi-modal video captioning. To address the challenge of multi-modal information fusion, we devise a concatenating modal embedding strategy. Experiments indicate that multi-modal information significantly enhances the understanding of the deeper semantics in videos. Data will be made available at https://github.com/David-Zeng-Zijian/News-11k.
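The abstract names a concatenating modal embedding strategy but does not describe it in detail. The following is a minimal sketch of one common way such a fusion could work, not the authors' implementation: each modality's token embeddings are tagged with a modality-type vector and the sequences are joined along the sequence axis. The function name `concat_modal_embeddings`, the additive type vectors, and the toy 2-dimensional features are all assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]
Sequence = List[Vector]


def concat_modal_embeddings(
    modal_tokens: Dict[str, Sequence],
    modal_type_embedding: Dict[str, Vector],
) -> Sequence:
    """Fuse modalities by concatenating their token sequences.

    Assumption: every modality has already been projected to the same
    embedding dimension. A per-modality type vector is added to each token
    (a hypothetical choice) so a downstream model can tell modalities apart.
    """
    fused: Sequence = []
    for name, tokens in modal_tokens.items():
        type_vec = modal_type_embedding[name]
        for tok in tokens:
            if len(tok) != len(type_vec):
                raise ValueError(f"dimension mismatch in modality {name!r}")
            # Element-wise sum of token embedding and modality-type vector.
            fused.append([t + m for t, m in zip(tok, type_vec)])
    return fused


# Toy features; real ones would come from pretrained encoders.
video = [[1.0, 0.0], [0.5, 0.5]]  # two visual tokens
audio = [[0.0, 1.0]]              # one audio token
text = [[2.0, 2.0]]               # one text token

fused = concat_modal_embeddings(
    {"video": video, "audio": audio, "text": text},
    {"video": [0.0, 0.0], "audio": [1.0, 0.0], "text": [0.0, 1.0]},
)
print(len(fused))  # total number of tokens across all modalities
```

In this sketch the fused sequence keeps one entry per input token, so a captioning decoder attending over it could draw on visual, audio, and text tokens at once.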