Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration

Yu Lu, Feiyue Ni, Haofan Wang, Xiaofeng Guo, Linchao Zhu, Zongxin Yang, Ruihua Song, Lele Cheng, Yi Yang
DOI: https://doi.org/10.1109/tmm.2023.3296944
IF: 7.3
2023-01-01
IEEE Transactions on Multimedia
Abstract: Illustrating a multi-sentence story with visual content is a significant challenge in multimedia research. Previous works have focused on sequential story-to-visual representation at the image level or on representing a single sentence with a video clip, while illustrating a long multi-sentence story with coherent videos remains under-explored. In this paper, we propose the task of video-based story illustration, whose goal is to visually illustrate a story with retrieved video clips. To support this task, we first create a large-scale dataset in which each sample is a coherent video story, consisting of 85K narrative stories with 60 pairs of consistent clips and texts. We then propose the Story Context-Enhanced Model, which leverages local and global contextual information within the story, inspired by sequence modeling in language understanding. Through comprehensive quantitative experiments, we demonstrate the effectiveness of our baseline model. In addition, qualitative results and detailed user studies reveal that our method can retrieve coherent video sequences for stories. The dataset and code will be made publicly available at https://nfy-dot.github.io/CVSV-dataset/.
computer science, information systems, telecommunications, software engineering