Shot Retrieval and Assembly with Text Script for Video Montage Generation

Guoxing Yang, Haoyu Lu, Zelong Sun, Zhiwu Lu
DOI: https://doi.org/10.1145/3591106.3592247
2023-01-01
Abstract: With the development of video-sharing websites, many users want to create their own attractive video montages. However, inexperienced users find it difficult to create well-edited video montages because they lack professional expertise. Creating high-quality video montages is time-consuming even for experts, since it requires effectively selecting shots from abundant candidates and assembling them together. As an alternative to manual creation, various automatic methods have been proposed for video montage generation, but they typically take a single sentence as input for text-to-shot retrieval and ignore cross-sentence semantic coherence when given a complicated text script of multiple sentences. To overcome this drawback, we propose a novel model for video montage generation that retrieves and assembles shots given arbitrary text scripts. To this end, a sequence consistency transformer is devised for cross-sentence coherence modeling. More importantly, with this transformer, two novel tasks are defined for sequence-level sentence-shot alignment: the Cross-Modal Sequence Matching (CMSM) task and the Chaotic Sequence Recovering (CSR) task. To facilitate research on video montage generation, we construct a new, highly varied dataset of thousands of video-script pairs collected from documentaries. Extensive experiments on the constructed dataset demonstrate the superior performance of the proposed model. The dataset and generated video demos are available at https://github.com/RATVDemo/RATV.