Subject-Oriented Video Captioning

Yunchuan Ma, Chang Teng, Yuankai Qi, Guorong Li, Laiyu Qing, Qi Wu, Qingming Huang
2023-12-21
Abstract: Describing video content according to users' needs is a long-held goal. Although existing video captioning methods have made significant progress, the generated captions may not focus on the entity that users are particularly interested in. To address this problem, we propose a new video captioning task, subject-oriented video captioning, which allows users to specify the target to describe via a bounding box. To support this task, we construct two subject-oriented video captioning datasets based on two widely used video captioning datasets, MSVD and MSRVTT, by annotating the subjects in each video for each caption. These datasets pave the way for future technique development. As a first attempt, we evaluate four state-of-the-art general video captioning models and observe a large performance drop. We then explore several strategies to enable them to describe the desired target. Experimental results show clear improvement, but there is still considerable room for further exploration in this field.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a specific gap in video captioning: how to let users specify an object or entity of interest in a video and generate a description focused on it. Existing methods often fail to focus on the entity a user cares about, which limits their use in real-world scenarios. To solve this problem, the paper proposes a new task, subject-oriented video captioning, in which the user specifies the target of the description with a bounding box. To support the task, the authors construct two new datasets from two widely used video captioning datasets, MSVD and MSRVTT, annotating the subject in each video for each caption.

The paper also evaluates four state-of-the-art general video captioning models on the new task and observes a significant drop in performance, indicating that existing methods struggle to meet the needs of subject-oriented video captioning. The authors then make preliminary attempts to integrate entity features into existing Transformer-based and LSTM-based frameworks. These attempts improve results and may provide a foundation for future research.

The main contributions of the paper can be summarized as follows:
1. It proposes the new task of subject-oriented video captioning and constructs corresponding datasets based on MSVD and MSRVTT.
2. It evaluates four representative general video captioning methods and finds a significant drop in their performance on the new task, indicating the need for new techniques.
3. It explores several strategies for the new task and achieves clear improvements, providing a baseline for subsequent research, although the abstract notes that substantial room for improvement remains.
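
To make the idea of a bounding-box-specified subject concrete, the following is a minimal sketch of one way subject features could be paired with whole-frame features before being passed to a captioning model. It is not the authors' method: the function names (`crop_subject`, `toy_features`, `subject_aware_features`), the `(x1, y1, x2, y2)` pixel box format, and the per-channel-mean "encoder" are all assumptions made for illustration. A real system would use a learned visual backbone in place of `toy_features`.

```python
import numpy as np


def crop_subject(frames, boxes):
    """Crop the user-specified subject region from each frame.

    frames: array of shape (T, H, W, C)
    boxes:  array of shape (T, 4), assumed to hold (x1, y1, x2, y2)
            in pixel coordinates, with x2 and y2 exclusive.
    """
    crops = []
    for frame, (x1, y1, x2, y2) in zip(frames, boxes):
        h, w = frame.shape[:2]
        # Clamp the box to the frame so slightly out-of-range boxes still work.
        x1, x2 = max(0, int(x1)), min(w, int(x2))
        y1, y2 = max(0, int(y1)), min(h, int(y2))
        if x2 <= x1 or y2 <= y1:
            raise ValueError("bounding box is empty after clamping")
        crops.append(frame[y1:y2, x1:x2])
    return crops


def toy_features(image):
    """Hypothetical stand-in for a visual encoder: per-channel mean."""
    return image.reshape(-1, image.shape[-1]).mean(axis=0)


def subject_aware_features(frames, boxes):
    """Concatenate whole-frame and subject-region features per frame.

    Returns an array of shape (T, 2 * C): the first C values describe
    the full frame, the last C describe the cropped subject.
    """
    global_feats = np.stack([toy_features(f) for f in frames])
    subject_feats = np.stack(
        [toy_features(c) for c in crop_subject(frames, boxes)]
    )
    return np.concatenate([global_feats, subject_feats], axis=1)
```

Concatenation is only one possible design choice. It keeps the scene context available to the caption decoder while adding a signal that is specific to the selected subject.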