Agent-based Video Trimming

Lingfeng Yang, Zhenyuan Chen, Xiang Li, Peiyang Jia, Liangqu Long, Jian Yang
2024-12-13
Abstract: As information becomes more accessible, user-generated videos are increasing in length, placing a burden on viewers to sift through vast content for valuable insights. This trend underscores the need for an algorithm to extract key video information efficiently. Despite significant advancements in highlight detection, moment retrieval, and video summarization, current approaches primarily focus on selecting specific time intervals, often overlooking the relevance between segments and the potential for segment arranging. In this paper, we introduce a novel task called Video Trimming (VT), which focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. To address this task, we propose Agent-based Video Trimming (AVT), structured into three phases: Video Structuring, Clip Filtering, and Story Composition. Specifically, we employ a Video Captioning Agent to convert video slices into structured textual descriptions, a Filtering Module to dynamically discard low-quality footage based on the structured information of each clip, and a Video Arrangement Agent to select and compile valid clips into a coherent final narrative. For evaluation, we develop a Video Evaluation Agent to assess trimmed videos, conducting assessments in parallel with human evaluations. Additionally, we curate a new benchmark dataset for video trimming using raw user videos from the internet. As a result, AVT received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task. The code and models are available at https://ylingfeng.github.io/AVT.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of efficiently extracting key information from user-generated videos while removing redundant footage and combining the valuable segments into a coherent story. Specifically, it proposes a new task, Video Trimming (VT), whose goal is not only to select highly salient video segments but also to filter out wasted footage and recombine the remaining segments in a logical, coherent order to form the final video. The task targets a limitation of existing video processing methods (highlight detection, moment retrieval, and video summarization), which focus on selecting time intervals while overlooking the relationships between segments and the overall narrative coherence.

The main contributions of the paper are as follows:

1. **Introduce the video trimming task for the first time**: Extract key intentions from long videos and generate condensed videos with a coherent storyline.
2. **Propose an Agent-based Video Trimming algorithm (AVT)**: The algorithm works in three stages (video structuring, clip filtering, and story composition): it converts video content into structured descriptions, filters out wasted footage, and composes the selected clips into a coherent final video.
3. **Create a new video trimming benchmark dataset**: The dataset consists of raw user videos collected from the internet, and trimmed videos are assessed by a Video Evaluation Agent alongside human evaluation.
4. **Demonstrate superior performance in video trimming and zero-shot highlight detection**: User studies and results on multiple benchmarks (YouTube Highlights, TVSum, and the new dataset) support the effectiveness of the method.

Together, these contributions address limitations of current video processing techniques and offer a more comprehensive and efficient approach to video content editing.
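The three stages described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Clip` fields, the function names, the sample data, and the filtering rule (keep a clip only when its highlight score exceeds every defect score) are all assumptions made for illustration. In AVT the structuring and composition stages are carried out by model-based agents; here they are replaced by simple stand-ins so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    """Structured description of one video slice (all fields are hypothetical)."""
    start: float
    end: float
    caption: str
    highlight: float                              # assumed saliency score in [0, 1]
    defects: dict = field(default_factory=dict)   # assumed defect scores, e.g. {"shaky": 0.8}


def structure_video(raw_slices):
    """Stage 1 stand-in: a real system would call a captioning model here."""
    return [Clip(**s) for s in raw_slices]


def filter_clips(clips):
    """Stage 2 (assumed rule): keep a clip only if highlight beats every defect score."""
    return [c for c in clips if c.highlight > max(c.defects.values(), default=0.0)]


def compose_story(clips, max_clips=2):
    """Stage 3 stand-in: pick the most salient clips, then restore timeline order."""
    chosen = sorted(clips, key=lambda c: c.highlight, reverse=True)[:max_clips]
    return sorted(chosen, key=lambda c: c.start)


def trim(raw_slices, max_clips=2):
    """Run the three stages in sequence."""
    return compose_story(filter_clips(structure_video(raw_slices)), max_clips)


# Made-up sample data, not taken from the paper's benchmark.
RAW = [
    {"start": 0, "end": 3, "caption": "camera pointed at the ground",
     "highlight": 0.1, "defects": {"meaningless": 0.9}},
    {"start": 3, "end": 6, "caption": "skier starts the run",
     "highlight": 0.6, "defects": {"shaky": 0.3}},
    {"start": 6, "end": 9, "caption": "skier waits",
     "highlight": 0.4, "defects": {"shaky": 0.2}},
    {"start": 9, "end": 12, "caption": "skier lands a jump",
     "highlight": 0.9, "defects": {"shaky": 0.5}},
]

if __name__ == "__main__":
    for clip in trim(RAW):
        print(f"{clip.start:>4}-{clip.end:<4} {clip.caption}")
```

In this toy run the first slice is discarded by the filter because its defect score dominates, and the composition step keeps the two most salient remaining clips in chronological order. A timeline-order heuristic is only one possible arrangement policy; the paper's Video Arrangement Agent selects and orders clips to form a narrative, which this sketch does not attempt to reproduce.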