Abstract: As information becomes more accessible, user-generated videos are increasing in length, placing a burden on viewers to sift through vast content for valuable insights. This trend underscores the need for an algorithm to extract key video information efficiently. Despite significant advancements in highlight detection, moment retrieval, and video summarization, current approaches primarily focus on selecting specific time intervals, often overlooking the relevance between segments and the potential for segment arranging. In this paper, we introduce a novel task called Video Trimming (VT), which focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. To address this task, we propose Agent-based Video Trimming (AVT), structured into three phases: Video Structuring, Clip Filtering, and Story Composition. Specifically, we employ a Video Captioning Agent to convert video slices into structured textual descriptions, a Filtering Module to dynamically discard low-quality footage based on the structured information of each clip, and a Video Arrangement Agent to select and compile valid clips into a coherent final narrative. For evaluation, we develop a Video Evaluation Agent to assess trimmed videos, conducting assessments in parallel with human evaluations. Additionally, we curate a new benchmark dataset for video trimming using raw user videos from the internet. As a result, AVT received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task. The code and models are available at https://ylingfeng.github.io/AVT.
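The abstract reports mAP for highlight detection. As a rough illustration of what that metric measures, the sketch below computes a generic average precision over ranked clips and averages it across videos. It is not the exact evaluation protocol of YouTube Highlights or TVSum; the function names, the binary labels, and the sample scores are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision for one video.

    scores: predicted highlight score per clip (higher means more salient).
    labels: 1 if the clip is a ground-truth highlight, else 0.
    """
    # Rank clip indices from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            # Precision at the rank where a true highlight is retrieved.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(videos: List[Tuple[Sequence[float], Sequence[int]]]) -> float:
    """Mean of the per-video average precision values."""
    aps = [average_precision(s, l) for s, l in videos]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical predictions for two short videos of four clips each.
video_a = ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # one highlight ranked below a non-highlight
video_b = ([0.9, 0.7, 0.2, 0.1], [1, 1, 0, 0])  # perfect ranking
print(mean_average_precision([video_a, video_b]))
```

A ranking that places every true highlight above every other clip scores 1.0, and each non-highlight ranked ahead of a highlight lowers the value.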

What problem does this paper attempt to address?

This paper addresses the problem of efficiently extracting key information from user-generated videos while removing redundant footage and combining the valuable segments into a coherent story. Specifically, it proposes a new task, Video Trimming (VT), whose goal is not only to select highly salient video segments but also to filter out unusable footage and rearrange the remaining segments logically to form the final video. The task targets a limitation of existing approaches (highlight detection, moment retrieval, and video summarization): they focus on content extraction and overlook the relationships between segments and the coherence of the overall narrative.

The main contributions of the paper are as follows:

1. **Introduce the video trimming task**: Extract the key content from long videos and generate condensed videos with a coherent storyline.
2. **Propose the Agent-based Video Trimming (AVT) algorithm**: The algorithm works in three stages: video structuring, clip filtering, and story composition. It converts video content into structured descriptions, filters out unusable clips, and composes the selected clips into a coherent final video.
3. **Create a new video trimming benchmark dataset**: The dataset contains raw user videos collected from the internet, and trimmed videos are assessed by a Video Evaluation Agent alongside human evaluation.
4. **Demonstrate superior performance on video trimming and zero-shot highlight detection**: User studies and results on multiple benchmarks support the effectiveness of the method.

Together, these contributions address the shortcomings of current video processing methods and offer a more comprehensive and efficient approach to editing video content.
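The three stages described above (structuring, filtering, composition) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Clip` fields, the rule "keep a clip when its highlight score exceeds its defect score", and the heuristic that stands in for the Video Arrangement Agent are all hypothetical, and the captions and scores are made up where the paper would obtain them from a Video Captioning Agent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """Structured description of one video slice (all fields are hypothetical)."""
    index: int        # position of the clip in the raw video
    caption: str      # text a captioning agent might produce
    highlight: float  # assumed saliency score in [0, 1]
    defect: float     # assumed footage-defect score in [0, 1], e.g. shake or blur


def filter_clips(clips: List[Clip]) -> List[Clip]:
    """Clip filtering: keep a clip only if its highlight score exceeds its defect score."""
    return [c for c in clips if c.highlight > c.defect]


def compose_story(clips: List[Clip], max_clips: int) -> List[Clip]:
    """Story composition stand-in: take the most salient clips, then restore time order.

    The paper uses an agent for this step; this heuristic is only a placeholder.
    """
    chosen = sorted(clips, key=lambda c: c.highlight, reverse=True)[:max_clips]
    return sorted(chosen, key=lambda c: c.index)


def trim(clips: List[Clip], max_clips: int = 3) -> List[Clip]:
    """Run filtering followed by composition on already structured clips."""
    return compose_story(filter_clips(clips), max_clips)


# Made-up output of the structuring stage for a six-clip skiing video.
sample_clips = [
    Clip(0, "camera shakes while walking", 0.2, 0.8),
    Clip(1, "skier starts down the slope", 0.7, 0.1),
    Clip(2, "lens pointed at the ground", 0.1, 0.9),
    Clip(3, "jump over a ramp", 0.9, 0.3),
    Clip(4, "slightly blurry landing", 0.6, 0.4),
    Clip(5, "friends wave at the camera", 0.5, 0.2),
]

for clip in trim(sample_clips):
    print(clip.index, clip.caption)
```

Comparing the two scores per clip, rather than applying one fixed saliency threshold, is one simple way to make the filter adapt to each clip. It lets a strong moment survive minor defects while discarding dull footage that is also flawed.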
