Abstract: In this era of videos, automatic video editing techniques attract more and more attention from industry and academia since they can reduce workloads and lower the requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting, yet automatic systems for general editing, e.g., movie or vlog editing, which covers various scenes and events, have rarely been studied, and converting an event-driven editing method to general scenes is nontrivial. In this paper, we propose a two-stage scheme for general editing. Firstly, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM) to extract editing-relevant representations as the editing context. Moreover, to close the gap between professional-looking videos and the automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: how to achieve automatic video editing in general scenarios, so as to reduce the workload of manual editing and decrease the dependence on professional editors. Existing automatic editing systems mainly focus on specific scenarios or events (such as football match broadcasts), while there is less research on general editing tasks (such as movie or vlog editing, covering various scenarios and events). There are great challenges in applying event-driven editing methods to general scenarios. To solve these problems, the author proposes an automatic video editing method based on Reinforcement Learning (RL) and uses a pre-trained Vision-Language Model (VLM) to extract editing-related representations. Specifically:

1. **Problem Definition**:
   - The author defines the editing task as predicting the multi-dimensional attributes of subsequent shots. Given the historical context shots $\{S_0, S_1, \dots, S_n\}$, the goal is to predict the attributes $\{A_{n+1}, \dots, A_{n+M}\}$ of the subsequent shots $\{S_{n+1}, \dots, S_{n+M}\}$.
   - The attribute $A_i = (a_1, \dots, a_8)$ of each shot is an 8-dimensional vector, and each element represents a type of attribute, such as shot size, angle, type, etc.
2. **Method Framework**:
   - **First Stage: Context Representation Extraction**:
     - Use a pre-trained Vision-Language Model (such as XCLIP) to extract editing-related representations without additional manual labels.
     - For each attribute $a_i$, construct a set of text prompts, calculate the prompt embedding $E_t^i$ through the text encoder of XCLIP, and extract the visual embedding $E_v^j$ through the video encoder.
     - Calculate the similarity $D_{j,i} = E_t^i E_v^j$ and apply the softmax function to obtain the attribute distribution $p_{j,i}$.
   - **Second Stage: Virtual Editor Training**:
     - Propose an editing framework based on Reinforcement Learning to optimize the virtual editor so that it can make better sequential editing decisions.
     - Define the state $E$, action $O$ and reward $r$, and train the network through the actor-critic scheme.
     - The state $E$ contains the attribute distribution of the context shots; the action $O$ is the editing decision; the reward $r$ measures the quality of the action.
3. **Experimental Verification**:
   - Conduct experiments on the AVE dataset to evaluate the performance of the proposed method in attribute prediction and retrieval tasks.
   - Introduce new evaluation metrics, such as two-step retrieval accuracy (2-rank 1) and two-step overall attribute accuracy (2-Acc), to verify the sequential decision-making ability.

In summary, this paper aims to solve the challenges of automatic video editing in general scenarios by combining pre-trained Vision-Language models and Reinforcement Learning techniques, thereby generating higher-quality automated edited videos.
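The similarity-then-softmax step of the first stage can be illustrated with a minimal sketch. It is not the paper's implementation: the three-dimensional vectors stand in for real text and video encoder outputs, the shot-size labels are hypothetical, and the helper names `dot`, `softmax`, and `attribute_distribution` are introduced here only for illustration. It shows how one dot product per prompt embedding $E_t^i$ against a shot's visual embedding $E_v^j$, followed by a softmax, yields a distribution $p_{j,i}$ over the candidate values of a single attribute.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attribute_distribution(prompt_embeddings, video_embedding):
    """Distribution over the candidate values of one attribute for one shot.

    prompt_embeddings: one text embedding per candidate value (toy stand-ins
        for text-encoder outputs).
    video_embedding: visual embedding of the shot (toy stand-in for the
        video-encoder output).
    """
    similarities = [dot(e_t, video_embedding) for e_t in prompt_embeddings]
    return softmax(similarities)


# Hypothetical candidate values for a "shot size" attribute, with toy embeddings.
shot_size_prompts = {
    "close-up": [1.0, 0.0, 0.0],
    "medium": [0.0, 1.0, 0.0],
    "wide": [0.0, 0.0, 1.0],
}
shot_embedding = [0.9, 0.3, 0.1]

probs = attribute_distribution(list(shot_size_prompts.values()), shot_embedding)
best_label = max(zip(shot_size_prompts, probs), key=lambda pair: pair[1])[0]
print(best_label, [round(p, 3) for p in probs])
```

Repeating this for each of the eight attributes and each context shot would give the per-shot attribute distributions that, according to the summary above, make up the state $E$ seen by the virtual editor.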

A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

Edit As You Wish: Video Caption Editing with Multi-grained User Control

Towards Data-Driven Automatic Video Editing

M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers

Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit

LOVECon: Text-driven Training-Free Long Video Editing with ControlNet

Editing like Humans: A Contextual, Multimodal Framework for Automated Video Editing

Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts

Context-Aware Talking-Head Video Editing

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

Edit3K: Universal Representation Learning for Video Editing Components

Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation

Prompting Visual-Language Models for Efficient Video Understanding

EffiVED: Efficient Video Editing via Text-instruction Diffusion Models

LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing

UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing

Reframe Anything: LLM Agent for Open World Video Reframing

Video Editing for Video Retrieval

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback