A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

Panwen Hu, Nan Xiao, Feifei Li, Yongquan Chen, Rui Huang
2024-11-08
Abstract: In this era of videos, automatic video editing techniques attract more and more attention from industry and academia since they can reduce workloads and lower the requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting, yet the automatic systems for general editing, e.g., movie or vlog editing which covers various scenes and events, were rarely studied before, and converting the event-driven editing method to a general scene is nontrivial. In this paper, we propose a two-stage scheme for general editing. Firstly, unlike previous works that extract scene-specific features, we leverage the pre-trained Vision-Language Model (VLM) to extract the editing-relevant representations as editing context. Moreover, to close the gap between the professional-looking videos and the automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses automatic video editing in general scenarios, with the aim of reducing the workload of manual editing and the dependence on professional editors. Existing automatic editing systems mainly target specific scenes or events (such as soccer game broadcasts), while general editing tasks (such as movie or vlog editing, which cover varied scenes and events) have rarely been studied, and adapting event-driven editing methods to general scenes is nontrivial. To address this, the authors propose an automatic video editing method based on Reinforcement Learning (RL) that uses a pre-trained Vision-Language Model (VLM) to extract editing-relevant representations. Specifically:

1. **Problem Definition**:
   - The editing task is defined as predicting the multi-dimensional attributes of subsequent shots. Given the historical context shots $\{S_0, S_1, \dots, S_n\}$, the goal is to predict the attributes $\{A_{n+1}, \dots, A_{n+M}\}$ of the subsequent shots $\{S_{n+1}, \dots, S_{n+M}\}$.
   - The attribute $A_i = (a_1, \dots, a_8)$ of each shot is an 8-dimensional vector in which each element represents one type of attribute, such as shot size, angle, or type.
2. **Method Framework**:
   - **First Stage: Context Representation Extraction**:
     - Use a pre-trained Vision-Language Model (such as XCLIP) to extract editing-relevant representations without additional manual labels.
     - For each attribute $a_i$, construct a set of text prompts, compute the prompt embedding $E_t^i$ with the text encoder of XCLIP, and extract the visual embedding $E_v^j$ of shot $S_j$ with the video encoder.
     - Compute the similarity $D_{j,i} = E_t^i E_v^j$ and apply the softmax function to obtain the attribute distribution $p_{j,i}$.
   - **Second Stage: Virtual Editor Training**:
     - Propose an RL-based editing framework that optimizes the virtual editor so that it makes better sequential editing decisions.
     - Define the state $E$, the action $O$, and the reward $r$, and train the network with an actor-critic scheme.
     - The state $E$ contains the attribute distributions of the context shots, the action $O$ is the editing decision, and the reward $r$ measures the quality of the action.
3. **Experimental Verification**:
   - Conduct experiments on the AVE dataset to evaluate the proposed method on attribute prediction and retrieval tasks.
   - Introduce new evaluation metrics, such as two-step retrieval accuracy (2-rank 1) and two-step overall attribute accuracy (2-Acc), to verify the sequential decision-making ability.

In summary, the paper tackles automatic video editing in general scenarios by combining a pre-trained Vision-Language Model with Reinforcement Learning, with the goal of producing higher-quality automatically edited videos.
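
The first-stage computation summarized above (prompt-to-shot similarity followed by a softmax) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the embeddings are random stand-ins rather than real XCLIP outputs, the function name `attribute_distribution`, the class counts, and the embedding size are hypothetical, and concatenating the per-attribute distributions into a single vector is only one assumed way of forming a shot's contribution to the state $E$.

```python
import numpy as np


def attribute_distribution(prompt_emb: np.ndarray, video_emb: np.ndarray) -> np.ndarray:
    """Softmax over prompt similarities for one attribute of one shot.

    prompt_emb: (K, d) array, one row per text prompt (candidate class).
    video_emb:  (d,) array, the visual embedding of the shot.
    """
    similarity = prompt_emb @ video_emb       # D = E_t E_v, shape (K,)
    shifted = similarity - similarity.max()   # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()            # attribute distribution p, sums to 1


# Toy usage with random stand-in embeddings (assumption: not real XCLIP outputs).
rng = np.random.default_rng(0)
dim = 16                                      # hypothetical embedding size
classes_per_attribute = [5, 4, 3]             # hypothetical class counts for 3 attributes
video_emb = rng.normal(size=dim)

distributions = [
    attribute_distribution(rng.normal(size=(k, dim)), video_emb)
    for k in classes_per_attribute
]

# Assumed state layout: concatenate the per-attribute distributions of a shot.
shot_state = np.concatenate(distributions)
```

In the method described above there are eight attributes per shot, so the same function would be applied once per attribute and per context shot. A real pipeline may also scale the similarities (for example with a temperature) before the softmax; the summary does not specify this, so the sketch leaves it out.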