Abstract: Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors overlay text or images and trim footage to enhance video quality and make the video more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors, who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality, namely natural language (NL) and sketching, two modalities humans naturally use for expression, can be utilized to support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed the patterns of use of NL and sketching in describing edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command and spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. The system allowed participants to perform edits more efficiently and generate more ideas by generating edits based on the user's multimodal edit commands and supporting iterations on the editing commands. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the difficulties encountered in video editing, particularly for beginners, for whom expressing and realizing editing intentions is challenging and time-consuming. Specifically, the paper attempts to solve this problem through the following approaches:

1. **Multimodal Expression**: Utilizing two commonly used human expression methods, natural language (NL) and hand-drawn sketches, to enable video editors to express their editing intentions more intuitively.
2. **System Design**: Proposing a multimodal video editing system named ExpressEdit, which supports expressing editing commands through natural language text and hand-drawn sketches on video frames.
3. **Technical Implementation**: Leveraging a technical pipeline of computer vision and large language models, the system can parse and understand the temporal position, spatial position, and editing operations and their parameters in natural language text and hand-drawn sketches, and execute the corresponding editing tasks accordingly.

In this way, the ExpressEdit system not only makes it easier for beginners to express their editing intentions but also enhances their efficiency and creativity in realizing these intentions.
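To make the three kinds of references more concrete, the following is a minimal Python sketch of how a parsed edit command might be represented, with a temporal reference, a spatial reference, and an operation with parameters. It is an illustration only, not the ExpressEdit implementation: the class names (`TemporalRef`, `SpatialRef`, `EditCommand`), the field layout, and the helper `sketch_to_spatial_ref` are all hypothetical. The helper assumes a sketch arrives as a list of pixel coordinates and simply reduces it to a normalized bounding box; it does not involve any LLM or vision model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TemporalRef:
    """Hypothetical temporal reference: a segment of the video in seconds."""
    start_s: float
    end_s: float


@dataclass
class SpatialRef:
    """Hypothetical spatial reference: a box normalized to the frame (0..1)."""
    x: float
    y: float
    width: float
    height: float


@dataclass
class EditCommand:
    """Hypothetical container for one interpreted edit."""
    operation: str                                  # e.g. "add_text", "trim"
    parameters: dict = field(default_factory=dict)  # operation-specific settings
    temporal: Optional[TemporalRef] = None
    spatial: Optional[SpatialRef] = None


def sketch_to_spatial_ref(points, frame_w, frame_h):
    """Reduce sketch points (pixel coordinates) to a normalized bounding box.

    Assumption: a sketch is a non-empty list of (x, y) pixel positions.
    Points outside the frame are clamped to the frame edges.
    """
    if not points:
        raise ValueError("sketch has no points")
    xs = [min(max(p[0], 0), frame_w) for p in points]
    ys = [min(max(p[1], 0), frame_h) for p in points]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    return SpatialRef(
        x=left / frame_w,
        y=top / frame_h,
        width=(right - left) / frame_w,
        height=(bottom - top) / frame_h,
    )


# Usage: a rectangle sketched on a 400x200 frame, attached to a text overlay
# that should appear between 12 s and 20 s of the video.
box = sketch_to_spatial_ref([(100, 50), (300, 50), (300, 150), (100, 150)], 400, 200)
command = EditCommand(
    operation="add_text",
    parameters={"content": "Step 1"},
    temporal=TemporalRef(start_s=12.0, end_s=20.0),
    spatial=box,
)
```

Keeping the three references in separate fields mirrors the summary's description that each is interpreted on its own, and it would let a user revise one of them (for example, the time range) without re-specifying the others.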

ExpressEdit: Video Editing with Natural Language and Sketching

Editing like Humans: A Contextual, Multimodal Framework for Automated Video Editing

EditScribe: Non-Visual Image Editing with Natural Language Verification Loops

Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing

M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers

Text-based editing of talking-head video

VideoMap: Supporting Video Editing Exploration, Brainstorming, and Prototyping in the Latent Space

Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era

Analogies based video editing

Edit As You Wish: Video Caption Editing with Multi-grained User Control

The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing

Cross-Attention and Seamless Replacement of Latent Prompts for High-Definition Image-Driven Video Editing

Neutral Editing Framework for Diffusion-based Video Editing

Iterative Motion Editing with Natural Language

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

MagicStick: Controllable Video Editing via Control Handle Transformations

VCoME: Verbal Video Composition with Multimodal Editing Effects

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing