Vlogger: Make Your Dream A Vlog

Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang

2024-01-18

Abstract: In this work, we present Vlogger, a generic AI system for generating a minute-level video blog (i.e., vlog) from user descriptions. Unlike short videos of a few seconds, a vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, our Vlogger smartly leverages a Large Language Model (LLM) as Director and decomposes the long video generation task of a vlog into four key stages, where we invoke various foundation models to play the critical roles of vlog professionals, including (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings, our Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover, we introduce a novel video diffusion model, ShowMaker, which serves as a videographer in our Vlogger for generating the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts, it can effectively enhance spatial-temporal coherence in the snippet. Besides, we design a concise mixed training paradigm for ShowMaker, boosting its capacity for both T2V generation and prediction. Finally, extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor. The code and model are all available at

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia

What problem does this paper attempt to address?

This paper addresses the challenge of generating high-quality long video blogs (vlogs). Specifically, it proposes a general AI system named **Vlogger** that automatically generates minute-level vlogs from user descriptions. Unlike most existing methods, which generate short videos, Vlogger tackles the complex narrative and multi-scene coherence issues of long video generation by mimicking the professional process of human vlog creation.

#### Main Issues:

1. **Coherence in Long Video Generation**: Existing video diffusion models primarily generate short videos of a few seconds, whereas vlogs typically contain complex storylines and diverse scenes, which current methods handle poorly.
2. **Data Requirements and Training Burden**: Training a model for long video generation requires a large amount of annotated long-video data, which is difficult to obtain in practice. Iterative generation methods avoid this requirement but can produce incoherent content across scene transitions.

#### Solutions:

- **Stage-wise Planning and Generation**: Vlogger uses a large language model (LLM) as a "director" to decompose the long video generation task into four key stages: script writing (Script), character design (Actor), video shooting (ShowMaker), and dubbing (Voicer). Coherence across the vlog comes from the cooperation of these stages.
- **Introduction of the ShowMaker Model**: To enhance the spatial and temporal coherence of video segments, the paper proposes a new video diffusion model called ShowMaker. It takes script text and actor images as prompts and generates video segments of controllable duration.
- **Hybrid Training Paradigm**: ShowMaker is trained with a mixed paradigm that combines text-to-video (T2V) generation and prediction modes, so that a single model can handle both tasks.
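The four-stage decomposition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Scene` and `Vlog` structures, the `make_vlog` function, and all `stub_*` helpers are hypothetical names introduced here, and the foundation models behind each stage are replaced by plain callables that return placeholder strings. The sketch only shows the control flow: plan scenes top-down, create one reference image per actor, then generate a snippet and narration per scene.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Scene:
    """One planned shooting scene (hypothetical structure, not from the paper)."""
    description: str      # script text for this scene
    actors: List[str]     # names of the actors that appear in the scene
    duration_s: float     # planned snippet duration in seconds


@dataclass
class Vlog:
    """Placeholder container for the generated outputs."""
    snippets: List[str] = field(default_factory=list)
    narration: List[str] = field(default_factory=list)


def make_vlog(
    story: str,
    write_script: Callable[[str], List[Scene]],
    design_actor: Callable[[str], str],
    show_maker: Callable[[str, List[str], float], str],
    voicer: Callable[[str], str],
) -> Vlog:
    # Stage 1 (Script): split the user story into scenes (top-down planning).
    scenes = write_script(story)

    # Stage 2 (Actor): create one reference image per distinct actor so the
    # same image can be reused in every scene where that actor appears.
    actor_refs: Dict[str, str] = {}
    for scene in scenes:
        for name in scene.actors:
            if name not in actor_refs:
                actor_refs[name] = design_actor(name)

    # Stages 3 and 4 (ShowMaker, Voicer): generate a snippet and a narration
    # track for each scene (bottom-up shooting).
    vlog = Vlog()
    for scene in scenes:
        refs = [actor_refs[name] for name in scene.actors]
        vlog.snippets.append(show_maker(scene.description, refs, scene.duration_s))
        vlog.narration.append(voicer(scene.description))
    return vlog


# Stub stages standing in for the LLM, image, video, and speech models.
# Assumption for the demo only: every scene lasts 4 seconds and stars "Teddy".
def stub_script(story: str) -> List[Scene]:
    return [Scene(s.strip(), ["Teddy"], 4.0) for s in story.split(".") if s.strip()]


def stub_actor(name: str) -> str:
    return f"<image of {name}>"


def stub_show_maker(text: str, refs: List[str], duration_s: float) -> str:
    return f"<{duration_s:.0f}s clip: {text} | {', '.join(refs)}>"


def stub_voicer(text: str) -> str:
    return f"<speech: {text}>"


demo = make_vlog(
    "A bear wakes up. The bear goes fishing.",
    stub_script, stub_actor, stub_show_maker, stub_voicer,
)
```

Passing the stages in as callables keeps the planning logic independent of any particular model, and caching actor references by name is one simple way to reuse the same visual prompt across scenes, which is the role the summary attributes to the Actor stage.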
