Abstract: Inspired by the success of text-to-image (T2I) generation, many researchers are turning to text-to-video (T2V) generation. Most T2V frameworks inherit from a T2I model and add extra temporal layers that are trained to generate dynamic videos, which can be viewed as a fine-tuning task. However, the traditional 3D-Unet is serial: the temporal layers follow the spatial layers, and this serial feature flow results in high GPU memory and training time consumption. We believe that this serial mode will incur even higher training costs as diffusion models and datasets grow, which is neither environmentally friendly nor suitable for the development of T2V. Therefore, we propose a highly efficient spatial-temporal parallel training paradigm for T2V tasks, named Mobius. In our 3D-Unet, the temporal layers and spatial layers are parallel, which optimizes the feature flow and backpropagation. Mobius saves 24% GPU memory and 12% training time, which can greatly improve the T2V fine-tuning task and provide a novel insight for the AIGC community. We will release our code in the future.
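
The difference between the serial and parallel layouts described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "layers" are simple averaging functions rather than attention or convolution, the names `spatial_layer`, `temporal_layer`, `serial_block` and `parallel_block` are hypothetical, and the parallel branches are fused by averaging because the abstract does not say how Mobius combines them. The sketch shows only the data dependency, not the reported memory or time savings.

```python
from typing import List

# Toy "video features": a list of frames, each frame a list of pixel values.
Video = List[List[float]]


def spatial_layer(video: Video) -> Video:
    """Toy spatial mixing: pull each pixel halfway toward its own frame's mean."""
    out = []
    for frame in video:
        mean = sum(frame) / len(frame)
        out.append([0.5 * p + 0.5 * mean for p in frame])
    return out


def temporal_layer(video: Video) -> Video:
    """Toy temporal mixing: pull each pixel halfway toward its mean across frames."""
    n_frames, n_pixels = len(video), len(video[0])
    means = [sum(video[f][p] for f in range(n_frames)) / n_frames
             for p in range(n_pixels)]
    return [[0.5 * frame[p] + 0.5 * means[p] for p in range(n_pixels)]
            for frame in video]


def serial_block(video: Video) -> Video:
    # Conventional layout: the temporal layer consumes the spatial layer's output.
    return temporal_layer(spatial_layer(video))


def parallel_block(video: Video) -> Video:
    # Assumed parallel layout: both branches read the same input.
    # Averaging the two outputs is an illustrative choice, not the paper's rule.
    s = spatial_layer(video)
    t = temporal_layer(video)
    return [[0.5 * (a + b) for a, b in zip(fs, ft)] for fs, ft in zip(s, t)]


video = [[1.0, 3.0], [5.0, 7.0]]   # 2 frames x 2 pixels
serial_out = serial_block(video)
parallel_out = parallel_block(video)
```

In the serial block the temporal branch cannot start until the spatial branch has finished, whereas in the parallel block neither branch depends on the other's output.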

What problem does this paper attempt to address?

The paper targets the high GPU memory consumption and long training time of conventional text-to-video (T2V) training. Specifically:

1. **Serial design of the traditional 3D-Unet**: Most existing T2V frameworks inherit a text-to-image (T2I) model and add temporal layers to generate dynamic videos. In this design each temporal layer follows a spatial layer, so features flow through the two in sequence, which increases GPU memory use and training time.
2. **Environmental and resource cost**: As diffusion models and datasets grow, the serial design makes training more expensive, which the authors describe as neither environmentally friendly nor conducive to the development of T2V.

To address these problems, the authors propose Mobius, an efficient spatial-temporal parallel training paradigm. It places the temporal and spatial layers of the 3D-Unet in parallel, which optimizes the feature flow and backpropagation and thereby reduces GPU memory usage and training time.

### Main contributions:

- **Reducing resource consumption**: Compared with traditional methods, Mobius saves 24% of GPU memory and 12% of training time.
- **Improving training efficiency**: By processing the temporal and spatial layers in parallel, Mobius improves the training efficiency of T2V tasks.
- **Providing new insights**: The work offers the AIGC community a new approach to fine-tuning large-scale models.

### Formula representation:

Some of the key formulas involved in the paper are as follows:

- **Forward process of the Markov chain**: \[ q(x_t|x_{t-1})=\mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I), \quad t = 1, \ldots, T \] where \(\beta_t\in(0, 1)\) is the noise variance at step \(t\), usually set by a fixed schedule rather than learned.
- **Backward process**: \[ p_\theta(x_{t-1}|x_t)=\mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)), \quad t = 1, \ldots, T \]
- **Objective function**: \[ \mathbb{E}_{x, \epsilon\sim\mathcal{N}(0, I), t}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|_2^2\right] \]

These formulas describe the standard diffusion model on which Mobius is built. According to the abstract, the efficiency gain comes from arranging the spatial and temporal layers of the 3D-Unet in parallel, not from changing these formulas.
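
The forward process and objective above can be sketched in a few lines of plain Python. This is a minimal illustration and not the paper's code. The linear schedule and its endpoints are assumed values, the helper names are hypothetical, and the closed form \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\) with \(\bar\alpha_t = \prod_{s\le t}(1-\beta_s)\) is the standard consequence of chaining the Gaussian forward steps; it is not stated in the text above.

```python
import math
import random
from typing import List, Sequence


def linear_betas(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Fixed linear variance schedule beta_1..beta_T (endpoints are illustrative)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bars(betas: Sequence[float]) -> List[float]:
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0: Sequence[float], t: int, abar: Sequence[float],
             eps: Sequence[float]) -> List[float]:
    """Sample x_t from x_0 in one shot; t is 1-indexed as in the formulas."""
    a = abar[t - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def noise_prediction_loss(eps: Sequence[float], eps_pred: Sequence[float]) -> float:
    """Squared L2 distance between true and predicted noise for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)                        # seeded for reproducibility
betas = linear_betas(1000)
abar = alpha_bars(betas)
x0 = [0.5, -1.0, 2.0]                         # toy "clean" sample
eps = [rng.gauss(0.0, 1.0) for _ in x0]       # eps ~ N(0, I)
x_t = q_sample(x0, 500, abar, eps)            # noised sample at step t = 500
# Stand-in for eps_theta(x_t, t): a predictor that always outputs zeros.
loss = noise_prediction_loss(eps, [0.0] * len(eps))
```

In real training the zero predictor would be replaced by the network \(\epsilon_\theta\), and the loss would be averaged over samples and over randomly drawn timesteps.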

Mobius: A High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Text-Visual Prompting for Efficient 2D Temporal Video Grounding

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Seer: Language Instructed Video Prediction with Latent Diffusion Models

MoViNets: Mobile Video Networks for Efficient Video Recognition

Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers

Mimir: Improving Video Diffusion Models for Precise Text Understanding

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge

ControlVideo: Training-free Controllable Text-to-Video Generation

Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

VideoTetris: Towards Compositional Text-to-Video Generation

Vivid-ZOO: Multi-View Video Generation with Diffusion Model

A Multigrid Method for Efficiently Training Video Models

An Efficient 2D Method for Training Super-Large Deep Learning Models

Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation