Mobius: A High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task

Yiran Yang, Jinchao Zhang, Ying Deng, Jie Zhou
2024-07-23
Abstract: Inspired by the success of the text-to-image (T2I) generation task, many researchers are devoting themselves to the text-to-video (T2V) generation task. Most T2V frameworks inherit from a T2I model and add extra temporal layers that are trained to generate dynamic videos, which can be viewed as a fine-tuning task. However, the traditional 3D-Unet is serial: the temporal layers follow the spatial layers, and this serial feature flow results in high GPU memory and training time consumption. We believe that this serial mode will bring more training costs with large diffusion models and massive datasets, which is neither environmentally friendly nor suitable for the development of T2V. Therefore, we propose a highly efficient spatial-temporal parallel training paradigm for T2V tasks, named Mobius. In our 3D-Unet, the temporal layers and spatial layers are parallel, which optimizes the feature flow and backpropagation. Mobius saves 24% GPU memory and 12% training time, which can greatly improve the T2V fine-tuning task and provide a novel insight for the AIGC community. We will release our code in the future.
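As a rough illustration of the serial and parallel layouts contrasted in the abstract, the toy sketch below wires the same two layers both ways. It is not the authors' implementation: `spatial_layer` and `temporal_layer` are hypothetical stand-ins that only average values, and merging the parallel branches by averaging is an assumption, since the abstract does not say how the two branches are combined.

```python
def spatial_layer(video):
    """Toy spatial block: mixes values within each frame (hypothetical stand-in)."""
    out = []
    for frame in video:
        mean = sum(frame) / len(frame)
        out.append([0.5 * (v + mean) for v in frame])
    return out


def temporal_layer(video):
    """Toy temporal block: mixes values across frames at each position."""
    num_frames = len(video)
    num_pos = len(video[0])
    means = [
        sum(video[f][p] for f in range(num_frames)) / num_frames
        for p in range(num_pos)
    ]
    return [[0.5 * (frame[p] + means[p]) for p in range(num_pos)] for frame in video]


def serial_block(video):
    """Serial layout: the temporal layer consumes the spatial layer's output."""
    return temporal_layer(spatial_layer(video))


def parallel_block(video):
    """Parallel layout: both layers read the same input.

    Averaging the two branch outputs is an assumption made for this sketch.
    """
    s = spatial_layer(video)
    t = temporal_layer(video)
    return [[0.5 * (a + b) for a, b in zip(fs, ft)] for fs, ft in zip(s, t)]


# Two frames with two spatial positions each.
video = [[1.0, 3.0], [5.0, 7.0]]
print(serial_block(video))
print(parallel_block(video))
```

In the serial block the temporal layer cannot start until the spatial output exists, while in the parallel block neither branch depends on the other. The sketch shows only this difference in data dependency; it does not reproduce the reported memory or time savings.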
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problems this paper addresses are the high GPU memory consumption and long training time of traditional text-to-video (T2V) generation. Specifically:

1. **Problems with the serial mode of the traditional 3D-Unet**: Most existing T2V frameworks inherit a text-to-image (T2I) model and add temporal layers to generate dynamic videos. This design is serial, that is, the temporal layers follow the spatial layers. The feature flow is therefore processed serially, which increases GPU memory consumption and training time.
2. **Environmental and resource cost**: As diffusion models and datasets grow, the traditional serial mode brings higher training costs, which is both environmentally unfriendly and an obstacle to the development of T2V tasks.

To solve these problems, the authors propose an efficient spatio-temporal parallel training paradigm, Mobius. This paradigm places the temporal and spatial layers of the 3D-Unet in parallel, optimizing the feature flow and backpropagation and thereby reducing GPU memory usage and training time.

### Main contributions:

- **Reducing resource consumption**: Compared with the traditional serial design, Mobius saves 24% of GPU memory and 12% of training time.
- **Improving training efficiency**: By processing the temporal and spatial layers in parallel, Mobius improves the training efficiency of T2V tasks.
- **Providing new insights**: This research offers the AIGC community new ideas for fine-tuning large-scale models.

### Formula representation:

Some of the key formulas involved in the paper are as follows:

- **Forward process of the Markov chain**:
  \[ q(x_t|x_{t - 1})=\mathcal{N}(x_t; \sqrt{1 - \beta_t}x_{t - 1}, \beta_tI), \quad t = 1, \ldots, T \]
  where \(\beta_t\in(0, 1)\) is the variance schedule, typically fixed in advance rather than learned.
- **Backward process**:
  \[ p_\theta(x_{t - 1}|x_t)=\mathcal{N}(x_{t - 1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)), \quad t = 1, \ldots, T \]
- **Objective function**:
  \[ \mathbb{E}_{x, \epsilon\sim\mathcal{N}(0, 1), t}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|_2^2\right] \]

These formulas describe the basic principles of the diffusion model that Mobius builds on. The efficiency gain comes from restructuring the 3D-Unet so that the temporal and spatial layers run in parallel, not from changing the formulas themselves.
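The forward process and objective above can be made concrete with a small pure-Python sketch. It rests on several assumptions that are not taken from the paper: the linear schedule and its endpoint values are illustrative, the helper names are hypothetical, and sampling uses the standard closed form \(q(x_t|x_0)=\mathcal{N}(\sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)I)\) with \(\bar\alpha_t=\prod_{s\le t}(1-\beta_s)\), which follows from composing the per-step Gaussians. The noise predictor is replaced by a constant zero guess rather than a network.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fixed variance schedule beta_1..beta_T (endpoint values are illustrative)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, noise):
    """Draw x_t from q(x_t | x_0) in closed form; t is 1-indexed."""
    a = alpha_bars[t - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]


def noise_prediction_loss(noise, predicted_noise):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(noise, predicted_noise))


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)

x0 = [0.5, -1.0, 2.0]                      # a tiny stand-in for a clean sample
noise = [rng.gauss(0.0, 1.0) for _ in x0]  # epsilon ~ N(0, I)
x_t = q_sample(x0, t=500, alpha_bars=alpha_bars, noise=noise)

# Stand-in predictor that always guesses zero noise; a real epsilon_theta
# would be a network that takes x_t and t as input.
loss = noise_prediction_loss(noise, [0.0] * len(noise))
print(x_t, loss)
```

A training step would average this loss over random samples, noise draws, and timesteps, then update the network parameters. That part is omitted here because it depends on the model architecture.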