Johanna Karras, Yingwei Li, Nan Liu, Luyang Zhu, Innfarn Yoo, Andreas Lugmayr, Chris Lee, Ira Kemelmacher-Shlizerman
Abstract: We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: https://johannakarras.github.io/Fashion-VDM.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
The paper "Fashion-VDM: Video Diffusion Model for Virtual Try-On" aims to address several key issues in video virtual try-on (VVT):
1. **Generating High-Quality Virtual Try-On Videos**:
- Given a clothing image and a person video, the goal is to generate a high-quality virtual try-on video that shows the person wearing the specified clothing while preserving the person's identity and movements.
2. **Maintaining Temporal and Spatial Consistency**:
- Existing video virtual try-on methods often lose garment details and temporal consistency during generation. Specifically, frames that show the person from different viewpoints are inconsistent with one another, and the methods fail to realistically simulate fabric dynamics (such as wrinkles, folds, and flow).
3. **Handling Diverse Poses and Complex Clothing**:
- When the pose of the person and the pose of the garment differ significantly, regions that are occluded in the inputs must be plausibly inferred and generated, which existing methods struggle to do.
4. **Addressing Data Scarcity Issues**:
- Perfect ground-truth data (i.e., two videos of different people wearing the same clothing and moving in exactly the same way) is difficult and costly to obtain. Existing human video data (such as the UBC Fashion dataset) is scarcer and less diverse than image data (such as LAION-5B).
### Solution
To address the above issues, the authors propose Fashion-VDM, a video virtual try-on method based on a diffusion model. The main innovations include:
1. **Diffusion Model Architecture**:
- Using a diffusion model to generate videos, with 3D convolutions and temporal attention blocks to maintain temporal consistency, and producing videos of up to 64 frames at 512px in a single inference pass.
2. **Split Classifier-Free Guidance (Split-CFG)**:
- Introducing split classifier-free guidance, which controls each conditioning signal independently, thereby improving the fidelity of the clothing and the realism of the video.
3. **Progressive Temporal Training**:
- Using a progressive temporal training strategy, gradually increasing the video length from 1 frame to 64 frames, enhancing multi-frame consistency, and reducing training time and memory requirements.
4. **Joint Image and Video Training**:
- Combining image and video data during the temporal training phase, increasing data diversity and training stability, particularly improving the synthesis details in occluded areas.
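As a rough illustration of the Split-CFG idea in item 2, the sketch below gives each conditioning signal its own guidance weight by adding the conditions one at a time and scaling the change each one causes in the noise prediction. This is only one plausible form and is an assumption: the summary does not state the paper's exact formula, the order of the conditions, or the weights. The names `split_cfg`, `predict`, and `toy_predict` are hypothetical, and the toy predictor stands in for a real denoising network.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical predictor signature: (noisy latent, active conditions) -> noise estimate.
Predictor = Callable[[Vector, Sequence[float]], Vector]


def split_cfg(
    predict: Predictor,
    z: Vector,
    conditions: Sequence[float],
    weights: Sequence[float],
) -> Vector:
    """Combine noise predictions using one guidance weight per condition.

    Assumed form (not confirmed by the source):
        eps = eps(z, {}) + sum_i w_i * (eps(z, c_1..c_i) - eps(z, c_1..c_{i-1}))
    """
    if len(conditions) != len(weights):
        raise ValueError("need exactly one weight per condition")

    active: List[float] = []
    previous = predict(z, active)      # unconditional prediction
    guided = list(previous)
    for condition, weight in zip(conditions, weights):
        active = active + [condition]  # switch on one more condition
        current = predict(z, active)
        # Scale only the change caused by the newly added condition.
        guided = [g + weight * (c - p) for g, c, p in zip(guided, current, previous)]
        previous = current
    return guided


def toy_predict(z: Vector, active: Sequence[float]) -> Vector:
    """Toy stand-in for a denoiser: shifts the latent by the sum of active conditions."""
    shift = sum(active)
    return [value + shift for value in z]


if __name__ == "__main__":
    latent = [0.0, 1.0]
    # With every weight equal to 1 the result equals the fully conditioned prediction.
    print(split_cfg(toy_predict, latent, [1.0, 2.0], [1.0, 1.0]))
    # Raising only the first weight strengthens only the first condition.
    print(split_cfg(toy_predict, latent, [1.0, 2.0], [2.0, 1.0]))
```

Under this assumed form, a weight of 1 on every condition reduces to ordinary conditional sampling, so any weight above 1 can be read as extra emphasis on that particular signal (for example, the garment image versus the person video).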
Through these innovations, Fashion-VDM generates high-quality, temporally consistent virtual try-on videos, and the authors report that it outperforms existing video virtual try-on methods in both qualitative and quantitative experiments.
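The progressive temporal training and joint image-video training described in items 3 and 4 can be pictured as a schedule of clip lengths plus a per-step choice between an image batch and a video batch. The sketch below is a minimal illustration under stated assumptions: the source only says the length grows from 1 to 64 frames, so the doubling schedule, the `image_fraction` mixing ratio, and the function names `frame_schedule` and `batch_plan` are all hypothetical.

```python
import random
from typing import List, Tuple


def frame_schedule(start: int = 1, end: int = 64, factor: int = 2) -> List[int]:
    """Clip lengths for successive training stages (doubling is an assumption)."""
    if start < 1 or end < start or factor < 2:
        raise ValueError("need 1 <= start <= end and factor >= 2")
    stages: List[int] = []
    frames = start
    while frames < end:
        stages.append(frames)
        frames *= factor
    stages.append(end)  # always finish at the target length
    return stages


def batch_plan(
    num_frames: int, steps: int, image_fraction: float, seed: int = 0
) -> List[Tuple[str, int]]:
    """Decide, step by step, whether to train on an image batch or a video batch.

    image_fraction is a hypothetical mixing ratio; the source does not give one.
    """
    rng = random.Random(seed)
    plan: List[Tuple[str, int]] = []
    for _ in range(steps):
        # A 1-frame stage is image training by definition.
        if num_frames == 1 or rng.random() < image_fraction:
            plan.append(("image", 1))
        else:
            plan.append(("video", num_frames))
    return plan


if __name__ == "__main__":
    for frames in frame_schedule():
        print(frames, batch_plan(frames, steps=4, image_fraction=0.25))
```

Starting with short clips keeps the early stages cheap in time and memory, and mixing in image batches at the longer stages is one way to keep drawing on the larger, more diverse image data when video data is limited.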