Abstract: We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: <a class="link-external link-https" href="https://johannakarras.github.io/Fashion-VDM" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper "Fashion-VDM: Video Diffusion Model for Virtual Try-On" addresses several key issues in video virtual try-on (VVT):

1. **Generating High-Quality Virtual Try-On Videos**:
   - Given a clothing image and a person video, the goal is to generate a high-quality virtual try-on video that shows the person wearing the specified clothing while preserving the person's identity and movements.
2. **Maintaining Temporal and Spatial Consistency**:
   - Existing video virtual try-on methods often lose clothing details and temporal consistency during generation. Specifically, frames generated from different viewpoints can be inconsistent with one another, and the methods fail to realistically simulate fabric dynamics (such as wrinkles, folds, and flow).
3. **Handling Diverse Poses and Complex Clothing**:
   - When the poses of the person and the clothing change significantly, existing methods struggle with occluded areas, which must be plausibly inferred and generated.
4. **Addressing Data Scarcity Issues**:
   - Perfect ground truth data (i.e., two videos of two different people wearing the same clothing and moving in exactly the same way) is difficult and costly to obtain. Existing human video data (such as the UBC Fashion dataset) is scarcer and less diverse than image data (such as LAION 5B).

### Solution

To address the above issues, the authors propose Fashion-VDM, a video virtual try-on method based on a diffusion model. The main innovations include:

1. **Diffusion Model Architecture**:
   - Using a diffusion model to generate videos, with 3D convolutions and temporal attention blocks to maintain temporal consistency. The model can generate videos of up to 64 frames in a single inference pass.
2. **Split Classifier-Free Guidance (Split-CFG)**:
   - Introducing split classifier-free guidance, which controls multiple conditioning signals independently, thereby improving the fidelity of the clothing and the realism of the video.
3. **Progressive Temporal Training**:
   - Using a progressive temporal training strategy that gradually increases the video length from 1 frame to 64 frames, enhancing multi-frame consistency and reducing training time and memory requirements.
4. **Joint Image and Video Training**:
   - Combining image and video data during the temporal training phase, which increases data diversity and training stability and particularly improves the synthesized details in occluded areas.

Through these innovations, Fashion-VDM generates high-quality, temporally consistent virtual try-on videos and, according to the authors' qualitative and quantitative experiments, outperforms existing baseline methods.
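The idea behind split classifier-free guidance can be illustrated with a minimal Python sketch. It assumes one common cascaded form: start from the unconditional noise prediction, switch on one conditioning group at a time, and add each group's weighted contribution. The exact formula used in the paper may differ. The names `split_cfg` and `toy_denoiser`, the group labels `"person"` and `"garment"`, and the denoiser signature are all hypothetical and chosen only for illustration. A real model would operate on tensors rather than Python lists.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical denoiser signature: maps the set of active conditioning
# groups to a noise estimate (a plain list of floats in this sketch).
Denoiser = Callable[[frozenset], Vector]


def split_cfg(denoise: Denoiser, groups: Sequence[str],
              weights: Sequence[float]) -> Vector:
    """Combine noise predictions using one guidance weight per group."""
    if len(groups) != len(weights):
        raise ValueError("need exactly one weight per conditioning group")
    active: frozenset = frozenset()
    previous = denoise(active)      # unconditional prediction
    guided = list(previous)
    for name, weight in zip(groups, weights):
        active = active | {name}    # switch on one more conditioning group
        current = denoise(active)
        # Add this group's contribution, scaled by its own weight.
        guided = [g + weight * (c - p)
                  for g, c, p in zip(guided, current, previous)]
        previous = current
    return guided


def toy_denoiser(active: frozenset) -> Vector:
    # Stand-in for a real network (assumption): each active condition
    # shifts the estimate by a fixed amount.
    shift = {"person": 1.0, "garment": 2.0}
    return [sum(shift[name] for name in active)] * 3


if __name__ == "__main__":
    # Heavier garment weight pushes the result further along the garment term.
    print(split_cfg(toy_denoiser, ["person", "garment"], [1.0, 3.0]))
```

With all weights set to 1, the sum telescopes and the result equals the fully conditioned prediction. Raising a single weight strengthens only that group's influence, which is the independent control that the summary attributes to Split-CFG.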

Fashion-VDM: Video Diffusion Model for Virtual Try-On

ViViD: Video Virtual Try-on using Diffusion Models

WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Improving Diffusion Models for Virtual Try-on

Toward Realistic Virtual Try-on Through Landmark Guided Shape Matching

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

A Two-stage Personalized Virtual Try-on Framework with Shape Control and Texture Guidance

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

Improving Virtual Try-On with Garment-focused Diffusion Models

M&M VTO: Multi-Garment Virtual Try-On and Editing

ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-On

PFDM: Parser-Free Virtual Try-on via Diffusion Model

VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

MM-VTON: A Multi-stage Virtual Try-on Method Using Multiple Image Features

WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on

PG-VTON: A Novel Image-Based Virtual Try-On Method Via Progressive Inference Paradigm

ClothFormer: Taming Video Virtual Try-on in All Module