Abstract: Video virtual try-on aims to transfer a clothing item onto the video of a target person. Directly applying image-based try-on techniques to the video domain in a frame-wise manner causes temporally inconsistent outcomes, while previous video-based try-on solutions can only generate low-quality, blurry results. In this work, we present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture, and insert hierarchical Temporal Modules into the text-to-image stable diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset, which is the largest, with the most diverse types of garments and the highest resolution for the task of video virtual try-on to date. Extensive experiments demonstrate that our approach is able to yield satisfactory video try-on results. The dataset, codes, and weights will be publicly available. Project page: https://becauseimbatman0.github.io/ViViD

What problem does this paper attempt to address?

This paper addresses the problem of video virtual try-on. Existing image virtual try-on techniques, when applied directly to videos, lead to temporal inconsistency, while existing video virtual try-on methods, although capable of generating videos, produce low visual quality and blurriness. To solve these problems, the authors propose ViViD (Video Virtual Try-on using Diffusion Models), a new framework that uses diffusion models to generate high-quality, natural, and coherent video virtual try-on results.

### Main Issues

1. **Temporal Consistency Issue**: Directly applying image virtual try-on techniques to videos results in temporal inconsistencies, causing flickering and artifacts.
2. **Low Visual Quality**: Existing video virtual try-on methods generate videos with low quality and unclear details.
3. **Insufficient Dataset**: The lack of high-quality, high-resolution, and diverse video virtual try-on datasets limits the model's learning capability.

### Solutions

1. **ViViD Framework**:
   - **Garment Encoder**: Extracts fine-grained garment semantic features, guiding the model to capture garment details and inject them into the target video through an attention feature fusion mechanism.
   - **Pose Encoder**: A lightweight pose encoder that encodes pose signals, enabling the model to learn the interaction between garments and human poses.
   - **Temporal Modules**: Inserts hierarchical temporal modules into the text-to-image stable diffusion model to ensure spatial and temporal consistency, generating more coherent and realistic videos.
2. **New Dataset**:
   - The authors constructed a new dataset, ViViD, containing 9,700 pairs of high-resolution (832×624) garment-video samples, totaling 1,213,694 frames. This is currently the largest, most diverse, and highest-resolution video virtual try-on dataset.
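To make the idea of a temporal module more concrete, the following is a minimal NumPy sketch of generic temporal self-attention: each spatial location is treated as its own sequence over the frame axis, so features at the same position can exchange information across time. This is only an illustration of the general technique under stated assumptions; the summary does not specify how ViViD's Temporal Modules are built, and the function name `temporal_self_attention`, the single-head design, and the projection matrices `w_q`, `w_k`, `w_v` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_self_attention(feats, w_q, w_k, w_v):
    """Single-head self-attention over the frame axis (illustrative only).

    feats: array of shape (frames, channels, height, width)
    w_q, w_k, w_v: hypothetical projection matrices of shape (channels, channels)
    """
    f, c, h, w = feats.shape
    # Each pixel location becomes an independent sequence of length `frames`.
    tokens = feats.transpose(2, 3, 0, 1).reshape(h * w, f, c)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Attention weights between every pair of frames: (h*w, frames, frames).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)
    attn = softmax(scores, axis=-1)
    mixed = attn @ v  # (h*w, frames, channels)
    mixed = mixed.reshape(h, w, f, c).transpose(2, 3, 0, 1)
    # Residual connection: the spatial features pass through unchanged,
    # and the temporal mixing is added on top.
    return feats + mixed


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 4, 4))  # 8 frames, 16 channels, 4x4 map
w_q, w_k, w_v = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out = temporal_self_attention(feats, w_q, w_k, w_v)
print(out.shape)  # (8, 16, 4, 4)
```

The residual form means that a zero value projection leaves the per-frame features untouched, which is one plausible reason such layers can be added to a pretrained image model without disturbing it at the start of training.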
### Experimental Validation

- **Qualitative Results**: Compared with existing methods (such as FateZero, OOTDiffusion, StableVITON, and ClothFormer), ViViD excels in generating high-quality, temporally consistent videos.
- **Quantitative Results**: Evaluated using metrics such as Structural Similarity Index (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Video Fréchet Inception Distance (VFID), ViViD outperforms other methods across multiple metrics.

### Contributions

1. Proposed a new architecture utilizing powerful diffusion models to generate high-quality video virtual try-on results.
2. Introduced pose encoders and temporal modules to improve temporal consistency.
3. Constructed a multi-category, high-quality dataset, ViViD, containing 9,700 pairs of garment-video samples.
4. Validated the effectiveness of the method through quantitative and qualitative experiments.

Overall, this paper addresses the temporal consistency and visual quality issues in video virtual try-on by proposing the ViViD framework and constructing a high-quality dataset, providing new directions and tools for research in this field.
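As a rough illustration of how a frame-level metric such as SSIM can be averaged over a video, here is a simplified sketch. It computes SSIM from global image statistics rather than the sliding local windows used by standard implementations, so its values will not match published numbers; it is not the paper's evaluation code. The names `global_ssim` and `video_ssim` are hypothetical, and the constants follow the usual SSIM convention of `(0.01 * L)**2` and `(0.03 * L)**2` for data range `L`.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """SSIM from whole-image statistics (simplified: no local windows)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance * structure


def video_ssim(pred, target, data_range=1.0):
    """Mean of per-frame SSIM; pred and target have shape (frames, H, W)."""
    scores = [global_ssim(p, t, data_range) for p, t in zip(pred, target)]
    return float(np.mean(scores))


rng = np.random.default_rng(0)
target = rng.random((4, 32, 32))  # 4 grayscale frames in [0, 1]
noisy = np.clip(target + 0.1 * rng.standard_normal(target.shape), 0.0, 1.0)

same_score = video_ssim(target, target)   # identical videos score 1.0
noisy_score = video_ssim(noisy, target)   # a degraded video scores below 1.0
print(same_score, noisy_score)
```

Note that a per-frame average like this says nothing about flicker between frames, which is why video-level metrics such as VFID are reported alongside it.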

ViViD: Video Virtual Try-on using Diffusion Models

Fashion-VDM: Video Diffusion Model for Virtual Try-On

WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism

SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

GPD-VVTO: Preserving Garment Details in Video Virtual Try-On

ClothFormer: Taming Video Virtual Try-on in All Module

VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Improving Virtual Try-On with Garment-focused Diffusion Models

Improving Diffusion Models for Virtual Try-on

Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

A Two-stage Personalized Virtual Try-on Framework with Shape Control and Texture Guidance

Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

IMAGDressing-v1: Customizable Virtual Dressing

A 3D Virtual Try-On Method with Global-Local Alignment and Diffusion Model

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

Enhancing consistency in virtual try-on: A novel diffusion-based approach