Abstract: Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, which rely on warping and blending, struggle with complex human movements and occlusions, limiting their effectiveness in video try-on applications. Moreover, video-based models require extensive, high-quality data and substantial computational resources. To tackle these issues, we reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach. This model, conditioned on specific garments and individuals, is trained on still images rather than videos. It leverages diffusion guidance from pre-trained models, including a video masked autoencoder to improve segment smoothness and a self-supervised model to align the features of adjacent frames in the latent space. This integration markedly boosts the model's ability to maintain temporal coherence, enabling more effective video try-on within an image-based framework. Our experiments on the VITON-HD and DressCode datasets, along with tests on the VVT and TikTok datasets, demonstrate WildVidFit's capability to generate fluid and coherent videos. The project page is at http://wildvidfit-project.github.io.

What problem does this paper attempt to address?

This paper addresses several key problems in video virtual try-on: how to generate realistic video sequences that preserve the garment's appearance while adapting to the pose and body shape of the person in the source video. Specifically, the paper focuses on the following aspects:

1. **Complex human motion and occlusion handling**: Traditional image-based methods rely on warping and blending and perform poorly on complex limb motion and occlusions, which limits their effectiveness in video try-on applications. The paper aims to overcome this by using an image-based controlled diffusion model.
2. **Data and computing resource requirements**: Video-based models require large amounts of high-quality data and substantial computing resources. The paper proposes a framework that is trained only on still images and needs no video training, which reduces both requirements.
3. **Temporal coherence**: To generate smooth and coherent videos, the paper introduces a diffusion guidance module that uses a pre-trained video masked autoencoder to improve the smoothness of video segments and a self-supervised model to align the features of adjacent frames, thereby enhancing temporal consistency.
4. **Generalization ability**: According to the paper, the method handles "in the wild" videos, including complex dance movements and dynamic poses, which it attributes to the method's robustness and generalization in complex scenes.

In summary, the main objective of this paper is to develop an efficient, robust method that generates high-quality video virtual try-on results in complex environments, addressing the shortcomings of existing methods in handling complex motion, data requirements, and temporal coherence.
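The temporal-coherence idea in item 3 can be illustrated with a toy sketch. This is not WildVidFit's implementation: the paper's guidance comes from a pre-trained video masked autoencoder and a self-supervised feature extractor, whereas the example below substitutes a simple squared-difference energy computed directly on per-frame latents. The function names, the array shape, and the `scale` parameter are all hypothetical, chosen only to show how a guidance gradient can nudge adjacent frames toward each other.

```python
import numpy as np


def adjacent_frame_energy(latents):
    """Toy stand-in for a feature-alignment loss (assumption, not the paper's).

    latents has shape (num_frames, dim); the energy is the sum of squared
    differences between consecutive frames.
    """
    diffs = latents[1:] - latents[:-1]
    return float(np.sum(diffs ** 2))


def adjacent_frame_grad(latents):
    """Analytic gradient of adjacent_frame_energy with respect to latents."""
    diffs = latents[1:] - latents[:-1]
    grad = np.zeros_like(latents)
    grad[:-1] -= 2.0 * diffs  # each frame is pulled toward its successor
    grad[1:] += 2.0 * diffs   # each frame is pulled toward its predecessor
    return grad


def guided_step(latents, scale=0.1):
    """One hypothetical guidance update: move latents down the energy gradient."""
    return latents - scale * adjacent_frame_grad(latents)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(8, 4))  # 8 frames, 4-dimensional toy latents
    smoothed = guided_step(frames)
    print(adjacent_frame_energy(frames), adjacent_frame_energy(smoothed))
```

In a real sampler, an update of this kind would be interleaved with the denoising steps, and the energy would be computed on features from the pre-trained models rather than on raw latents. In this toy version the update leaves the per-dimension mean over frames unchanged, so it smooths the sequence without shifting its overall content.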

WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

ViViD: Video Virtual Try-on using Diffusion Models

Fashion-VDM: Video Diffusion Model for Virtual Try-On

VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

ClothFormer: Taming Video Virtual Try-on in All Module

Toward Realistic Virtual Try-on Through Landmark Guided Shape Matching

OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

A Two-stage Personalized Virtual Try-on Framework with Shape Control and Texture Guidance

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

VITON: An Image-based Virtual Try-on Network

Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

Improving Diffusion Models for Virtual Try-on

Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

PG-VTON: A Novel Image-Based Virtual Try-On Method Via Progressive Inference Paradigm

Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization

Improving Virtual Try-On with Garment-focused Diffusion Models