WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

Zijian He, Peixin Chen, Guangrun Wang, Guanbin Li, Philip H.S. Torr, Liang Lin
2024-07-15
Abstract: Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions, limiting their effectiveness in video try-on applications. Moreover, video-based models require extensive, high-quality data and substantial computational resources. To tackle these issues, we reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach. This model, conditioned on specific garments and individuals, is trained on still images rather than videos. It leverages diffusion guidance from pre-trained models, including a video masked autoencoder for segment smoothness improvement and a self-supervised model for feature alignment of adjacent frames in the latent space. This integration markedly boosts the model's ability to maintain temporal coherence, enabling more effective video try-on within an image-based framework. Our experiments on the VITON-HD and DressCode datasets, along with tests on the VVT and TikTok datasets, demonstrate WildVidFit's capability to generate fluid and coherent videos. The project page is at http://wildvidfit-project.github.io.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several problems in video virtual try-on (VVT). All of them concern how to generate realistic video sequences that preserve the garment's identity while adapting to the pose and body shape of the person in the source video. Specifically, the paper focuses on the following aspects:

1. **Complex human motion and occlusion handling**: Traditional image-based methods rely on warping and blending, and they perform poorly under complex limb motion and occlusion, which limits their usefulness for video try-on. The paper instead uses an image-based controlled diffusion model to handle these cases.
2. **Data and computing resource requirements**: Video-based models require large amounts of high-quality video data and substantial compute. The proposed framework is trained only on still images and needs no video training, which lowers both requirements.
3. **Temporal coherence**: To generate smooth and coherent videos, the paper introduces diffusion guidance from two pre-trained models: a video masked autoencoder that improves the smoothness of video segments, and a self-supervised model that aligns the features of adjacent frames in the latent space.
4. **Generalization ability**: The method targets videos "in the wild", including complex dance movements and dynamic poses. The abstract reports tests on the VVT and TikTok datasets in support of this.

In summary, the paper aims to develop an efficient, robust method that generates high-quality video virtual try-on results in complex environments, addressing the shortcomings of existing methods in handling complex motion, data requirements, and temporal coherence.
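
The temporal coherence point can be made concrete with a small sketch of feature-alignment guidance between adjacent frames. This is not the authors' implementation. It is an illustration under stated assumptions: the function names are hypothetical, a fixed linear projection `proj` stands in for the frozen self-supervised feature extractor, squared L2 distance is assumed as the alignment loss, and the gradient is applied directly to the per-frame latents rather than inside a real diffusion sampler.

```python
def features(latent, proj):
    """Project one frame latent (length D) to features (length F).

    `proj` is an F x D list of lists; it is a hypothetical stand-in
    for a frozen pre-trained feature extractor.
    """
    return [sum(w * x for w, x in zip(row, latent)) for row in proj]


def adjacent_frame_loss(latents, proj):
    """Sum of squared feature differences between neighbouring frames."""
    feats = [features(z, proj) for z in latents]
    return sum(
        (b - a) ** 2
        for prev, nxt in zip(feats, feats[1:])
        for a, b in zip(prev, nxt)
    )


def adjacent_frame_grad(latents, proj):
    """Analytic gradient of adjacent_frame_loss with respect to the latents."""
    feats = [features(z, proj) for z in latents]
    n_feat, dim = len(proj), len(latents[0])
    grad_feats = [[0.0] * n_feat for _ in feats]
    for t in range(len(feats) - 1):
        for k in range(n_feat):
            d = 2.0 * (feats[t + 1][k] - feats[t][k])
            grad_feats[t + 1][k] += d  # derivative w.r.t. the later frame
            grad_feats[t][k] -= d      # derivative w.r.t. the earlier frame
    # Chain rule through the linear projection back to latent space.
    return [
        [sum(g[k] * proj[k][j] for k in range(n_feat)) for j in range(dim)]
        for g in grad_feats
    ]


def guided_step(latents, proj, scale):
    """Move every frame latent a small step against the alignment gradient."""
    grads = adjacent_frame_grad(latents, proj)
    return [
        [x - scale * g for x, g in zip(z, gz)]
        for z, gz in zip(latents, grads)
    ]


# Toy usage: three 2-D frame latents and an identity projection.
frames = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
smoothed = guided_step(frames, identity, scale=0.1)
print(adjacent_frame_loss(frames, identity), adjacent_frame_loss(smoothed, identity))
```

In this toy case one guided step lowers the alignment loss from 3.0 to about 0.72, pulling neighbouring frames closer together in feature space. In an actual sampler a gradient of this kind would typically be combined with the denoising update at each step, with the guidance scale balancing smoothness against fidelity to each frame's own conditioning.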