Abstract: Video Virtual Try-On aims to transfer a garment onto a person in a video. Previous methods typically focus on image-based virtual try-on, and applying them directly to videos often leads to temporal discontinuity because of inconsistencies between frames. The few existing attempts at video virtual try-on also suffer from unrealistic results and poor generalization. In light of previous research, we posit that the task of video virtual try-on can be decomposed into two key requirements: (1) single-frame results must be realistic and natural while remaining consistent with the garment; (2) the person's actions and the garment must be coherent throughout the entire video. To address these two requirements, we propose a novel two-stage framework based on the Latent Diffusion Model, namely Garment-Preserving Diffusion for Video Virtual Try-On (GPD-VVTO). In the first stage, the model is trained on single-frame data to improve its ability to generate high-quality try-on images. We integrate both low-level texture features and high-level semantic features of the garment into the denoising network to preserve garment details while ensuring a natural fit between the garment and the person. In the second stage, the model is trained on video data to enhance temporal consistency. We devise a novel Garment-aware Temporal Attention (GTA) module that incorporates garment features into temporal attention, enabling the model to maintain fidelity to the garment during temporal modeling. Furthermore, we collect a video virtual try-on dataset containing high-resolution videos from diverse scenes, addressing the limited variety of video backgrounds and human actions in current datasets. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on both image-based and video-based virtual try-on tasks, indicating the effectiveness of the proposed framework.
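The abstract does not specify how the GTA module combines garment features with temporal attention. As a rough illustration only, the sketch below shows one plausible reading: at each spatial position, the frame features attend over time, and a garment feature is appended as an extra key/value token so that every frame can also attend to the garment. The function name, the tensor shapes, the single-head formulation, and the assumption that garment features are spatially aligned with frame features are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def garment_aware_temporal_attention(frames, garment, w_q, w_k, w_v):
    """Hypothetical single-head temporal attention with a garment token.

    frames : (T, N, C) features for T frames, N spatial positions, C channels
    garment: (N, C) garment features (assumed aligned with frame positions)
    w_q, w_k, w_v: (C, C) projection matrices
    Returns the attended features (T, N, C) and the attention map (N, T, T+1).
    """
    T, N, C = frames.shape
    # Attend along time independently at each spatial position: (N, T, C).
    x = frames.transpose(1, 0, 2)
    # Append the garment feature as one extra key/value token: (N, T+1, C).
    kv = np.concatenate([x, garment[:, None, :]], axis=1)
    q = x @ w_q
    k = kv @ w_k
    v = kv @ w_v
    # Scaled dot-product scores over the T frames plus the garment token.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)
    attn = softmax(scores, axis=-1)
    out = attn @ v
    return out.transpose(1, 0, 2), attn


# Toy usage with random features (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, N, C = 4, 6, 8
frames = rng.standard_normal((T, N, C))
garment = rng.standard_normal((N, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))
out, attn = garment_aware_temporal_attention(frames, garment, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

In this sketch the last column of the attention map holds the weight each frame places on the garment token, which is one simple way garment information could stay available during temporal modeling; the actual module may differ substantially.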

GPD-VVTO: Preserving Garment Details in Video Virtual Try-On

Fashion-VDM: Video Diffusion Model for Virtual Try-On

ViViD: Video Virtual Try-on using Diffusion Models

GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning

Improving Virtual Try-On with Garment-focused Diffusion Models

Toward Realistic Virtual Try-on Through Landmark Guided Shape Matching

PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-On

Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On

A Two-stage Personalized Virtual Try-on Framework with Shape Control and Texture Guidance

ClothFormer: Taming Video Virtual Try-on in All Module

VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Improving Diffusion Models for Virtual Try-on

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Enhancing consistency in virtual try-on: A novel diffusion-based approach

PG-VTON: A Novel Image-Based Virtual Try-On Method Via Progressive Inference Paradigm

DP-VTON: Toward Detail-Preserving Image-Based Virtual Try-on Network

SPG-VTON: Semantic Prediction Guidance for Multi-pose Virtual Try-on