VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, Xiaodan Liang

2024-06-08

Abstract: Video try-on stands as a promising area for its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, while underperforming on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatial-Temporal denoising DiT, and an identity preservation ControlNet. To faithfully recover the clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Unlike existing attempts that require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability, VITON-DiT alleviates this by relying solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporally consistent try-on results for in-the-wild videos with complicated human poses.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The main goal of this paper is to achieve video virtual try-on in complex real-world scenarios. Existing video try-on methods are mainly limited to transferring product clothing images onto videos of people with simple poses and backgrounds, and they perform poorly on casually captured videos. To address this, the authors propose VITON-DiT, a video try-on framework based on the Diffusion Transformer (DiT). Its main contributions are:

1. **Proposing the first video try-on network based on DiT**: VITON-DiT generates spatio-temporally consistent try-on results for complex real-world videos.
2. **Designing an attention fusion algorithm**: This algorithm connects the garment extractor with the denoising DiT and the identity-preservation (ID) ControlNet, so that clothing details are accurately recovered in the video.
3. **Developing new random selection strategies and an Interpolated Auto-Regressive (IAR) technique**: The random selection strategies are applied during training and IAR at inference; together they help generate high-quality videos lasting up to several tens of seconds.
4. **Scalability to unpaired human video data**: VITON-DiT requires only unpaired human dance videos and improves data utilization through a multi-stage training strategy.
5. **Creating a new video try-on benchmark dataset**: This dataset is used to evaluate try-on performance on casually captured videos with complex backgrounds.

In summary, VITON-DiT aims to improve the quality and practicality of video virtual try-on, particularly in complex scenarios and with diverse clothing types.
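To make the attention fusion idea (contribution 2) more concrete, here is a minimal sketch in plain Python. The summary says only that garment features are fused with the self-attention outputs; the additive form used here, the absence of learned projections, and the names `attention` and `fused_self_attention` are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of token vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def fused_self_attention(frame_tokens, garment_tokens):
    """Hypothetical additive fusion (an assumption, not taken from the paper).

    The frame tokens attend to themselves (ordinary self-attention) and,
    separately, to the garment tokens; the two outputs are summed.
    """
    self_out = attention(frame_tokens, frame_tokens, frame_tokens)
    garment_out = attention(frame_tokens, garment_tokens, garment_tokens)
    return [
        [s + g for s, g in zip(self_row, garment_row)]
        for self_row, garment_row in zip(self_out, garment_out)
    ]


# Toy example: two frame tokens and two garment tokens, each with two channels.
frame_tokens = [[1.0, 0.0], [0.0, 1.0]]
garment_tokens = [[0.2, 0.3], [0.4, 0.1]]
fused = fused_self_attention(frame_tokens, garment_tokens)
```

The fused output keeps the shape of the frame tokens, so under these assumptions the fusion could be dropped into a transformer block without changing the layers that follow it. A real model would apply learned query, key, and value projections and operate on batched tensors.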

Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism

WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

ViViD: Video Virtual Try-on using Diffusion Models

Fashion-VDM: Video Diffusion Model for Virtual Try-On

FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on

GPD-VVTO: Preserving Garment Details in Video Virtual Try-On

SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

Dressing in the Wild by Watching Dance Videos

OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on

ClothFormer: Taming Video Virtual Try-on in All Module

VITON: An Image-based Virtual Try-on Network

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Toward Realistic Virtual Try-on Through Landmark Guided Shape Matching

Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Improving Diffusion Models for Virtual Try-on

ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-On