Abstract: Virtual Try-On (VTON) aims to synthesize images of a specific person dressed in given garments, and has recently received considerable attention in online shopping scenarios. Currently, the core challenges of the VTON task lie mainly in fine-grained semantic extraction (i.e., deep semantics) of the given reference garments during depth estimation, and in effective texture preservation when the garments are synthesized and warped onto the human body. To cope with these issues, we propose DH-VTON, a deep text-driven virtual try-on model featuring a special hybrid attention learning strategy and a deep garment semantic preservation module. We build the DH-VTON pipeline on top of a well-established pre-trained paint-by-example (abbr. PBE) approach. Specifically, to extract the deep semantics of the garments, we first introduce InternViT-6B as a fine-grained feature learner, which can be trained to align large-scale intrinsic knowledge with deep text semantics (e.g., "neckline" or "girdle"), making up for the deficiency of the commonly adopted CLIP encoder. Based on this, to enhance the customized dressing abilities, we further introduce the Garment-Feature ControlNet Plus (abbr. GFC+) module and propose a new hybrid attention strategy for training, which adaptively integrates fine-grained characteristics of the garments into different layers of the VTON model, so as to achieve multi-scale feature preservation. Extensive experiments on several representative datasets demonstrate that our method outperforms previous diffusion-based and GAN-based approaches, showing competitive performance in preserving garment details and generating authentic human images.

What problem does this paper attempt to address?

This paper addresses two core challenges in virtual try-on (VTON):

1. **Fine-grained Semantic Extraction**: How to effectively extract fine-grained semantic information (such as "neckline" or "girdle") from the given reference clothing during the depth estimation process. Existing methods usually rely on the CLIP encoder, whose features are relatively coarse and fall short of what fine-grained semantic extraction requires.
2. **Texture Detail Preservation**: How to effectively preserve the texture and pattern details of the clothing when it is synthesized and warped onto the human body. Existing methods also have deficiencies in this regard, especially when dealing with complex and detailed clothing features.

To solve these problems, the authors propose DH-VTON, a deep text-driven virtual try-on model with the following features:

- **Hybrid Attention Learning Strategy**: A new hybrid attention mechanism adaptively integrates the fine-grained features of the clothing into different layers of the VTON model, thereby achieving multi-scale feature preservation.
- **Deep Clothing Semantic Preservation Module**: InternViT-6B is introduced as a fine-grained feature learner to extract the deep semantic information of the clothing and make up for the deficiencies of the CLIP encoder.
- **Enhanced Customized Try-on Ability**: The Garment-Feature ControlNet Plus (GFC+) module is designed to further enhance the model's customized try-on ability, especially in preserving the texture and pattern of the given clothing.

In extensive experiments on several representative datasets, DH-VTON outperforms previous diffusion-based and GAN-based approaches in generating realistic images and preserving clothing details.
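The abstract does not spell out the form of the hybrid attention, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "hybrid" means blending ordinary self-attention over the person tokens with cross-attention from those tokens to garment tokens. The names `attend`, `hybrid_attention`, and the `gate` parameter are hypothetical and introduced purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def hybrid_attention(person_tokens, garment_tokens, gate=0.5):
    """Blend self-attention with cross-attention to garment tokens.

    ASSUMPTION: the fixed scalar `gate` is a stand-in for whatever
    learned, adaptive mixing the paper actually uses.
    """
    self_out = attend(person_tokens, person_tokens, person_tokens)
    cross_out = attend(person_tokens, garment_tokens, garment_tokens)
    return [
        [(1.0 - gate) * s + gate * c for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(self_out, cross_out)
    ]


# Toy inputs: two person tokens and three garment tokens, all of width 2.
person = [[1.0, 0.0], [0.0, 1.0]]
garment = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
mixed = hybrid_attention(person, garment, gate=0.3)
print(len(mixed), len(mixed[0]))  # 2 2
```

In a multi-scale setting, a block like this would be applied at several layers of the denoising network, each with garment features of matching resolution; how DH-VTON wires this up in practice is described in the paper itself, not here.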

DH-VTON: Deep Text-Driven Virtual Try-On via Hybrid Attention Learning

Toward Detail-Oriented Image-Based Virtual Try-On with Arbitrary Poses

DP-VTON: Toward Detail-Preserving Image-Based Virtual Try-on Network

VTON-HF: High Fidelity Virtual Try-on Network Via Semantic Adaptation

Toward Realistic Virtual Try-on Through Landmark Guided Shape Matching

StyleVTON: A multi-pose virtual try-on with identity and clothing detail preservation

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Context-Aware Enhanced Virtual Try-On Network with Fabric Adaptive Registration

D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning

SPG-VTON: Semantic Prediction Guidance for Multi-pose Virtual Try-on

Slot-VTON: Subject-Driven Diffusion-Based Virtual Try-on with Slot Attention

Towards Multi-pose Guided Virtual Try-on Network

IMAGDressing-v1: Customizable Virtual Dressing

ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-On

C-VTON: Context-Driven Image-Based Virtual Try-On Network

Enhancing consistency in virtual try-on: A novel diffusion-based approach

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

PG-VTON: A Novel Image-Based Virtual Try-On Method Via Progressive Inference Paradigm

PF-VTON: Toward High-Quality Parser-Free Virtual Try-On Network