DH-VTON: Deep Text-Driven Virtual Try-On via Hybrid Attention Learning

Jiabao Wei, Zhiyuan Ma
2024-10-16
Abstract: Virtual Try-On (VTON) aims to synthesize images of a specific person dressed in given garments, and has recently received considerable attention in online shopping scenarios. Currently, the core challenges of the VTON task lie mainly in the fine-grained semantic extraction (i.e., deep semantics) of the given reference garments during depth estimation and in effective texture preservation when the garments are synthesized and warped onto the human body. To cope with these issues, we propose DH-VTON, a deep text-driven virtual try-on model featuring a special hybrid attention learning strategy and a deep garment semantic preservation module. We build the DH-VTON pipeline on a well-established pre-trained paint-by-example (abbr. PBE) approach. Specifically, to extract the deep semantics of the garments, we first introduce InternViT-6B as a fine-grained feature learner, which can be trained to align its large-scale intrinsic knowledge with deep text semantics (e.g., "neckline" or "girdle") to make up for the deficiency of the commonly adopted CLIP encoder. Based on this, to enhance the customized dressing abilities, we further introduce a Garment-Feature ControlNet Plus (abbr. GFC+) module and propose a new hybrid attention strategy for training, which can adaptively integrate fine-grained characteristics of the garments into the different layers of the VTON model, so as to achieve multi-scale feature preservation. Extensive experiments on several representative datasets demonstrate that our method outperforms previous diffusion-based and GAN-based approaches, showing competitive performance in preserving garment details and generating authentic human images.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to address two core challenges in virtual try-on (VTON):

1. **Fine-grained Semantic Extraction**: How to effectively extract fine-grained semantic information (such as "neckline" or "girdle") from the given reference garment during the depth estimation process. Existing methods usually rely on a CLIP encoder, whose features are relatively coarse and fall short of what fine-grained semantic extraction requires.
2. **Texture Detail Preservation**: How to effectively preserve the texture and pattern details of the garment when it is synthesized and warped onto the human body. Existing methods are also deficient here, especially for garments with complex, detailed features.

To solve these problems, the authors propose DH-VTON, a deep text-driven virtual try-on model with the following features:

- **Hybrid Attention Learning Strategy**: A new hybrid attention mechanism adaptively integrates the fine-grained features of the garment into different layers of the VTON model, thereby achieving multi-scale feature preservation.
- **Deep Garment Semantic Preservation Module**: For the first time, InternViT-6B is introduced into the VTON task as a fine-grained feature learner that extracts the deep semantic information of the garment and makes up for the deficiencies of the CLIP encoder.
- **Enhanced Customized Try-on Ability**: The Garment-Feature ControlNet Plus (GFC+) module is designed to further enhance the model's customized try-on ability, especially in preserving the texture and pattern of the given garment.

Extensive experiments on multiple representative datasets show that DH-VTON outperforms previous diffusion-based and GAN-based approaches in generating realistic images and preserving garment details.
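
The summary does not say how the hybrid attention is actually computed, so the following is only a rough illustration of the general idea of integrating garment features into an attention layer. It assumes, purely for the sake of the sketch, that "hybrid" means each person token attends jointly to the person tokens and to the garment tokens. The function names (`attention`, `hybrid_attention`) and the toy vectors are hypothetical, and the learned query/key/value projections of a real model are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


def hybrid_attention(person_tokens, garment_tokens):
    """ASSUMPTION (not from the paper): person tokens attend to the
    concatenation of person tokens and garment tokens, so garment detail
    can flow into the person representation at this layer."""
    context = person_tokens + garment_tokens
    return attention(person_tokens, context, context)


# Toy inputs: two person tokens and one garment token, each 2-dimensional.
person = [[1.0, 0.0], [0.0, 1.0]]
garment = [[1.0, 1.0]]
fused = hybrid_attention(person, garment)
```

Under this assumption, applying such a layer at several resolutions of the denoising network would be one way to obtain the multi-scale behavior the summary describes. With no garment tokens, the function reduces to plain self-attention over the person tokens.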