Abstract: Image-based virtual try-on aims to fit an in-shop garment onto a clothed person image. Garment warping, which aligns the target garment with the corresponding body parts in the person image, is a crucial step in achieving this goal. Existing methods often use multi-stage frameworks to handle garment warping, body synthesis and try-on generation separately, or rely on noisy intermediate parser-based labels. We propose a novel single-stage framework that learns these steps implicitly, without explicit multi-stage learning. Our approach uses a novel semantic-contextual fusion attention module for garment-person feature fusion, enabling efficient and realistic cloth warping and body synthesis from target pose keypoints. By introducing a lightweight linear attention framework that attends to garment regions and fuses multiple sampled flow fields, we also address the misalignment and artifacts present in previous methods. To learn the warped garment and the try-on result simultaneously, we introduce a Warped Cloth Learning Module. Our proposed approach significantly improves the quality and efficiency of virtual try-on methods, providing users with a more reliable and realistic virtual try-on experience.

What problem does this paper attempt to address?

This paper addresses two weaknesses of existing virtual try-on methods: their reliance on multi-stage processing, and the inaccurate alignment, texture distortion and artifacts in their results. Existing methods usually adopt a multi-stage framework that handles clothing warping, human body synthesis and generation of the final try-on result separately, which reduces efficiency and degrades the output. In addition, many methods rely on intermediate parsing labels (such as human body parsing or dense pose), and these labels can introduce noise that lowers the quality of the final result.

To solve these problems, the paper proposes a novel single-stage framework that implicitly learns clothing warping and human body synthesis without explicit multi-stage learning. Its main contributions are:

1. **Warped Cloth Learning Module (WCLM)**: This module jointly learns the warped clothing, human body synthesis and the final try-on result as a single learning process.
2. **Lightweight Linear Attention Framework (LAF)**: This framework attends to the clothing region and fuses multiple sampled flow fields to learn the optimal implicit clothing flow, thereby reducing alignment errors and artifacts.
3. **Semantic-Contextual Fusion Attention (SCFA)**: This module improves feature fusion by capturing the semantic information and contextual relationship between the clothing and the human body, making clothing warping more natural and realistic.

With these improvements, the method achieves state-of-the-art results on the VITON dataset, significantly improving the quality and efficiency of virtual try-on and providing users with a more reliable and realistic virtual try-on experience.

### Formula Summary

1. **Linear Self-Attention Mechanism**:
\[ \text{head}_i=\text{Attention}(QW_i^Q, E_iKW_i^K, F_iVW_i^V)=\text{softmax}\left(\frac{QW_i^Q(E_iKW_i^K)^T}{\sqrt{d_k}}\right)\cdot(F_iVW_i^V) \]
where \(E_i\) and \(F_i\) are linear projection matrices, and \(KW_i^K\) and \(VW_i^V\) are the embedding layers of keys and values respectively.

2. **Attention Weight Calculation**:
\[ \text{Attention Weight}_{\text{person}}=\text{softmax}\left(\frac{E_{\text{garment}}\cdot(E_{\text{person}})^T}{\sqrt{C}}\right) \]
\[ \text{Attention Weight}_{\text{garment}}=\text{softmax}\left(\frac{E_{\text{person}}\cdot(E_{\text{garment}})^T}{\sqrt{C}}\right) \]

3. **Loss Function**:
- **L1 Loss**: \[ L_{\text{L1}} = ||I_p^{\text{output}}-I_p||_1+L_{\text{warp}} \]
- **Perceptual Loss**: \[ L_{\text{perc}}=\sum_{i = 1}^{5}||\phi_i(I_p^{\text{output}})-\phi_i(I_p)||_1 \]
- **Style Loss**: \[ L_{\text{style}}=\sum_{i}||G_{\phi_i}(I_p^{
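
The linear self-attention head and the two attention-weight formulas in the summary above can be sketched in NumPy. This is a minimal illustration rather than the paper's implementation: the function names, tensor shapes and random inputs are assumptions made for the example, as is the choice to apply the softmax row-wise over the last axis.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def linear_attention_head(Q, K, V, W_q, W_k, W_v, E, F):
    """One attention head with keys/values projected along the token axis.

    Assumed shapes: Q, K, V are (n, d_model); W_q, W_k, W_v are (d_model, d_k);
    E and F are (k, n) and reduce the n key/value tokens to k rows.
    """
    q = Q @ W_q                # (n, d_k)
    k = E @ (K @ W_k)          # (k, d_k), projected keys
    v = F @ (V @ W_v)          # (k, d_k), projected values
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))  # (n, k) instead of (n, n)
    return weights @ v         # (n, d_k)


def cross_attention_weights(E_garment, E_person):
    """Attention weights between garment and person embeddings.

    Assumed shapes: E_garment is (n_g, C), E_person is (n_p, C).
    """
    C = E_garment.shape[-1]
    w_person = softmax(E_garment @ E_person.T / np.sqrt(C))   # (n_g, n_p)
    w_garment = softmax(E_person @ E_garment.T / np.sqrt(C))  # (n_p, n_g)
    return w_person, w_garment


# Toy inputs; every size below is hypothetical.
rng = np.random.default_rng(0)
n, d_model, d_k, k_proj = 64, 32, 16, 8
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
E = rng.standard_normal((k_proj, n))
F = rng.standard_normal((k_proj, n))
head = linear_attention_head(X, X, X, W_q, W_k, W_v, E, F)

E_garment = rng.standard_normal((10, 24))
E_person = rng.standard_normal((12, 24))
w_person, w_garment = cross_attention_weights(E_garment, E_person)
```

Because the keys and values are first reduced from `n` to `k_proj` rows, the score matrix in this sketch has shape `(n, k_proj)` rather than `(n, n)`, which is where the cost saving of this kind of linear attention comes from.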

Single Stage Warped Cloth Learning and Semantic-Contextual Attention Feature Fusion for Virtual TryOn

GraVITON: Graph based garment warping with attention guided inversion for Virtual-tryon

Toward Realistic Virtual Try-on Through Landmark Guided Shape Matching

SieveNet: A Unified Framework for Robust Image-Based Virtual Try-On

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Significance of Skeleton-based Features in Virtual Try-On

ClothFormer: Taming Video Virtual Try-on in All Module

Cloth Interactive Transformer for Virtual Try-On

Limb-Aware Virtual Try-On Network with Progressive Clothing Warping

Toward Accurate and Realistic Virtual Try-on Through Shape Matching and Multiple Warps

MT-VTON: Multilevel Transformation-Based Virtual Try-On for Enhancing Realism of Clothing

VTON-SCFA: A Virtual Try-On Network Based on the Semantic Constraints and Flow Alignment

VTNCT: an Image-Based Virtual Try-on Network by Combining Feature with Pixel Transformation

Enhancing consistency in virtual try-on: A novel diffusion-based approach

PG-VTON: A Novel Image-Based Virtual Try-On Method Via Progressive Inference Paradigm

LGVTON: A Landmark Guided Approach to Virtual Try-On

ClothFit: Cloth-Human-Attribute Guided Virtual Try-On Network Using 3D Simulated Dataset

High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions

GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning

Improving Diffusion Models for Virtual Try-on