Single Stage Warped Cloth Learning and Semantic-Contextual Attention Feature Fusion for Virtual TryOn

Sanhita Pathak, Vinay Kaushik, Brejesh Lall
DOI: https://doi.org/10.48550/arXiv.2310.05024
2024-05-25
Abstract: Image-based virtual try-on aims to fit an in-shop garment onto a clothed person image. Garment warping, which aligns the target garment with the corresponding body parts in the person image, is a crucial step in achieving this goal. Existing methods often use multi-stage frameworks to handle clothes warping, person body synthesis and try-on generation separately or rely on noisy intermediate parser-based labels. We propose a novel single-stage framework that implicitly learns the same without explicit multi-stage learning. Our approach utilizes a novel semantic-contextual fusion attention module for garment-person feature fusion, enabling efficient and realistic cloth warping and body synthesis from target pose keypoints. By introducing a lightweight linear attention framework that attends to garment regions and fuses multiple sampled flow fields, we also address misalignment and artifacts present in previous methods. To achieve simultaneous learning of warped garment and try-on results, we introduce a Warped Cloth Learning Module. Our proposed approach significantly improves the quality and efficiency of virtual try-on methods, providing users with a more reliable and realistic virtual try-on experience.
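The lightweight linear attention mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It assumes a Linformer-style formulation, in which two hypothetical projection matrices (`e_proj` and `f_proj`) shrink the keys and values from n positions to r positions before the softmax. The per-head weight matrices are assumed to be already applied to the inputs, and the sizes and random values are illustrative only, not taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply an (m x n) matrix by an (n x p) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row independently."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def linear_attention(q, k, v, e_proj, f_proj):
    """Attention with keys/values projected from n to r positions.

    q, k, v: (n x d) matrices, assumed to be already embedded per head.
    e_proj, f_proj: (r x n) projection matrices (hypothetical names), r << n.
    Returns the (n x d) output and the (n x r) attention weights.
    """
    d_k = len(q[0])
    k_small = matmul(e_proj, k)              # (r x d) compressed keys
    v_small = matmul(f_proj, v)              # (r x d) compressed values
    scores = matmul(q, transpose(k_small))   # (n x r) instead of (n x n)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, v_small), weights


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n, d, r = 8, 4, 2  # illustrative sizes only
q, k, v = rand_matrix(n, d), rand_matrix(n, d), rand_matrix(n, d)
e_proj, f_proj = rand_matrix(r, n), rand_matrix(r, n)
out, weights = linear_attention(q, k, v, e_proj, f_proj)
```

The score matrix here has shape n x r rather than n x n, which is where the cost saving comes from when r is much smaller than n.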
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two weaknesses of existing virtual try-on methods: their reliance on multi-stage processing, and the inaccurate alignment, texture distortion and artifacts in their results. Existing methods usually adopt a multi-stage framework that handles clothing warping, human body synthesis and generation of the final try-on result separately, which lowers efficiency and degrades the results. Many methods also rely on intermediate parsing labels (such as human body parsing or dense pose), which may introduce noise and affect the quality of the final result.

To solve these problems, the paper proposes a novel single-stage framework that implicitly learns clothing warping and human body synthesis without explicit multi-stage learning. Its main contributions include:

1. **Warped Cloth Learning Module (WCLM)**: This module jointly learns the warped clothing, human body synthesis and the final try-on result as a single learning process.
2. **Lightweight Linear Attention Framework (LAF)**: This framework attends to the clothing region and fuses multiple sampled flow fields to learn the optimal implicit clothing flow, thereby reducing alignment errors and artifacts.
3. **Semantic-Contextual Fusion Attention (SCFA)**: This module improves feature fusion by capturing the semantic information and contextual relationship between the clothing and the human body, making clothing warping more natural and realistic.

With these improvements, the method achieves state-of-the-art results on the VITON dataset, significantly improving the quality and efficiency of virtual try-on and providing users with a more reliable and realistic virtual try-on experience.

### Formula Summary

1. **Linear Self-Attention Mechanism**:
   \[ \text{head}_i=\text{Attention}(QW_i^Q, E_iKW_i^K, F_iVW_i^V)=\text{softmax}\left(\frac{QW_i^Q(E_iKW_i^K)^T}{\sqrt{d_k}}\right)\cdot F_iVW_i^V \]
   where \(E_i\) and \(F_i\) are linear projection matrices, and \(KW_i^K\) and \(VW_i^V\) are the embedded keys and values respectively.
2. **Attention Weight Calculation**:
   \[ \text{Attention Weight}_{\text{person}}=\text{softmax}\left(\frac{E_{\text{garment}}\cdot(E_{\text{person}})^T}{\sqrt{C}}\right) \]
   \[ \text{Attention Weight}_{\text{garment}}=\text{softmax}\left(\frac{E_{\text{person}}\cdot(E_{\text{garment}})^T}{\sqrt{C}}\right) \]
3. **Loss Function**:
   - **L1 Loss**: \[ L_{\text{L1}} = ||I_p^{\text{output}}-I_p||_1+L_{\text{warp}} \]
   - **Perceptual Loss**: \[ L_{\text{perc}}=\sum_{i = 1}^{5}||\phi_i(I_p^{\text{output}})-\phi_i(I_p)||_1 \]
   - **Style Loss**: \[ L_{\text{style}}=\sum_{i}||G_{\phi_i}(I_p^{