Abstract: Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems - e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.

What problem does this paper attempt to address?

This paper addresses multimodal tasks in the fashion domain, particularly vision-and-language problems in e-commerce applications. It targets three key issues:

1. **Data Limitations**: Prior work on multimodal fashion tasks is either limited by the data in individual benchmark datasets or relies on generic vision-and-language pre-training that does not exploit the characteristics of fashion data. This limits performance on fashion-specific tasks.
2. **Task Limitations**: Most existing work focuses on multimodal understanding tasks and overlooks generation tasks such as image captioning. Both kinds of task are crucial for building interactive multimodal shopping assistants, for example to retrieve clothing items from language queries or to describe products in detail.
3. **Challenges of Cross-Modal Tasks**: In image retrieval with text feedback, the model must retrieve a target image given a reference image and the user's language feedback. This requires understanding complex visual and language information and performing fine-grained matching.

To address these problems, the paper makes two main contributions:

1. **Novel Pre-training Framework**: A fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. The paper shows that the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
2. **Flexible Decoder Architecture**: A decoder-based model architecture that handles both fashion retrieval and captioning tasks.

Together, the model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization. Through these contributions, the paper aims to improve the performance of multimodal models in the fashion domain and thereby enhance the user's shopping experience.

FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
