FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning

Suvir Mirchandani, Licheng Yu, Mengjiao Wang, Animesh Sinha, Wenwen Jiang, Tao Xiang, Ning Zhang
DOI: https://doi.org/10.48550/arXiv.2210.15028
2022-10-27
Abstract: Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems - e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper targets multimodal tasks in the fashion domain, in particular the vision-and-language problems that arise in e-commerce applications. It identifies three main issues:

1. **Data limitations**: Existing work on multimodal fashion tasks is either limited by the data in individual benchmark datasets or relies on generic vision-and-language pre-training that does not exploit the characteristics of fashion data, which limits performance on fashion-specific tasks.
2. **Task limitations**: Most existing work focuses on multimodal understanding tasks and neglects generation tasks such as image captioning. Both kinds of task matter for interactive multimodal shopping assistants, for example retrieving a desired clothing item from a language query or describing a product in detail.
3. **Challenges of cross-modal tasks**: In image retrieval with text feedback, the model must retrieve a target image given a reference image and the user's language feedback. This requires understanding complex visual and language information and performing fine-grained matching between them.

To address these problems, the paper makes two main contributions:

1. **Novel pre-training framework**: A fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. The paper shows that the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
2. **Flexible decoder architecture**: A decoder-based model architecture that can handle both fashion retrieval tasks and captioning tasks.

Together, the model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization. Through these contributions, the paper aims to improve the performance of multimodal models in the fashion domain and thereby enhance the user's shopping experience.
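
The abstract does not spell out how the weakly-supervised triplets are built or which objectives are trained on them. As a rough illustration of what a triplet-based objective can look like, the sketch below implements a generic triplet margin loss over embedding vectors. It is an assumption-laden example, not the authors' method: the function names, the use of cosine distance, and the `margin` value are all hypothetical choices made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet margin loss (hypothetical stand-in, not from the paper).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it grows linearly with the gap.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings: a well-separated triplet and a badly ordered one.
well_separated = triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
badly_ordered = triplet_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In a retrieval setting, the anchor might be an embedding of a query (for instance a reference image combined with text feedback), the positive an embedding of the matching item, and the negative an embedding of a non-matching item. Whether FaD-VLP uses this particular form of loss is not stated in the abstract.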