Abstract: Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven effective in boosting various downstream V+L tasks. In the fashion domain, however, existing V+L methods are inadequate because they overlook the unique characteristics of both fashion V+L data and the downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains, where a V+L data point contains only a single image-text pair, a fashion data point may contain multiple images. We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image plus the text. Second, fashion text (e.g., a product description) often contains rich fine-grained concepts (attributes/noun phrases). To exploit this, we introduce a Pseudo-Attributes Classification task that encourages the learned unimodal (visual/textual) representations of the same concept to be adjacent. Further, fashion V+L tasks uniquely include ones that do not conform to the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We thus propose a flexible, versatile V+L model architecture built on a modality-agnostic Transformer, so that it can be adapted to any downstream task. Extensive experiments show that FashionViL achieves a new state of the art across five downstream tasks. Code is available at https://github.com/BrandonHanx/mmf.
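
The Multi-View Contrastive Learning idea in the abstract can be illustrated with a generic InfoNCE-style objective. The following is a minimal sketch under stated assumptions, not the FashionViL implementation: the function names, the symmetric two-direction form, and the temperature of 0.07 are all illustrative choices, and the embeddings are assumed to come from encoders that are not shown here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i].

    All other keys in the batch act as negatives. The temperature value
    is an assumption made for illustration.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]
    return total / len(q)


def multi_view_contrastive_loss(view_embeds, fused_embeds, temperature=0.07):
    """Hypothetical symmetric loss between two sets of embeddings.

    view_embeds[i]:  visual embedding of one image of product i.
    fused_embeds[i]: multimodal embedding of another image of product i
                     combined with its text.
    Averaging both directions is an assumption, not a detail from the abstract.
    """
    forward = info_nce(view_embeds, fused_embeds, temperature)
    backward = info_nce(fused_embeds, view_embeds, temperature)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.0, 1.0]]
    fused = [[0.9, 0.1], [0.1, 0.9]]
    print(multi_view_contrastive_loss(views, fused))
```

With correctly paired embeddings the loss is close to zero; swapping the pairing so that each view is matched with another product's fused embedding makes it much larger, which is the signal that pulls matching representations together during training.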

FashionKLIP: Enhancing E-Commerce Image-Text Retrieval with Fashion Multi-Modal Conceptual Knowledge Graph

Fashionsketch: An Interactive Sketch-Based Massive Image Retrieval For Fashion E-Business

Extending CLIP for Category-to-image Retrieval in E-commerce

Knowledge Perceived Multi-modal Pretraining in E-commerce

Interpretable Multimodal Retrieval for Fashion Products

Large Scale Pre-Trained Knowledge Graph Model and E-Commerce Application

FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning

Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce

Fashion Focus: Multi-modal Retrieval System for Video Commodity Localization in E-commerce

FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings

Fashion Image Retrieval with Text Feedback by Additive Attention Compositional Learning

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

PKGM: A Pre-trained Knowledge Graph Model for E-commerce Application

Efficient Text-Image Semantic Search: a Multi-modal Vision-Language Approach for Fashion Retrieval

Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval

UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback

Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce

FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval

End-to-end multi-modal product matching in fashion e-commerce