Abstract: The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. Rapid advances in AI-generated content, particularly large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models to the fashion domain. However, tasks involving embeddings, such as image-to-text or text-to-image retrieval, have been largely overlooked from this perspective because of the diverse nature of the multimodal fashion domain. Moreover, current research on multi-task single models lacks a focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval tasks and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and an LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks, and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain. The source code is available at https://github.com/xiangyu-mm/UniFashion.

What problem does this paper attempt to address?

The main problem this paper addresses is the limited ability of current multimodal fashion models to handle both generation and retrieval tasks. Specifically:

1. **Lack of a unified framework**: Existing fashion models usually focus on a single task (such as image generation or text retrieval). No unified framework handles multiple tasks at once, such as cross-modal retrieval, composed image retrieval, fashion image captioning, and fashion image generation.
2. **Neglect of embedding tasks**: Most existing models fail to handle embedding tasks effectively, such as image-to-text and text-to-image retrieval, which are very important in the fashion field.
3. **Insufficient generation ability**: Existing unified fashion models often cannot generate images, and perform especially poorly on tasks that require generating a target image from multimodal inputs.

To solve these problems, the authors propose **UniFashion**, a unified vision-language model framework that addresses multimodal generation and retrieval tasks in the fashion field by integrating large language models (LLMs) and diffusion models. The main contributions of UniFashion include:

- **The first in-depth study of jointly modeling multimodal retrieval and generation tasks in the fashion field**, which exploits the correlations between tasks and introduces a general, unified model to handle all fashion tasks.
- **Improved performance through mutual reinforcement between tasks**. Specifically, the caption generation module assists the composed image retrieval task, and jointly training the generation and retrieval tasks improves the multimodal encoder of the diffusion module.
- **Extensive experiments showing that**, on multiple fashion tasks (such as cross-modal retrieval, composed image retrieval, and multimodal generation), this unified model significantly outperforms previous state-of-the-art methods.

With these improvements, UniFashion not only performs well across multiple fashion tasks but also offers a promising direction for future research.
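The embedding tasks mentioned above (image-to-text and text-to-image retrieval) are commonly evaluated by ranking gallery items by cosine similarity to a query embedding and reporting recall at k. The sketch below illustrates that general idea only; it is not UniFashion's implementation. The toy vectors, the function names, and the `targets` list are all hypothetical stand-ins for the outputs of a real multimodal encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(query, gallery):
    """Cosine similarity between one query and every gallery embedding."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]


def retrieve_top_k(query, gallery, k=1):
    """Indices of the k gallery items most similar to the query."""
    scores = cosine_scores(query, gallery)
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_k(queries, gallery, targets, k=1):
    """Fraction of queries whose target index appears in the top-k results."""
    hits = sum(
        1 for q, t in zip(queries, targets) if t in retrieve_top_k(q, gallery, k)
    )
    return hits / len(queries)


# Hypothetical, hand-made embeddings standing in for encoder outputs.
text_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
image_embeddings = [[0.1, 0.0, 1.0], [1.0, 0.2, 0.1], [0.0, 1.0, 0.0]]
targets = [1, 0]  # assumed ground truth: text i matches image targets[i]

if __name__ == "__main__":
    print(retrieve_top_k(text_embeddings[0], image_embeddings, k=1))
    print(recall_at_k(text_embeddings, image_embeddings, targets, k=1))
```

Swapping the roles of `text_embeddings` and `image_embeddings` gives the image-to-text direction; in practice the embeddings would come from a trained model and the ranking would be computed with a vectorized library rather than pure Python.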

UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

Image Reference-guided Fashion Design with Structure-aware Transfer by Diffusion Models

Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond

FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing

Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings

Efficient Text-Image Semantic Search: a Multi-modal Vision-Language Approach for Fashion Retrieval

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Multi-Garment Customized Model Generation

Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval and Analysis

MMFL-Net: Multi-scale and Multi-granularity Feature Learning for Cross-domain Fashion Retrieval

FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion

Interpretable Multimodal Retrieval for Fashion Products

Towards Better Understanding the Clothing Fashion Styles: A Multimodal Deep Learning Approach

Fashion Meets Computer Vision

MMFashion: An Open-Source Toolbox for Visual Fashion Analysis

Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition

A Unified Model with Structured Output for Fashion Images Classification

Visually-Aware Fashion Recommendation and Design with Generative Image Models