UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, Xiao-Ming Wu
2024-10-12
Abstract: The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. Rapid advances in AI-generated content, particularly large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models to the fashion domain. However, tasks involving embeddings, such as image-to-text and text-to-image retrieval, have been largely overlooked from this perspective due to the diverse nature of the multimodal fashion domain. Moreover, current research on single multi-task models lacks a focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval tasks and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and an LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks, and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain. The source code is available at https://github.com/xiangyu-mm/UniFashion.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the limitations of current multimodal fashion models in handling generation and retrieval tasks. Specifically:

1. **Lack of a unified framework**: Existing fashion models usually focus on a single task (such as image generation or text retrieval). No unified framework handles multiple tasks simultaneously, such as cross-modal retrieval, composed image retrieval, fashion image captioning, and fashion image generation.
2. **Neglect of embedding tasks**: Most existing models do not effectively handle embedding tasks, such as image-to-text and text-to-image retrieval, which are important in the fashion domain.
3. **Insufficient generation ability**: Existing unified fashion models often cannot generate images, and perform especially poorly on tasks that require generating a target image from multimodal inputs.

To solve these problems, the authors propose **UniFashion**, a unified vision-language model framework that tackles multimodal generation and retrieval tasks in the fashion domain by integrating a large language model (LLM) with a diffusion model. The main contributions of UniFashion include:

- **Joint modeling of multimodal retrieval and generation in the fashion domain**: The authors describe this as the first in-depth study of the topic. It exploits the correlations between tasks and introduces a general, unified model intended to handle all fashion tasks.
- **Mutual reinforcement between tasks**: The caption generation module assists the composed image retrieval task, and jointly training the generation and retrieval tasks improves the multimodal encoder of the diffusion module.
- **Experimental results**: Across multiple fashion tasks (cross-modal retrieval, composed image retrieval, and multimodal generation), the unified model is reported to significantly outperform previous state-of-the-art methods.

With these improvements, UniFashion performs well across multiple fashion tasks and offers a promising direction for future research.