Retrieval-Augmented Personalization for Multimodal Large Language Models

Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue
2024-11-18
Abstract: The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in daily life. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP will retrieve relevant information from the database using a multimodal retriever. (c) Generate: The input query and retrieved concepts' information are fed into MLLMs to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing via updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on the dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at https://github.com/Hoar012/RAP-MLLM.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Multimedia
What problem does this paper attempt to address?
This paper addresses the limitations of existing large language models (LLMs) and their multimodal extensions (MLLMs) in personalized applications: they lack user-specific knowledge, which restricts their usefulness as personal assistants in daily life. Specifically:

1. **Lack of user-specific knowledge**: Although existing MLLMs are trained on large-scale datasets and have strong recognition and classification capabilities, transferring this knowledge to a user's personal concepts remains difficult. For example, current leading MLLMs cannot remember the name of a user's pet, even if the user has mentioned it before, and they have no awareness of the user's identity or preferences.
2. **Insufficient data for personalized generation tasks**: There is currently a lack of large-scale datasets for training the personalized generation capabilities of MLLMs, and collecting a large amount of personal data to train a unique assistant for each user is impractical.

To solve these problems, the authors propose a framework named **Retrieval-Augmented Personalization (RAP)**, which aims to let MLLMs update the concepts they support without additional training. RAP works in three steps:

- **Remember**: Design a key-value database to store user-related information, such as the user's name, avatar, and other attributes.
- **Retrieve**: When the user initiates a conversation, use a multimodal retriever to fetch the relevant information from the database.
- **Generate**: Feed the input query together with the retrieved concept information into the MLLM to generate personalized, knowledge-augmented responses.

In addition, to further improve generation quality and alignment with user-specific information, the authors design a data collection pipeline and create a dataset specifically for personalized training of MLLMs. Based on this dataset, they train a series of MLLMs as personalized multimodal assistants. Experimental results show that the resulting RAP-MLLMs perform well across a range of personalized generation tasks, including personalized image captioning, question answering, and visual recognition.
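
The Remember, Retrieve, and Generate steps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Concept` and `ConceptDatabase` classes, the `cosine` and `build_prompt` helpers, the prompt layout, and the hand-written three-dimensional vectors are all hypothetical stand-ins. In RAP the keys would come from a multimodal retriever and the assembled input would go to an MLLM; here the "embeddings" are toy lists of floats and the final step only builds a text prompt.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str  # hypothetical concept identifier, e.g. a pet's name
    info: str  # free-text description stored for the concept


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ConceptDatabase:
    """Toy key-value store: key = embedding vector, value = Concept."""
    entries: list = field(default_factory=list)

    def remember(self, key, concept):
        # Overwrite any entry with the same name, so a concept can be
        # edited at any time without retraining a model.
        self.entries = [(k, c) for k, c in self.entries if c.name != concept.name]
        self.entries.append((key, concept))

    def retrieve(self, query, top_k=1):
        # Rank stored concepts by similarity to the query embedding.
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [c for _, c in ranked[:top_k]]


def build_prompt(question, concepts):
    # Assumed prompt layout: retrieved concept info first, then the question.
    lines = [f"<{c.name}>: {c.info}" for c in concepts]
    return "\n".join(lines + [f"Question: {question}"])


# Remember: store two user concepts under toy embedding keys.
db = ConceptDatabase()
db.remember([0.9, 0.1, 0.0], Concept("bo", "the user's dog, a corgi"))
db.remember([0.0, 0.2, 0.9], Concept("mug", "the user's blue coffee mug"))

# Retrieve: a stand-in embedding for the image in the user's query.
query_embedding = [0.8, 0.2, 0.1]
hits = db.retrieve(query_embedding, top_k=1)

# Generate: in a real system this prompt (plus the image) would go to the MLLM.
prompt = build_prompt("What is in this photo?", hits)
print(prompt)

# Concept editing: update the stored description in place.
db.remember([0.9, 0.1, 0.0], Concept("bo", "the user's dog, a corgi who loves tennis balls"))
```

Because personalization lives in the external store rather than in model weights, changing what the assistant knows about a concept is a database write, which is the property the abstract refers to as real-time concept editing.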