Abstract: The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in daily life. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at https://github.com/Hoar012/RAP-MLLM.

What problem does this paper attempt to address?

This paper addresses the limitations of existing large language models (LLMs) and their multimodal extensions (MLLMs) in personalized applications. In particular, these models lack user-specific knowledge, which restricts their effectiveness as personalized assistants in daily life. Specifically:

1. **Lack of user-specific knowledge**: Although existing MLLMs have been trained on large-scale datasets and possess strong recognition and classification capabilities, transferring this knowledge directly to a user's personal concepts remains difficult. For example, current leading MLLMs cannot remember the name of a user's pet, even if the user has mentioned it before, and they are unaware of the user's identity and preferences.
2. **Insufficient data for personalized generation tasks**: There are no large-scale datasets for training the personalized generation capabilities of MLLMs, and collecting a large amount of personal data to train a unique assistant for each user is impractical.

To solve these problems, the authors propose a framework named **Retrieval-Augmented Personalization (RAP)**, which aims to let MLLMs update the concepts they support without additional training. The RAP framework achieves this through three key steps:

- **Remember**: Design a key-value database to store user-related information, such as the user's name, avatar, and other attributes.
- **Retrieve**: When the user initiates a conversation, RAP uses a multimodal retriever to retrieve relevant information from the database.
- **Generate**: Feed the input query and the retrieved concept information into the MLLM to generate personalized, knowledge-augmented responses.

In addition, to further improve generation quality and alignment with user-specific information, the authors design a data collection pipeline and create a dataset specifically for personalized training of MLLMs. Based on this dataset, they train a series of MLLMs as personalized multimodal assistants. According to the paper, the resulting RAP-MLLMs perform well across personalized generation tasks, including personalized image captioning, question answering, and visual recognition.
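
The Remember, Retrieve, and Generate steps can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Concept`, `ConceptDatabase`, `cosine`, and `build_prompt` are hypothetical, the hand-written three-dimensional vectors stand in for features that a real multimodal retriever would compute from images and text, and the final call to an MLLM is omitted, so the sketch stops at the assembled prompt.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str               # key, e.g. a pet's or person's name
    info: str               # free-text attributes of the concept
    embedding: list[float]  # assumption: stand-in for a retriever feature


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ConceptDatabase:
    # Key-value store: concept name -> concept record.
    records: dict[str, Concept] = field(default_factory=dict)

    def remember(self, concept: Concept) -> None:
        # Inserting or overwriting an entry edits the assistant's
        # knowledge without touching any model weights.
        self.records[concept.name] = concept

    def forget(self, name: str) -> None:
        self.records.pop(name, None)

    def retrieve(self, query: list[float], top_k: int = 1) -> list[Concept]:
        # Rank stored concepts by similarity to the query embedding.
        ranked = sorted(
            self.records.values(),
            key=lambda c: cosine(query, c.embedding),
            reverse=True,
        )
        return ranked[:top_k]


def build_prompt(question: str, concepts: list[Concept]) -> str:
    # Generate step (input side only): prepend retrieved concept
    # information to the user's question.
    context = "\n".join(f"<{c.name}>: {c.info}" for c in concepts)
    return f"{context}\nQuestion: {question}"


# Remember: register two illustrative concepts with made-up embeddings.
db = ConceptDatabase()
db.remember(Concept("Bo", "the user's corgi", [0.9, 0.1, 0.0]))
db.remember(Concept("Mia", "the user's sister", [0.0, 0.2, 0.9]))

# Retrieve: the query vector stands in for the embedded input image.
query = [0.8, 0.2, 0.1]
hits = db.retrieve(query, top_k=1)

# Generate: this prompt would be passed to the MLLM with the image.
prompt = build_prompt("Who is in this photo?", hits)
print(prompt)
```

Because the concept information lives in the external store, calling `remember` again with the same name, or calling `forget`, changes what later queries retrieve immediately. That is the property the abstract describes as real-time concept editing.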

Retrieval-Augmented Personalization for Multimodal Large Language Models

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Integrating Summarization and Retrieval for Enhanced Personalization via Large Language Models

RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training

Personalized Multimodal Large Language Models: A Survey

MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training

RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

RRAML: Reinforced Retrieval Augmented Machine Learning

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models

R4: Reinforced Retriever-Reorder-Responder for Retrieval-Augmented Large Language Models

PEARL: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit

AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant

Optimization Methods for Personalizing Large Language Models through Retrieval Augmentation

LLMGA: Multimodal Large Language Model based Generation Assistant

PMG: Personalized Multimodal Generation with Large Language Models

Yo'LLaVA: Your Personalized Language and Vision Assistant