Abstract: The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in people's daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concept information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to an unlimited range of visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering, and visual recognition. The code, data, and models are available at https://github.com/Hoar012/RAP-MLLM.
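
The Remember, Retrieve, and Generate steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Concept`, `ConceptDatabase`, and `build_prompt` are hypothetical, the three-dimensional vectors stand in for the embeddings a real multimodal retriever would produce, and the final call to an MLLM is omitted, so the sketch stops at assembling the augmented prompt.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    """One user-specific concept (hypothetical record layout)."""
    name: str
    info: str
    embedding: List[float]  # assumption: stand-in for a multimodal embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ConceptDatabase:
    """Key-value store keyed by concept name."""
    records: Dict[str, Concept] = field(default_factory=dict)

    def remember(self, concept: Concept) -> None:
        # Adding and editing are the same operation: overwrite the key.
        self.records[concept.name] = concept

    def forget(self, name: str) -> None:
        self.records.pop(name, None)

    def retrieve(self, query_embedding: List[float], k: int = 2) -> List[Concept]:
        # Rank all stored concepts by similarity to the query, keep the top k.
        ranked = sorted(
            self.records.values(),
            key=lambda c: cosine(query_embedding, c.embedding),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(question: str, concepts: List[Concept]) -> str:
    """Prepend retrieved concept information to the user's question."""
    lines = [f"<{c.name}>: {c.info}" for c in concepts]
    return "\n".join(lines + [question])


# Remember: store two toy concepts.
db = ConceptDatabase()
db.remember(Concept("bo", "a golden retriever owned by the user", [0.9, 0.1, 0.0]))
db.remember(Concept("mug", "the user's blue coffee mug", [0.0, 0.2, 0.9]))

# Retrieve: the query embedding would come from the input image and text.
query_embedding = [0.8, 0.2, 0.1]
top = db.retrieve(query_embedding, k=1)

# Generate: in a full system this prompt would be passed to the MLLM.
prompt = build_prompt("What is in this photo?", top)
print(prompt)
```

Because the concept information lives in the external store rather than in model weights, editing a concept in this sketch is just another `remember` call with the same name, which mirrors the real-time concept editing the abstract describes.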

Recurrent Neural Network Based Language Model Personalization by Social Network Crowdsourcing

Personalizing universal recurrent neural network language model with user characteristic features by social network crowdsourcing

Personalized language modeling by crowd sourcing with social network data for voice access of cloud applications

A Persona-Based Neural Conversation Model

Personalized Speech Recognizer With Keyword-Based Personalized Lexicon And Language Model Using Word Vector Representations

Efficient Transfer Learning Schemes for Personalized Language Modeling using Recurrent Neural Network

Deep Shallow Fusion for RNN-T Personalization

Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning Using Acoustic Tokens Discovered from Unlabeled Data

Recurrent Neural Network Language Model with Part-of-speech for Mandarin Speech Recognition

Recurrent Neural Network Based Language Model Adaptation for Accent Mandarin Speech

Learning from My Friends: Few-Shot Personalized Conversation Systems via Social Networks

Improving Accented Mandarin Speech Recognition by Using Recurrent Neural Network Based Language Model Adaptation

Neural personalized response generation as domain adaptation

Federated Natural Language Generation for Personalized Dialogue System

Adaptive User Modeling with Long and Short-Term Preferences for Personalized Recommendation

Personalized Word Representations Carrying Personalized Semantics Learned from Social Network Posts

Retrieval-Augmented Personalization for Multimodal Large Language Models

Personalized Language Modeling from Personalized Human Feedback

Modeling Speaker Variability Using Long Short-Term Memory Networks For Speech Recognition

Personalized Multimodal Large Language Models: A Survey