Abstract: Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, "What should I buy for my dog's birthday?", as opposed to a generic inquiry about "What should I buy for a dog's birthday?". Similarly, when looking at a friend's image, the interest lies in seeing their activities (e.g., "my friend is holding a cat"), rather than merely observing generic human actions (e.g., "a man is holding a cat"). In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo'LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).

What problem does this paper attempt to address?

Existing large multimodal models (LMMs) perform well on a variety of tasks (such as image captioning and visual question answering), but their knowledge remains generic, and they cannot handle personalized subjects (such as recognizing a user's pet dog). The reason is that their training data consists mainly of common, general concepts and lacks personalized ones, so without additional context they cannot recognize a specific subject or provide details about it. The paper aims to let LMMs adapt to and answer questions about a user's specific concepts by introducing a personalized multimodal model named Yo'LLaVA. For example, when a user asks "What is <bo> in the photo doing?" or "What birthday present should I buy for <bo>?", existing LMMs cannot give a personalized answer. Yo'LLaVA, in contrast, embeds the personalized concept by learning from a small number of images of the target subject, and can then hold conversations and answer questions about it.

### Main problem summary:

1. **Limitations of existing LMMs**: They cannot handle personalized queries, such as recognizing a specific subject or giving personalized suggestions.
2. **Personalized needs**: Users want AI assistants to understand and respond to questions about specific subjects (such as pets or friends).
3. **Technical challenges**: LMMs must learn new personalized concepts without degrading pre-trained knowledge, and must capture fine-grained visual features.

### Solutions:

- **Personalized multimodal model (Yo'LLaVA)**: Embed a personalized concept into the model by learning from a small number of images of the target subject.
- **Hard negative example mining**: Introduce negative examples that are visually similar but not identical to the target, so the model learns to distinguish its subtle features.
- **Efficient framework**: Update only a small number of parameters and keep the core weights of the pre-trained model, so the model does not forget existing knowledge.

Through these methods, Yo'LLaVA can learn and understand personalized concepts while retaining its broad pre-trained knowledge, enabling more natural and personalized interactions.
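The hard negative mining idea mentioned above can be illustrated with a minimal sketch. It assumes that each image has already been mapped to an embedding vector by some image encoder, and that "visually similar" is measured by cosine similarity to the mean of the subject's embeddings; both choices, along with the function names `cosine` and `mine_hard_negatives`, are illustrative assumptions and not details confirmed by this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_hard_negatives(subject_embs, candidate_embs, k):
    """Return indices of the k candidates most similar to the subject.

    Assumption: candidates are images that do NOT show the subject, so the
    most similar ones are the hardest negatives to tell apart from it.
    """
    dim = len(subject_embs[0])
    # Mean embedding of the handful of subject images.
    centroid = [sum(e[i] for e in subject_embs) / len(subject_embs)
                for i in range(dim)]
    ranked = sorted(range(len(candidate_embs)),
                    key=lambda i: cosine(centroid, candidate_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D embeddings (hypothetical values, for illustration only).
subject = [[1.0, 0.1], [0.9, 0.2]]
candidates = [[0.0, 1.0], [1.0, 0.0], [0.8, 0.3], [-1.0, 0.0]]
hard = mine_hard_negatives(subject, candidates, k=2)
print(hard)
```

In this toy setup the two candidates pointing in nearly the same direction as the subject are selected, while the orthogonal and opposite ones are skipped as easy negatives.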

Yo'LLaVA: Your Personalized Language and Vision Assistant

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model

MyVLM: Personalizing VLMs for User-Specific Queries

Retrieval-Augmented Personalization for Multimodal Large Language Models

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

LLaVA-Docent: Instruction Tuning with Multimodal Large Language Model to Support Art Appreciation Education

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Visually-Augmented Language Modeling

LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

Training a Vision Language Model as Smartphone Assistant

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Meta-Personalizing Vision-Language Models to Find Named Instances in Video

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Towards Interpreting Visual Information Processing in Vision-Language Models

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant