Yo'LLaVA: Your Personalized Language and Vision Assistant

Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee
2024-06-14
Abstract: Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, "What should I buy for my dog's birthday?"; as opposed to a generic inquiry about "What should I buy for a dog's birthday?". Similarly, when looking at a friend's image, the interest lies in seeing their activities (e.g., "my friend is holding a cat"), rather than merely observing generic human actions (e.g., "a man is holding a cat"). In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo'LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is the following: although existing large multimodal models (LMMs) perform well on a variety of tasks (such as image captioning and visual question answering), their knowledge remains generic, and they cannot handle personalized subjects (such as recognizing a user's pet dog). The reason is that their training data consists mainly of common, general concepts and lacks personalized ones. As a result, without additional context they cannot recognize a specific subject or provide details about it.

Specifically, the paper aims to enable LMMs to adapt to and answer questions about a user's specific concepts by introducing a personalized multimodal model named Yo'LLaVA. For example, when a user asks "What is <bo> in the photo doing?" or "What birthday present should I buy for <bo>?", existing LMMs cannot give a personalized answer. Yo'LLaVA, in contrast, embeds a personalized concept by learning from a small number of images of the target subject, and can then hold conversations and answer questions about it.

### Main problem summary:

1. **Limitations of existing LMMs**: They cannot handle personalized queries, such as recognizing a specific subject or giving suggestions tailored to it.
2. **Personalized needs**: Users want AI assistants to understand and respond to questions about specific subjects (such as pets or friends).
3. **Technical challenges**: The model must learn new personalized concepts without degrading its pre-trained knowledge, while still capturing fine-grained visual features.

### Solutions:

- **Personalized multimodal model (Yo'LLaVA)**: Embeds a personalized concept into the model by learning from a small number of images of the target subject.
- **Hard negative example mining**: Introduces negative examples that are visually similar to the target subject but not identical, which helps the model learn the subtle features that distinguish it.
- **Efficient framework**: Updates only a small number of parameters and keeps the core weights of the pre-trained model fixed, so the model does not forget existing knowledge.

Through these methods, Yo'LLaVA can learn and understand personalized concepts while retaining its broad pre-trained knowledge, which enables more natural and personalized interactions.
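
The hard negative mining idea described above can be illustrated with a minimal sketch. The summary says only that the negatives are visually similar to the target but not identical; it does not say how they are found. The code below assumes, purely for illustration, that every image is already represented by an embedding vector and that "visually similar" means high cosine similarity to the mean of the subject's embeddings. The function names (`cosine_similarity`, `mean_vector`, `mine_hard_negatives`), the toy three-dimensional vectors, and the candidate names are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mine_hard_negatives(positive_embeddings, candidate_embeddings, k):
    """Return the names of the k candidates most similar to the subject.

    positive_embeddings: list of vectors for images of the target subject.
    candidate_embeddings: dict mapping a name to a vector; every candidate
        is assumed NOT to depict the target subject.
    """
    prototype = mean_vector(positive_embeddings)
    ranked = sorted(
        candidate_embeddings.items(),
        key=lambda item: cosine_similarity(prototype, item[1]),
        reverse=True,  # most similar (hardest) negatives first
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings (hypothetical values, chosen only to make the ranking obvious).
positives = [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1]]
candidates = {
    "other_shiba": [0.8, 0.3, 0.0],
    "golden_retriever": [0.5, 0.7, 0.1],
    "coffee_mug": [0.0, 0.1, 1.0],
}

hard_negatives = mine_hard_negatives(positives, candidates, k=2)
print(hard_negatives)  # ['other_shiba', 'golden_retriever']
```

In this toy setup the unrelated `coffee_mug` image is dropped, while the two look-alike dogs are kept as negatives. These are the examples that would force a model to rely on fine-grained features of the subject rather than on its general category.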