Personalized Image Generation with Large Multimodal Models

Yiyan Xu,Wenjie Wang,Yang Zhang,Tang Biao,Peng Yan,Fuli Feng,Xiangnan He
2024-10-18
Abstract: Personalized content filtering, such as recommender systems, has become a critical infrastructure to alleviate information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users' varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work in personalized image generation faces challenges in accurately capturing users' visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome the challenges, we propose a Personalized Image Generation Framework named Pigeon, which adopts exceptional large multimodal models with three dedicated modules to capture users' visual preferences and needs from noisy user history and multimodal instructions. To alleviate the data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
Information Retrieval
What problem does this paper attempt to address?
The paper addresses three key issues in personalized image generation:

1. **Accurately capturing users' visual preferences**: Existing methods struggle to infer visual preferences from a user's history of interacted images, which is noisy and often reflects diverse, complex interests.
2. **Handling multimodal instructions**: Users provide multimodal instructions (such as reference images and text instructions) that must be understood correctly to generate the target image, which requires strong multimodal understanding and reasoning.
3. **Data scarcity**: There is a lack of supervised data for training personalized image generation models, in particular datasets of triplets that pair a user's historical images and multimodal instructions with a personalized target image.

To address these issues, the paper proposes a personalized image generation framework named Pigeon, which employs large multimodal models (LMMs) and captures users' visual preferences and needs through three dedicated modules:

1. **Mask Generation Module**: A mask generator creates feature-level masks that filter noisy signals out of the historical images.
2. **Personalization Module**: Combines the masked historical tokens with the semantic features of the multimodal instructions to generate personalized tokens that reflect the user's content needs.
3. **Image Generation Module**: Converts the personalized tokens into visual features and generates the personalized target image with a diffusion model (DM).

To address the data scarcity issue, Pigeon adopts a two-stage preference alignment scheme:

1. **Stage 1: Masked Preference Reconstruction**: Assumes that a user's historical images, although noisy, still partially reflect the user's implicit preferences. A supervised dataset is constructed on this assumption, and supervised fine-tuning teaches the model to extract user preferences from historical images and reconstruct target images.
2. **Stage 2: Pairwise Preference Alignment**: Applies Direct Preference Optimization (DPO) to preference pairs (a chosen sample and a rejected sample) so that the model generates more personalized images.

With this design, Pigeon is applied to personalized sticker and movie poster generation, where it outperforms various generative baselines in quantitative evaluations and is also favored in human evaluation.
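
As a rough illustration of the DPO-based second stage, the sketch below computes the standard DPO loss for a single preference pair. It is not taken from the paper: the function names, the scalar log-probability inputs, and the default `beta = 0.1` are assumptions made for illustration, and the summary above does not state how Pigeon computes these quantities over image tokens.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability that the trainable policy
    or the frozen reference model assigns to the chosen or rejected sample.
    """
    # How much more the policy likes each sample than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen margin grows past the rejected margin.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference model agree on both samples, the loss equals log 2; it falls below that value once the policy favors the chosen sample more strongly than the reference does, and rises above it in the opposite case.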