Abstract: Personalized content filtering, such as recommender systems, has become a critical infrastructure to alleviate information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users' varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work in personalized image generation faces challenges in accurately capturing users' visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome the challenges, we propose a Personalized Image Generation Framework named Pigeon, which adopts exceptional large multimodal models with three dedicated modules to capture users' visual preferences and needs from noisy user history and multimodal instructions. To alleviate the data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.

What problem does this paper attempt to address?

The paper addresses three key issues in personalized image generation:

1. **Accurately capturing users' visual preferences**: Existing methods struggle to infer visual preferences from users' interaction histories, because the images a user has interacted with are noisy and reflect diverse, complex interests.
2. **Handling multimodal instructions**: The instructions users provide (such as reference images and text instructions) must be understood correctly and used to guide generation of the target image, which requires strong multimodal understanding and reasoning capabilities.
3. **Data scarcity**: There is a lack of supervised data for training personalized image generation models, in particular datasets of triplets consisting of user-interacted history images, multimodal instructions, and personalized target images.

To address these issues, the paper proposes Pigeon, a personalized image generation framework that builds on large multimodal models (LMMs) and captures users' visual preferences and needs through three dedicated modules:

1. **Mask Generation Module**: Uses a mask generator to create feature-level masks that filter noise out of the history images.
2. **Personalization Module**: Combines the masked history tokens with the semantic features of the multimodal instructions to generate personalized tokens that reflect the user's content needs.
3. **Image Generation Module**: Converts the personalized tokens into visual features and generates the personalized target image with a diffusion model (DM).

To mitigate data scarcity, Pigeon adopts a two-stage preference alignment scheme:

1. **Stage 1: Masked Preference Reconstruction**: Assumes that user-interacted history images, despite containing some noise, still partially reflect the user's implicit preferences. A supervised dataset is constructed on this basis, and supervised fine-tuning trains the model to extract user preferences from history images and reconstruct target images.
2. **Stage 2: Pairwise Preference Alignment**: Applies Direct Preference Optimization (DPO) to pairs of chosen and rejected outputs so that the model generates more personalized images.

With these components, Pigeon outperforms various generative baselines in quantitative evaluations of personalized sticker and movie poster generation, and it also receives high scores in human evaluation.
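
To make Stage 2 more concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not Pigeon's implementation: the summary does not say how the model computes the log-probabilities of the chosen and rejected outputs, so they are taken here as plain numbers, and the function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are sequence log-probabilities (floats): two from the
    model being trained ("policy") and two from a frozen reference model.
    `beta` is a hypothetical default, not a value taken from the paper.
    """
    # Implicit rewards: how much more likely the policy makes each output
    # compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy favors the chosen output more than the reference does,
    # so the loss falls below the zero-margin value of ln(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals ln 2 (about 0.693). Training lowers the loss by raising the relative likelihood of the chosen output over the rejected one.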

Personalized Image Generation with Large Multimodal Models

FaceChain: A Playground for Identity-Preserving Portrait Generation

Imaginique Expressions: Tailoring Personalized Short-Text-to-Image Generation Through Aesthetic Assessment and Human Insights

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Personalized Representation from Personalized Generation

Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond

ViPer: Visual Personalization of Generative Models via Individual Preference Learning

Two Birds with One Stone: Transforming and Generating Facial Images with Iterative GAN

Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

PMG : Personalized Multimodal Generation with Large Language Models

Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

User-Friendly Customized Generation with Multi-Modal Prompts

Imagine yourself: Tuning-Free Personalized Image Generation

Human Aesthetic Preference-Based Large Text-to-Image Model Personalization: Kandinsky Generation as an Example

Parameter-Guided Image Generation with Denoising Diffusion Probabilistic Models

Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Fast Personalized Text to Image Synthesis with Attention Injection

Unified Text-to-Image Generation and Retrieval

Towards Universal Multi-Modal Personalization: A Language Model Empowered Generative Paradigm

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models