LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing

Aoyang Liu, Qingnan Fan, Shuai Qin, Hong Gu, Yansong Tang
2024-06-25
Abstract: Although recent years have witnessed significant advancements in image editing thanks to the remarkable progress of text-to-image diffusion models, the problem of non-rigid image editing still presents its complexities and challenges. Existing methods often fail to achieve consistent results due to the absence of unique identity characteristics. Thus, learning a personalized identity prior might help with consistency in the edited results. In this paper, we explore a novel task: learning the personalized identity prior for text-based non-rigid image editing. To address the problems in jointly learning prior and editing the image, we present LIPE, a two-stage framework designed to customize the generative model utilizing a limited set of images of the same subject, and subsequently employ the model with learned prior for non-rigid image editing. Experimental results demonstrate the advantages of our approach in various editing scenarios over past related leading methods in qualitative and quantitative ways.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is maintaining identity consistency in non-rigid image editing. Existing methods often fail to preserve the subject's identity characteristics when editing images, especially under non-rigid transformations such as changes of posture, expression, or viewpoint, which leads to inconsistent editing results. The authors therefore propose a new task: improving the consistency of non-rigid image editing results by learning a personalized identity prior.

### Main problem description in the paper

1. **Limitations of existing methods**:
   - Existing methods usually rely on the general domain prior of large-scale text-to-image (T2I) models. Although these models have strong generation capabilities, they perform poorly at preserving personalized identity characteristics.
   - These methods rely mainly on text prompts, which offer limited control, and are prone to modifying image regions that should stay unchanged.
   - Some recent works have attempted to customize personalized face priors, but they require a large number of reference images (about 100) and are limited to portrait editing, so they cannot handle broader non-rigid editing.
2. **Research objectives**:
   - Given a small number (3-5) of reference images of the same identity, can a personalized identity prior be learned that supports non-rigid editing of test images while maintaining the unique properties of the identity?
   - Propose a new framework that achieves high-quality non-rigid image editing while maintaining identity characteristics.

### Overview of the solution

To solve the above problems, the authors propose a two-stage framework named LIPE (Learning Personalized Identity Prior for Non-rigid Image Editing):

1. **Learning of personalized identity prior**:
   - Use a limited number of reference images to fine-tune the pre-trained T2I model so that it learns the personalized identity prior.
   - Generate detailed text-image pairs through data augmentation to improve the model's understanding and generation of non-rigid attributes.
2. **Non-rigid image editing**:
   - Use the identity-aware mask blend (NIMA) technique to precisely control the target object during editing and avoid changes to the background and other irrelevant attributes.

### Main contributions

- **Introduce a new task**: Non-rigid image editing with a personalized identity prior.
- **Propose a new method**: The LIPE framework, which addresses the technical problems of personalized identity prior learning and non-rigid editing.
- **Establish a new dataset**: A dataset designed specifically for this task, covering multiple categories of objects, for evaluating model performance.

According to the reported experiments, LIPE outperforms existing methods in identity consistency, background consistency, and editing satisfaction.
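
The mask-blending idea behind the editing stage can be illustrated with a small sketch. This is not the paper's NIMA implementation: it only shows the generic principle of keeping source content outside a subject mask and taking edited content inside it. The function name `blend_with_mask` and the toy nested-list "images" are assumptions made for illustration; a real pipeline would presumably apply such a blend to diffusion latents or feature maps.

```python
def blend_with_mask(source, edited, mask):
    """Blend two same-sized 2D grids using a per-pixel mask.

    mask value 1.0 -> take the edited pixel (subject region)
    mask value 0.0 -> keep the source pixel (background)
    Values in between mix the two linearly.
    Hypothetical helper for illustration only.
    """
    if not (len(source) == len(edited) == len(mask)):
        raise ValueError("source, edited and mask must have the same height")
    blended = []
    for src_row, edit_row, mask_row in zip(source, edited, mask):
        if not (len(src_row) == len(edit_row) == len(mask_row)):
            raise ValueError("source, edited and mask must have the same width")
        blended.append(
            [m * e + (1.0 - m) * s for s, e, m in zip(src_row, edit_row, mask_row)]
        )
    return blended


# Toy 2x3 example: source is all 0.0, edited is all 1.0.
source = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
edited = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mask = [[0.0, 1.0, 0.5], [0.0, 0.0, 1.0]]

result = blend_with_mask(source, edited, mask)
print(result)
```

Where the mask is 0 the source values survive untouched, which is the property that keeps the background and other irrelevant regions stable during an edit.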