Cross-Image Attention for Zero-Shot Appearance Transfer

Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or
2023-11-07
Abstract: Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images -- one depicting the target structure and the other specifying the desired appearance -- our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
The paper addresses zero-shot appearance transfer between images. Its goal is to transfer the appearance (texture, color, and other visual features) of an object in one image onto the structure of an object in another image, without any additional training or optimization. Doing so requires establishing semantic correspondences between objects that may differ in shape, size, and viewpoint. Earlier approaches typically require training a model for a specific object category or assume that the source and target objects have similar shapes. In contrast, the proposed method works in a zero-shot setting and applies across object categories. It builds on the self-attention layers of a pre-trained text-to-image generative model and introduces a cross-image attention mechanism that combines the queries of the structure image with the keys and values of the appearance image, which implicitly establishes semantic correspondences between the two images. To improve output quality, the authors also use three mechanisms that manipulate either the noisy latent codes or the model's internal representations during the denoising process. Experiments show that the method is effective across a wide range of object categories and robust to variations in shape, size, and viewpoint between the two input images.
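The cross-image attention operation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned projection layers, and uses random arrays in place of features from a real denoising network. The function name `cross_image_attention`, the variable names, and the token counts are all hypothetical, and the three quality-improving mechanisms are not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_image_attention(q_struct, k_app, v_app):
    """Scaled dot-product attention with queries from the structure image
    and keys/values from the appearance image (hypothetical helper)."""
    d = q_struct.shape[-1]
    # Similarity between every structure token and every appearance token.
    scores = q_struct @ k_app.T / np.sqrt(d)
    # Each row is a soft correspondence from one structure token
    # to all appearance tokens.
    weights = softmax(scores, axis=-1)
    # Each output token is a weighted mix of appearance values.
    return weights @ v_app, weights


# Toy stand-ins for per-token features (assumed shapes, random data).
rng = np.random.default_rng(0)
n_struct, n_app, d = 6, 4, 8
q_struct = rng.normal(size=(n_struct, d))  # queries: structure image
k_app = rng.normal(size=(n_app, d))        # keys: appearance image
v_app = rng.normal(size=(n_app, d))        # values: appearance image

out, weights = cross_image_attention(q_struct, k_app, v_app)
print(out.shape, weights.shape)
```

The output has one token per structure-image query, so the spatial layout follows the structure image, while each token's content is a convex combination of appearance-image values. The rows of `weights` play the role of the implicit semantic correspondences mentioned in the abstract.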