Cross-Image Attention for Zero-Shot Appearance Transfer

Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or

2023-11-07

Abstract: Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images -- one depicting the target structure and the other specifying the desired appearance -- our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.

Computer Vision and Pattern Recognition, Graphics

What problem does this paper attempt to address?

The paper addresses zero-shot appearance transfer between images. Given a pair of images, one depicting the target structure and the other specifying the desired appearance, the goal is to generate an image that keeps the structure of the first while adopting the appearance (texture, color, and other visual features) of the second, without any additional training or optimization. This requires establishing semantic correspondences between objects that share similar semantics but may differ in shape, size, and viewpoint. Traditional approaches usually require training a model for a specific object category or assume that the source and target images have similar shapes. In contrast, the proposed method works in a zero-shot setting and applies across object categories. It builds on the self-attention layers of a pre-trained text-to-image generative model and introduces a cross-image attention mechanism: the queries of the structure image are combined with the keys and values of the appearance image, which implicitly establishes semantic correspondences between the two images during the denoising process. To improve output quality, the authors additionally use three mechanisms that manipulate either the noisy latent codes or the model's internal representations throughout denoising. Experiments show that the method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
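The query/key/value combination described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention over small lists of feature vectors, omits the learned projections, multiple heads, and denoising loop of a real diffusion model, and all names (`attention`, `cross_image_attention`, `q_struct`, `k_app`, `v_app`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def cross_image_attention(q_struct, k_app, v_app):
    """Queries come from the structure image; keys and values come from the
    appearance image (hypothetical helper, for illustration only)."""
    return attention(q_struct, k_app, v_app)


# Toy features (made up): 2 structure tokens, 3 appearance tokens.
q_struct = [[10.0, 0.0], [0.0, 10.0]]
k_app = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_app = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

out = cross_image_attention(q_struct, k_app, v_app)
```

In this sketch the output has one row per structure token, so the spatial layout follows the structure image, while each row is a weighted average of appearance values, so the content is drawn from the appearance image. Ordinary self-attention would instead call `attention` with queries, keys, and values all taken from the same image.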

GENERATING MANIFOLD-ALIGNED SEMANTIC FEATURE FOR ZERO-SHOT LEARNING

Text-to-image Generation Based on Spatial-Channel Attention and Semantic Redescription

Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering

Multi-modal Generative Adversarial Network for Zero-Shot Learning

Attribute-Guided Network for Cross-Modal Zero-Shot Hashing

Better Transferability with Attribute Attention for Generalized Zero-Shot Learning

Semantic Enhanced Cross-modal GAN for Zero-shot Learning

Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

Multi-Channel Attention Selection GAN With Cascaded Semantic Guidance for Cross-View Image Translation

Side-Scan Sonar Image Classification With Zero-Shot and Style Transfer

Multiscale Visual-Attribute Co-Attention for Zero-Shot Image Recognition

Integrating Adversarial Generative Network with Variational Autoencoders Towards Cross-Modal Alignment for Zero-Shot Remote Sensing Image Scene Classification

Visual-Semantic Aligned Bidirectional Network for Zero-Shot Learning

Inter-Modality Fusion Based Attention for Zero-Shot Cross-Modal Retrieval

Fine-grained Appearance Transfer with Diffusion Models

Boundary Attention Constrained Zero-Shot Layout-To-Image Generation

Progressive Cross-Modal Semantic Network for Zero-Shot Sketch-Based Image Retrieval

Asymmetric Generative Adversarial Networks with a New Attention Mechanism

Data-Aware Zero-Shot Neural Architecture Search for Image Recognition