The Contextual Loss for Image Transformation with Non-Aligned Data

Roey Mechrez, Itamar Talmi, Lihi Zelnik-Manor
DOI: https://doi.org/10.48550/arXiv.1803.02077
2018-07-18
Abstract: Feed-forward CNNs trained for image transformation problems rely on loss functions that measure the similarity between the generated image and a target image. Most of the common loss functions assume that these images are spatially aligned and compare pixels at corresponding locations. However, for many tasks, aligned training pairs of images will not be available. We present an alternative loss function that does not require alignment, thus providing an effective and simple solution for a new space of problems. Our loss is based on both context and semantics -- it compares regions with similar semantic meaning, while considering the context of the entire image. Hence, for example, when transferring the style of one face to another, it will translate eyes-to-eyes and mouth-to-mouth. Our code can be found at https://www.github.com/roimehrez/contextualLoss
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to train image transformation networks when the training data is non-aligned. Traditional methods usually rely on pixel-level loss functions to measure the similarity between the generated image and the target image. These losses assume that the two images are spatially aligned, so that pixels at the same position can be compared. In many practical applications, however, such as semantic style transfer, single-image animation, puppet control, and unpaired domain transfer, the training pairs are not aligned, so pixel-level losses cannot be applied directly.

To address this, the authors propose a new loss function, the Contextual Loss. It does not require the images to be spatially aligned, which makes it a simple and effective alternative. The Contextual Loss compares images based on their content and semantics: it considers both the similarity of individual features and the context of the entire image. As a result, it remains usable even when the generated and target images are spatially deformed relative to each other. Specifically, the Contextual Loss works in three steps:

1. **Feature Representation**: Represent each image as a set of high-dimensional points (features).
2. **Context Similarity**: Define a context similarity measure for comparing features across the two images. A feature in one image and a feature in the other are considered contextually similar when the second is the first's closest match, that is, markedly closer to it than the other features of that image are.
3. **Loss Function**: Define a loss function based on context similarity. Minimizing this loss between the generated image and the target image during training encourages the generated image to match the target in content and semantics.

With this approach, the Contextual Loss can handle non-aligned data, and the paper reports strong results on several image transformation tasks, including semantic style transfer, single-image animation, puppet control, and unpaired domain transfer.
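
The three steps described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' released code: it assumes the features have already been extracted as rows of an array (in practice they would typically come from a pretrained network), it uses cosine distance between features, and the function names `contextual_similarity` and `contextual_loss`, the bandwidth `h`, and the stabiliser `eps` are hypothetical choices made for this example.

```python
import numpy as np


def contextual_similarity(x, y, h=0.5, eps=1e-5):
    """Contextual similarity between two feature sets.

    x: (N, D) array of features from the generated image.
    y: (M, D) array of features from the target image.
    h and eps are illustrative defaults (assumptions), not tuned values.
    """
    # Step 1: treat each image as a set of feature points.
    # Center on the target mean and L2-normalise so that a dot product
    # gives cosine similarity.
    mu = y.mean(axis=0, keepdims=True)
    xc = x - mu
    yc = y - mu
    xc = xc / np.maximum(np.linalg.norm(xc, axis=1, keepdims=True), 1e-12)
    yc = yc / np.maximum(np.linalg.norm(yc, axis=1, keepdims=True), 1e-12)

    # Cosine distance between every x_i and every y_j, shape (N, M).
    d = np.maximum(1.0 - xc @ yc.T, 0.0)

    # Step 2: context similarity. Divide each row by its smallest distance,
    # so a match only scores highly if it is much closer than the rest.
    d_norm = d / (d.min(axis=1, keepdims=True) + eps)
    w = np.exp((1.0 - d_norm) / h)
    cx = w / w.sum(axis=1, keepdims=True)

    # For each target feature, keep its best-matching generated feature,
    # then average over the target features.
    return cx.max(axis=0).mean()


def contextual_loss(x, y, h=0.5, eps=1e-5):
    """Step 3: turn the similarity into a loss to be minimised."""
    return -np.log(contextual_similarity(x, y, h=h, eps=eps))
```

Because the loss compares sets of features rather than positions, shuffling the rows of `x` or `y` leaves its value unchanged, which is the property that makes it suitable for non-aligned image pairs.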