Abstract:Image harmonization is a critical task in computer vision, which aims to adjust the foreground to make it compatible with the background. Recent works mainly focus on using global transformations (i.e., normalization and color curve rendering) to achieve visual consistency. However, these models ignore local visual consistency and their huge model sizes limit their harmonization ability on edge devices. In this paper, we propose a hierarchical dynamic network (HDNet) to adapt features from local to global view for better feature transformation in efficient image harmonization. Inspired by the success of various dynamic models, local dynamic (LD) module and mask-aware global dynamic (MGD) module are proposed in this paper. Specifically, LD matches local representations between the foreground and background regions based on semantic similarities, then adaptively adjust every foreground local representation according to the appearance of its K-nearest neighbor background regions. In this way, LD can produce more realistic images at a more fine-grained level, and simultaneously enjoy the characteristic of semantic alignment. The MGD effectively applies distinct convolution to the foreground and background, learning the representations of foreground and background regions as well as their correlations to the global harmonization, facilitating local visual consistency for the images much more efficiently. Experimental results demonstrate that the proposed HDNet significantly reduces the total model parameters by more than 80% compared to previous methods, while still attaining state-of-the-art performance on the popular iHarmony4 dataset. Additionally, we introduced a lightweight version of HDNet, i.e., HDNet-lite, which has only 0.65MB parameters, yet it still achieve competitive performance. Our code is avaliable at https://github.com/chenhaoxing/HDNet.

Video Harmonization with Triplet Spatio-Temporal Variation Patterns

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

TSA2: Temporal Segment Adaptation and Aggregation for Video Harmonization

Temporally Coherent Video Harmonization Using Adversarial Networks

Harmonizer: Learning to Perform White-Box Image and Video Harmonization

HYouTube: Video Harmonization Dataset

Quality Harmonization for Virtual Composition in Online Video Communications

Video-Language Alignment via Spatio-Temporal Graph Transformer

RaSTFormer: region-aware spatiotemporal transformer for visual homogenization recognition in short videos

SSH: A Self-Supervised Framework for Image Harmonization

Deep Image Harmonization with Globally Guided Feature Transformation and Relation Distillation

Image harmonization with Simple Hybrid CNN-Transformer Network

Image Harmonization with Region-wise Contrastive Learning

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Hierarchical Dynamic Image Harmonization

Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring

VidToMe: Video Token Merging for Zero-Shot Video Editing

DiffHarmony: Latent Diffusion Model Meets Image Harmonization

Video Editing with Temporal, Spatial and Appearance Consistency

Aggregating Nearest Sharp Features via Hybrid Transformers for Video Deblurring

Temporal Tessellation: A Unified Approach for Video Analysis