Abstract: Fusing images acquired by different sensors produces a single output with enhanced information for high-level visual perception applications. The transformer architecture has demonstrated a powerful ability to capture important global contextual dependencies in multi-modal image fusion tasks. However, transformer-based image fusion methods face several critical issues, including heavy computational cost, limited ability to learn local features, and difficulty handling images of arbitrary sizes. To address these limitations, we propose a novel Laplacian Pyramid Hybrid (LapH) network that combines the advantages of CNN and transformer architectures for multi-modal image fusion tasks. Following a divide-and-conquer philosophy, we first build a lightweight CNN-based branch, which performs effective extraction and fusion of texture/edge features via central difference convolutions, to process the high-resolution components with abundant details encoded in the lower levels of the Laplacian pyramid. Then, we design a transformer-based branch to process the low-resolution base components, learning long-range dependencies of global contextual features without incurring extensive computational loads. Here, we design a multi-scale recurrent modulation mechanism that uses the edge/texture features from the CNN branch as guidance to progressively refine feature extraction and fusion on the low-frequency components. Finally, we propose a new multi-scale spatial consistency loss term based on the neighbor contrast in the source images, generating fused images with more natural and realistic appearances. Extensive experiments on two different multi-modal image fusion tasks verify the superiority of our method. The source code is publicly available at https://github.com/rgttadv/LapH.
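To make the divide-and-conquer idea concrete, the following is a minimal sketch of a Laplacian pyramid split into high-resolution detail levels and a low-resolution base, with exact reconstruction. It is not the LapH implementation: the function names are hypothetical, 2x2 average pooling and nearest-neighbor upsampling stand in for the usual Gaussian filtering, and the hand-crafted fusion rules (max-absolute detail, averaged base) merely mark the places where the paper uses its learned CNN and transformer branches.

```python
import numpy as np


def downsample(img):
    # Assumption: 2x2 average pooling as a simple stand-in for blur + decimation.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(img):
    # Assumption: nearest-neighbor 2x upsampling.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)


def build_laplacian_pyramid(img, levels):
    """Return [detail_0, ..., detail_{levels-1}, base].

    Height and width must be divisible by 2**levels.
    """
    pyramid = []
    current = img.astype(np.float64)
    for _ in range(levels):
        low = downsample(current)
        # The detail level holds what is lost by going to the coarser scale.
        pyramid.append(current - upsample(low))
        current = low
    pyramid.append(current)  # low-resolution base component
    return pyramid


def reconstruct(pyramid):
    # Invert the decomposition from the coarsest level upward.
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = upsample(current) + detail
    return current


def fuse(img_a, img_b, levels=3):
    """Classical hand-crafted fusion, NOT the learned branches of the paper."""
    pyr_a = build_laplacian_pyramid(img_a, levels)
    pyr_b = build_laplacian_pyramid(img_b, levels)
    # Detail levels: keep the coefficient with the larger magnitude.
    fused = [np.where(np.abs(a) >= np.abs(b), a, b)
             for a, b in zip(pyr_a[:-1], pyr_b[:-1])]
    # Base level: simple average of the two low-resolution components.
    fused.append(0.5 * (pyr_a[-1] + pyr_b[-1]))
    return reconstruct(fused)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random((32, 32))
    b = rng.random((32, 32))
    print([level.shape for level in build_laplacian_pyramid(a, 3)])
    print(fuse(a, b).shape)
```

The base component shrinks by a factor of four in pixel count at each level, which is why applying global attention only to the base keeps the cost manageable while the detail levels are left to cheaper local operators.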

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network