Abstract: Fusing images acquired by different sensors generates a single output with enhanced information for high-level visual perception applications. The transformer architecture has demonstrated a powerful ability to capture important global contextual dependencies in multi-modal image fusion tasks. However, transformer-based image fusion methods face several critical issues, such as heavy computational burdens, a limited ability to learn local features, and difficulty handling images of arbitrary sizes. To address these limitations, we propose a novel Laplacian Pyramid Hybrid (LapH) network that combines the advantages of CNN and transformer architectures for multi-modal image fusion tasks. Following a divide-and-conquer philosophy, we first build a lightweight CNN-based branch, which performs effective extraction and fusion of texture/edge features via central difference convolutions, to process the high-resolution components with abundant details encoded in the lower levels of the Laplacian pyramid. Then, we design a transformer-based branch to process the low-resolution base components, learning long-range dependencies of global contextual features without incurring extensive computational loads. Here, we design a multi-scale recurrent modulation mechanism that integrates the edge/texture features from the CNN branch as guidance to progressively refine feature extraction and fusion on the low-frequency components. Finally, we propose a new multi-scale spatial consistency loss term based on the neighbor contrast in the source images, generating fused images with more natural and realistic appearances. Extensive experiments on two different multi-modal image fusion tasks verify the superiority of our method. The source code is publicly available at https://github.com/rgttadv/LapH.
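The divide-and-conquer split described in the abstract rests on a Laplacian pyramid, which separates an image into high-resolution detail levels and one low-resolution base. The following is a minimal NumPy sketch of that decomposition only, not the authors' implementation: the 2x2 average-pooling downsampler, the nearest-neighbor upsampler, the function names, and the choice of three levels are all assumptions made for illustration, and image sides are assumed divisible by 2 at every level.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging each 2x2 block (assumed filter)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbor repetition (assumed filter)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(img, levels):
    """Return (detail_levels, base).

    detail_levels[0] is the full-resolution detail band; base is the
    low-resolution residual left after `levels` downsampling steps.
    """
    details = []
    current = img.astype(np.float64)
    for _ in range(levels):
        smaller = downsample(current)
        # Detail band = what is lost when going down one level and back up.
        details.append(current - upsample(smaller))
        current = smaller
    return details, current

def reconstruct(details, base):
    """Invert the pyramid by upsampling and adding details, coarse to fine."""
    current = base
    for detail in reversed(details):
        current = upsample(current) + detail
    return current

# Hypothetical 64x64 single-channel input standing in for one source image.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
details, base = build_laplacian_pyramid(image, levels=3)
restored = reconstruct(details, base)
```

In a hybrid design like the one described, the detail levels would go to the CNN branch and the small base to the transformer branch, so self-attention runs on far fewer pixels (here 8x8 instead of 64x64). Because each detail band stores exactly what the downsample/upsample round trip discards, the decomposition is invertible regardless of the filters chosen.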

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

Multiscale 3-D-2-D Mixed CNN and Lightweight Attention-Free Transformer for Hyperspectral and LiDAR Classification

Combining transformers with CNN for multi-focus image fusion

HDCCT: Hybrid Densely Connected CNN and Transformer for Infrared and Visible Image Fusion

Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion

THFuse: An Infrared and Visible Image Fusion Network using Transformer and Hybrid Feature Extractor

Image Fusion Transformer

MDC-RHT: Multi-Modal Medical Image Fusion via Multi-Dimensional Dynamic Convolution and Residual Hybrid Transformer

Trans2Fuse: Empowering image fusion through self-supervised learning and multi-modal transformations via transformer networks

HDCTfusion: Hybrid Dual-Branch Network Based on CNN and Transformer for Infrared and Visible Image Fusion

Pyramid Fully Convolutional Network for Hyperspectral and Multispectral Image Fusion

A Joint Convolutional Cross ViT Network for Hyperspectral and Light Detection and Ranging Fusion Classification

FuseFormer: A Transformer for Visual and Thermal Image Fusion

MACTFusion: Lightweight Cross Transformer for Adaptive Multimodal Medical Image Fusion

Multi-modal medical image fusion based on densely-connected high-resolution CNN and hybrid transformer

HATF: Multi-Modal Feature Learning for Infrared and Visible Image Fusion via Hybrid Attention Transformer

GRPAFusion: A Gradient Residual and Pyramid Attention-Based Multiscale Network for Multimodal Image Fusion

A multimodal hyper-fusion transformer for remote sensing image classification

SFPFusion: An Improved Vision Transformer Combining Super Feature Attention and Wavelet-Guided Pooling for Infrared and Visible Images Fusion

Fusionmlp: A Mlp-Based Unified Image Fusion Framework

Multi-modal Image Fusion with the Hybrid ℓ0ℓ1 Layer Decomposing and Multi-Directional Filter Banks