Abstract: Fusing images acquired by different sensors produces a single output with enhanced information for high-level visual perception applications. The transformer architecture has proven effective at capturing the global contextual dependencies that matter for multi-modal image fusion. However, transformer-based image fusion methods face several critical issues, including a heavy computational burden, a limited ability to learn local features, and difficulty handling images of arbitrary size. To address these limitations, we propose a novel Laplacian Pyramid Hybrid (LapH) network that combines the advantages of CNN and transformer architectures for multi-modal image fusion. Following a divide-and-conquer philosophy, we first build a lightweight CNN-based branch that extracts and fuses texture/edge features via central difference convolutions; it processes the high-resolution, detail-rich components encoded in the lower levels of the Laplacian pyramid. We then design a transformer-based branch to process the low-resolution base components, learning long-range dependencies of global contextual features without incurring an extensive computational load. In this branch, a multi-scale recurrent modulation mechanism integrates the edge/texture features from the CNN branch as guidance to progressively refine feature extraction and fusion on the low-frequency components. Finally, we propose a new multi-scale spatial consistency loss term based on the neighbor contrast in the source images, which yields fused images with a more natural and realistic appearance. Extensive experiments on two different multi-modal image fusion tasks verify the superiority of our method. The source code is publicly available at https://github.com/rgttadv/LapH .
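The decomposition the abstract relies on can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NumPy, single-channel images whose sides are divisible by 2 to the power of the number of levels, 2x2 average pooling as a stand-in for Gaussian filtering, and a classical hand-crafted fusion rule (max-absolute for the high-frequency levels, averaging for the base) in place of the learned CNN and transformer branches. All function names are hypothetical.

```python
import numpy as np


def downsample(img):
    """Halve resolution by averaging each 2x2 block (stand-in for blur + decimation)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)


def build_laplacian_pyramid(img, levels):
    """Split an image into high-frequency residuals (fine to coarse) and a low-resolution base."""
    highs = []
    current = img.astype(np.float64)
    for _ in range(levels):
        smaller = downsample(current)
        # The residual keeps the edges/texture lost by downsampling.
        highs.append(current - upsample(smaller))
        current = smaller
    return highs, current


def reconstruct(highs, base):
    """Invert build_laplacian_pyramid by adding residuals back, coarse to fine."""
    current = base
    for high in reversed(highs):
        current = upsample(current) + high
    return current


def fuse(img_a, img_b, levels=2):
    """Hand-crafted fusion: the learned branches in the paper would replace both rules."""
    highs_a, base_a = build_laplacian_pyramid(img_a, levels)
    highs_b, base_b = build_laplacian_pyramid(img_b, levels)
    # High-frequency levels: keep the stronger detail response at each pixel.
    fused_highs = [
        np.where(np.abs(a) >= np.abs(b), a, b) for a, b in zip(highs_a, highs_b)
    ]
    # Low-frequency base: simple average of the two sources.
    fused_base = 0.5 * (base_a + base_b)
    return reconstruct(fused_highs, fused_base)
```

The point of the split is that the base shrinks by a factor of four in pixel count per level, so any expensive global operation applied only to the base (such as self-attention) sees far fewer positions than it would on the full-resolution input.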

Transformer-guided Feature Pyramid Network for Multi-View Stereo

Multiscale 3-D-2-D Mixed CNN and Lightweight Attention-Free Transformer for Hyperspectral and LiDAR Classification

Attention-enhanced multi-source cost volume multi-view stereo

MFE‐MVSNet: Multi‐scale feature enhancement multi‐view stereo with bi‐directional connections

Feature‐enhanced representation with transformers for multi‐view stereo

MTD-MVSNet: Multi-view Stereo Network with Multi-scale Transformer and Dual Attention

Enhanced feature pyramid for multi-view stereo with adaptive correlation cost volume

Exploring the Point Feature Relation on Point Cloud for Multi-view Stereo

MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

Multi-View Stereo Network Based on Attention Mechanism and Neural Volume Rendering

Multi-view depth estimation based on multi-feature aggregation for 3D reconstruction

Multi-View Stereo Representation Revisit: Region-Aware MVSNet

OD-MVSNet: Omni-dimensional dynamic multi-view stereo network

PA-MVSNet: Sparse-to-Dense Multi-View Stereo With Pyramid Attention

CT-MVSNet: Curvature-guided multi-view stereo with transformers

FA-MSVNet: multi-scale and multi-view feature aggregation methods for stereo 3D reconstruction

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

HC-MVSNet: A Probability Sampling-Based Multi-View-stereo Network with Hybrid Cascade Structure for 3D Reconstruction

Attention Aware Cost Volume Pyramid Based Multi-view Stereo Network for 3D Reconstruction