Abstract: Fusing images acquired by different sensors produces a single output with enhanced information for high-level visual perception applications. The transformer architecture has proven effective at capturing global contextual dependencies in multi-modal image fusion tasks. However, transformer-based image fusion methods face several critical issues: a heavy computational burden, a limited ability to learn local features, and difficulty handling images of arbitrary sizes. To address these limitations, we propose a novel Laplacian Pyramid Hybrid (LapH) network that combines the advantages of CNN and transformer architectures for multi-modal image fusion. Following a divide-and-conquer philosophy, we first build a lightweight CNN-based branch that extracts and fuses texture/edge features via central difference convolutions; it processes the high-resolution, detail-rich components encoded in the lower levels of the Laplacian pyramid. We then design a transformer-based branch to process the low-resolution base components, learning long-range dependencies of global contextual features without incurring an extensive computational load. Within this branch, a multi-scale recurrent modulation mechanism uses the edge/texture features from the CNN branch as guidance to progressively refine feature extraction and fusion on the low-frequency components. Finally, we propose a new multi-scale spatial consistency loss term based on the neighbor contrast in the source images, which yields fused images with a more natural and realistic appearance. Extensive experiments on two different multi-modal image fusion tasks verify the superiority of our method. The source code is publicly available at https://github.com/rgttadv/LapH.
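The Laplacian pyramid decomposition that the abstract builds on can be illustrated with a minimal NumPy sketch. This is not the LapH implementation: the function names are hypothetical, 2x2 block averaging and nearest-neighbour upsampling stand in for the usual Gaussian blur and interpolation, and the image sides are assumed to be divisible by 2 raised to the number of levels. It only shows how an image splits into high-resolution detail levels (the part the abstract assigns to the CNN branch) and a low-resolution base (the part assigned to the transformer branch), and that the split is invertible.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging each 2x2 block (assumed stand-in for blur + decimate)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbour repetition (assumed stand-in for interpolation)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(img, levels):
    """Return [detail_0, ..., detail_{levels-1}, base], with detail_0 the finest level."""
    pyramid = []
    current = img.astype(np.float64)
    for _ in range(levels):
        smaller = downsample(current)
        # High-frequency residual: what is lost when going one level coarser.
        pyramid.append(current - upsample(smaller))
        current = smaller
    pyramid.append(current)  # low-resolution base component
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition by upsampling the base and adding details back."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = upsample(current) + detail
    return current

# Usage on a synthetic 64x64 single-channel image.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
pyr = build_laplacian_pyramid(image, levels=3)
restored = reconstruct(pyr)
```

With three levels, the base component here has 1/64 of the pixels of the input, which is why running global attention only on the base is far cheaper than running it at full resolution.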

Deep Laparoscopic Stereo Matching with Transformers

A Transformer-Based Architecture for High-Resolution Stereo Matching

Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers

End-to-end information fusion method for transformer-based stereo matching

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

Playing to Vision Foundation Model's Strengths in Stereo Matching

Efficient Stereo Matching Using Swin Transformer and Multilevel Feature Consistency in Autonomous Mobile Systems

Real-Time Image Stitching with Transformers for Complex Traffic Environment

Deep Stereo Matching With Hysteresis Attention and Supervised Cost Volume Construction

A Hybrid 2D and 3D Convolution Neural Network for Stereo Matching

Sliding Space-Disparity Transformer for Stereo Matching

ChiTransformer: Towards Reliable Stereo from Cues

Transformer with Hybrid Attention Mechanism for Stereo Endoscopic Video Super Resolution

Steformer: Efficient Stereo Image Super-Resolution with Transformer

Semi-Dense Feature Matching with Transformers and Its Applications in Multiple-View Geometry

Deeply-fused Attentive Network for Stereo Matching

Self-Supervised Learning for Stereo Matching with Self-Improving Ability

Transformer-based stereo-aware 3D object detection from binocular images

Stereo matching of binocular laparoscopic images with improved densely connected neural architecture search

Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation

Multi-scale Alternated Attention Transformer for Generalized Stereo Matching