Abstract: Depth information can contribute to the semantic segmentation of scenes from red–green–blue (RGB) images, so RGB-depth (RGB-D) images provide significantly more information for this task than RGB images alone. However, the RGB and depth modalities represent objects differently, and effectively extracting and fusing features from both is key to scene semantic segmentation. In addition, complete segmentation requires the fusion of multiscale features to unify global information, yet existing approaches primarily integrate multiscale features sequentially. This study introduces a cross-modal and progressive feature fusion network (CMPFFNet) for semantic segmentation of indoor scenes in RGB-D images. First, a multimodal adaptive alignment fusion (MAAF) module based on an attention mechanism is introduced. This module aligns the channels of the two modalities by additive attention and then computes the spatial similarity between the two modalities using the dot product, incorporating the complementary information of the depth modality into the RGB modality. Second, a reverse attention augmentation (RAA) module is introduced; for each pair of adjacent feature levels, it augments the more abstract high-level features using the concrete semantic information of the lower-level features. After the extracted multilevel features are augmented, a multilevel feature progressive fusion (MFPF) module is deployed; this module progressively fuses each pair of neighboring features in sequence, with emphasis on spatial semantics. The network adopts Segformer, which performs strongly in multiple computer vision tasks, as its backbone to enhance segmentation capability. Experimental results on two publicly available indoor scene datasets show that the proposed CMPFFNet outperforms existing models in semantic segmentation of indoor scenes in RGB-D images.
Note to Practitioners: This study introduces a cross-modal and progressive feature fusion network (CMPFFNet) for indoor scene semantic segmentation in RGB-D images. The complementary information of the depth modality is incorporated into the RGB modality along both the channel and spatial dimensions to form a discriminative representation that is easier to segment. A multilevel feature aggregation decoder is proposed to predict the scene semantic segmentation results. The network adopts Segformer, which performs strongly in multiple computer vision tasks, as its backbone to enhance segmentation capability.
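
To make the channel-then-spatial fusion idea concrete, the following is a minimal NumPy sketch of one way such a step could look; it is not the MAAF module itself. The function name `fuse_rgb_depth`, the projection matrices `w_rgb` and `w_depth`, the vector `v`, the global average pooling, the softmax scaling, and the residual addition are all illustrative assumptions that the abstract does not specify. The sketch only mirrors the two stated ingredients: an additive-attention-style channel gate, followed by dot-product spatial similarity that carries depth information into the RGB features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_rgb_depth(rgb, depth, w_rgb, w_depth, v):
    """Hypothetical cross-modal fusion sketch (not the paper's exact module).

    rgb, depth     : feature maps of shape (C, H, W)
    w_rgb, w_depth : assumed (C, C) projection matrices
    v              : assumed (C,) scaling vector
    Returns RGB features augmented with depth information, shape (C, H, W).
    """
    c, h, w = rgb.shape

    # Channel step: additive-attention-style gate. Pooled descriptors of both
    # modalities are projected, summed, and squashed into per-channel weights.
    rgb_desc = rgb.mean(axis=(1, 2))      # (C,)
    depth_desc = depth.mean(axis=(1, 2))  # (C,)
    gate = sigmoid(v * np.tanh(w_rgb @ rgb_desc + w_depth @ depth_desc))
    depth_aligned = depth * gate[:, None, None]

    # Spatial step: dot-product similarity between every RGB position (query)
    # and every depth position (key), normalized over depth positions.
    q = rgb.reshape(c, h * w).T                    # (N, C)
    k = depth_aligned.reshape(c, h * w)            # (C, N)
    attn = softmax(q @ k / np.sqrt(c), axis=-1)    # (N, N)

    # Aggregate depth features per RGB position and add them residually.
    depth_msg = (attn @ k.T).T.reshape(c, h, w)    # (C, H, W)
    return rgb + depth_msg


# Usage with random stand-in features (4 channels, 3x5 spatial grid).
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((4, 3, 5))
depth_feat = rng.standard_normal((4, 3, 5))
w_r = rng.standard_normal((4, 4))
w_d = rng.standard_normal((4, 4))
v_vec = rng.standard_normal(4)
fused = fuse_rgb_depth(rgb_feat, depth_feat, w_r, w_d, v_vec)
```

In this sketch the residual form means that an all-zero depth map leaves the RGB features unchanged, which is one simple way to treat depth as complementary rather than dominant; a trained network would learn the projections instead of drawing them at random.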

CMGFNet: A deep cross-modal gated fusion network for building extraction from very high-resolution remote sensing images

Multi-Scale Cross-Attention Fusion Network Based on Image Super-Resolution

ACMFNet: Attention-Based Cross-Modal Fusion Network for Building Extraction of Remote Sensing Images

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

DGFNet: Dual Gate Fusion Network for Land Cover Classification in Very High-Resolution Images

MGFN: A Multi-Granularity Fusion Convolutional Neural Network for Remote Sensing Scene Classification

Local–Global Multiscale Fusion Network for Semantic Segmentation of Buildings in SAR Imagery

Building Multi-Feature Fusion Refined Network for Building Extraction from High-Resolution Remote Sensing Images

FDGSNet: A Multi-modal Gated Segmentation Network for Remote Sensing Image Based on Frequency Decomposition

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

CTMFNet: CNN and Transformer Multiscale Fusion Network of Remote Sensing Urban Scene Imagery

A Hybrid Attention-Aware Fusion Network (HAFNet) for Building Extraction from High-Resolution Imagery and LiDAR Data

B-FGC-Net: A Building Extraction Network from High Resolution Remote Sensing Imagery

Multi-Field Context Fusion Network for Semantic Segmentation of High-Spatial-Resolution Remote Sensing Images

MCFNet: Multi-scale Covariance Feature Fusion Network for Real-time Semantic Segmentation

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

CMPFFNet: Cross-Modal and Progressive Feature Fusion Network for RGB-D Indoor Scene Semantic Segmentation

EMAFF-Net: an enhanced multi-scale attentive feature fusion network for building extraction from VHR remote sensing images

CIMFNet: Cross-layer Interaction and Multiscale Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Images

CMEFusion: Cross-Modal Enhancement and Fusion of FIR and Visible Images