Abstract: Depth information can contribute to the semantic segmentation of scenes from red–green–blue (RGB) images, so RGB-depth (RGB-D) images offer significantly more information for this task than RGB images alone. However, the RGB and depth modalities differ in how they represent objects, so extracting features from both modalities and fusing them effectively is key to scene semantic segmentation. In addition, complete segmentation requires the fusion of multiscale features to unify global information, yet existing approaches primarily integrate multiscale features sequentially. This study introduces a cross-modal and progressive feature fusion network (CMPFFNet) for semantic segmentation of indoor scenes in RGB-D images. First, a multimodal adaptive alignment fusion (MAAF) module based on an attention mechanism is introduced. This module aligns the channels of the two modalities by additive attention and then computes the spatial similarity between the two modalities with a dot product, which incorporates the complementary information of the depth modality into the RGB modality. Second, a reverse attention augmentation (RAA) module is introduced: for each pair of adjacent feature levels, it augments the more abstract higher-level feature using the concrete semantic information of the lower-level feature. After the extracted multilevel features are augmented, a multilevel feature progressive fusion (MFPF) module is deployed; this module progressively fuses each pair of neighboring features, with emphasis on spatial semantics. The network uses Segformer, which performs well across multiple computer vision tasks, as its backbone to enhance segmentation capability. Experimental results on two publicly available indoor scene datasets show that the proposed CMPFFNet outperforms existing models in semantic segmentation of indoor scenes from RGB-D images.
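To make the cross-modal fusion idea more concrete, the following is a minimal NumPy sketch of a channel alignment step by additive attention followed by a dot-product spatial similarity step. It is not the authors' MAAF implementation: the function name `fuse_rgb_depth`, the weight matrices `w_rgb`, `w_depth`, and `v`, the use of global average pooling, the sigmoid gate, the scaled softmax, and the residual addition are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_rgb_depth(rgb, depth, w_rgb, w_depth, v):
    """Hypothetical fusion of two (C, H, W) feature maps.

    w_rgb, w_depth, v are (C, C) weight matrices (assumed, not from the paper).
    """
    c, h, w = rgb.shape

    # Channel descriptors via global average pooling (an assumption).
    g_rgb = rgb.mean(axis=(1, 2))      # (C,)
    g_depth = depth.mean(axis=(1, 2))  # (C,)

    # Additive attention: both descriptors are projected, summed, passed
    # through tanh, and mapped to one gate value per channel.
    gate = sigmoid(v @ np.tanh(w_rgb @ g_rgb + w_depth @ g_depth))  # (C,)
    depth_aligned = depth * gate[:, None, None]

    # Dot-product spatial similarity between every RGB position and
    # every depth position, normalized over depth positions.
    q = rgb.reshape(c, h * w).T                 # (N, C)
    k = depth_aligned.reshape(c, h * w)         # (C, N)
    attn = softmax(q @ k / np.sqrt(c), axis=-1) # (N, N)

    # Gather depth features for each RGB position and add them residually.
    values = depth_aligned.reshape(c, h * w).T  # (N, C)
    complement = (attn @ values).T.reshape(c, h, w)
    return rgb + complement
```

In this sketch the depth branch only ever adds to the RGB features, so a depth map of all zeros leaves the RGB features unchanged; a trained module would learn the weights rather than take them as arguments.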
Note to Practitioners: This study introduces a cross-modal and progressive feature fusion network (CMPFFNet) for indoor scene semantic segmentation in RGB-D images. The complementary information of the depth modality is incorporated into the RGB modality along both the channel and spatial dimensions, yielding a discriminative representation that is easier to segment. A multilevel feature aggregation decoder is proposed to predict the semantic segmentation of scenes. The network uses Segformer, which performs well across multiple computer vision tasks, as its backbone to enhance segmentation capability.
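The multilevel aggregation idea can be illustrated with a small coarse-to-fine sketch in NumPy. This is one possible reading, not the paper's RAA or MFPF design: the helper names, the nearest-neighbour upsampling, the specific reverse attention form (one minus a sigmoid of the channel-averaged coarse feature), and the assumption that all levels share the same channel count and halve in resolution are hypothetical choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def reverse_attention_augment(high, low):
    """Combine a coarse (C, H, W) feature with a finer (C, 2H, 2W) one.

    Assumed form: positions where the coarse feature responds weakly get
    a larger weight on the finer, lower-level feature.
    """
    high_up = upsample2x(high)
    response = sigmoid(high_up.mean(axis=0, keepdims=True))  # (1, 2H, 2W)
    reverse = 1.0 - response
    return high_up + reverse * low


def progressive_fuse(features):
    """Fuse neighbouring levels pairwise, from coarsest to finest.

    features: list ordered fine to coarse, each (C, H_i, W_i), with the
    spatial size halving at every level (an assumption of this sketch).
    """
    fused = features[-1]
    for low in reversed(features[:-1]):
        fused = reverse_attention_augment(fused, low)
    return fused
```

With four levels of sizes 16, 8, 4, and 2, `progressive_fuse` returns a map at the finest resolution, which a segmentation head could then classify per pixel.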

Self-Enhanced Feature Fusion for RGB-D Semantic Segmentation

NLFNet: Non-Local Fusion Towards Generalized Multimodal Semantic Segmentation Across RGB-Depth, Polarization, and Thermal Images

ACNET: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation

Deep Feature Selection-And-Fusion for RGB-D Semantic Segmentation

An RGB-D Fusion Based Semantic Segmentation Algorithm Based on Neighborhood Metric Relations

Incorporating Luminance, Depth and Color Information by a Fusion-based Network for Semantic Segmentation

Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation

Depth Cue Enhancement and Guidance Network for RGB-D Salient Object Detection

FAFNet: Fully aligned fusion network for RGBD semantic segmentation based on hierarchical semantic flows

CMPFFNet: Cross-Modal and Progressive Feature Fusion Network for RGB-D Indoor Scene Semantic Segmentation

A Depth Awareness and Learnable Feature Fusion Network for Enhanced Geometric Perception in Semantic Correspondence

Multi-type and Multi-level Feature Fusion Network for RGBD Indoor Semantic Segmentation

FEANet: Feature-Enhanced Attention Network for RGB-Thermal Real-time Semantic Segmentation

RFBNet: Deep Multimodal Networks with Residual Fusion Blocks for RGB-D Semantic Segmentation

Multi-Modal Attention-based Fusion Model for Semantic Segmentation of RGB-Depth Images

RGB×D: Learning Depth-Weighted RGB Patches for RGB-D Indoor Semantic Segmentation

Dual Attention Based Multi-scale Feature Fusion Network for Indoor RGBD Semantic Segmentation

SIESEF-FusionNet: Spatial Inter-correlation Enhancement and Spatially-Embedded Feature Fusion Network for LiDAR Point Cloud Semantic Segmentation

Discriminative feature fusion for RGB-D salient object detection

Multi-branch Differential Bidirectional Fusion Network for RGB-T Semantic Segmentation

Feature interaction and two-stage cross-modal fusion for RGB-D salient object detection