Abstract: Combining color (RGB) images with thermal images can facilitate semantic segmentation of poorly lit urban scenes. However, most existing RGB-thermal (RGB-T) semantic segmentation models address cross-modal feature fusion by exploring each sample in isolation while neglecting the connections between different samples. Additionally, although the importance of boundary, binary, and semantic information is considered in the decoding process, the differences and complementarities between different morphological features are usually neglected. In this paper, we propose a novel RGB-T semantic segmentation network, called MMSMCNet, based on modal memory fusion and morphological multiscale assistance to address these problems. In the encoding part of the network, we use SegFormer to extract features from the bimodal inputs. Next, our modal memory sharing module implements staged learning and memory sharing of sample information across modalities and scales. Furthermore, we construct a decoding union unit comprising three decoding units arranged in a layer-by-layer progression, which extracts two different morphological features according to the information category and realizes the complementary utilization of multiscale cross-modal fusion information. Each unit contains a contour positioning module based on detail information, a skeleton positioning module with deep features as the primary input, and a morphological complementary module that mutually reinforces the first two types of information and constructs semantic information. On this basis, we construct a new multi-unit-based complementary supervision strategy. Extensive experiments on two standard datasets show that MMSMCNet outperforms related state-of-the-art methods. The code is available at: https://github.com/2021nihao/MMSMCNet.

MMSMCNet: Modal Memory Sharing and Morphological Complementary Networks for RGB-T Urban Scene Semantic Segmentation