MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders

Baijiong Lin, Weisen Jiang, Pengguang Chen, Shu Liu, Ying-Cong Chen

2024-08-27

Abstract: Multi-task dense scene understanding, which trains a single model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependencies and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependencies by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM block, namely F-CTM and S-CTM, to enhance cross-task interaction from the feature and semantic perspectives, respectively. Experiments on the NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior performance of MTMamba++ over CNN-based and Transformer-based methods. The code is available at https://github.com/EnVision-Research/MTMamba.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses how to capture long-range dependencies and enhance cross-task interactions in multi-task dense scene understanding. Multi-task dense scene understanding trains one model to handle several dense prediction tasks at once, such as semantic segmentation, monocular depth estimation, surface normal estimation, and object boundary detection. Existing methods face two challenges when dealing with these tasks:

1. **Long-range dependencies**: Convolutional neural networks (CNNs) mainly capture local features and struggle to model long-range dependencies.
2. **Cross-task interactions**: In multi-task learning, the exchange of information between tasks is crucial for overall performance, so the model needs an explicit mechanism for it.

To address these issues, the paper proposes MTMamba++, a new architecture built around a Mamba-based decoder. MTMamba++ improves the modeling of long-range dependencies and cross-task interactions through two core modules, the Self-Task Mamba (STM) block and the Cross-Task Mamba (CTM) block:

- **STM block**: Uses the state-space model (SSM) mechanism to capture the global contextual information of each task.
- **CTM block**: Comes in two variants, F-CTM and S-CTM, which enhance cross-task interactions at the feature level and the semantic level, respectively.

With these components, MTMamba++ outperforms CNN-based and Transformer-based methods on the NYUDv2, PASCAL-Context, and Cityscapes datasets.
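To give some intuition for why a state-space model can carry information over long distances, here is a minimal sketch of a generic scalar linear SSM recurrence. It is not the authors' STM block and not Mamba's selective scan; the function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration. Mamba-style models additionally make the parameters depend on the input, which this sketch omits.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    The parameters a, b, c are fixed illustrative constants (an assumption
    of this sketch, not values from the paper).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # the state accumulates a decayed summary of all past inputs
        ys.append(c * h)
    return ys


# A single impulse at position 0, followed by zeros.
impulse = [1.0] + [0.0] * 9
response = ssm_scan(impulse)

# The first input still contributes c * a**t * b to the output t steps later,
# so every position can be influenced by every earlier position.
for t, y in enumerate(response):
    print(t, round(y, 4))
```

The single hidden state `h` summarizes the whole prefix of the sequence in one pass, which is why such a scan costs time linear in sequence length, whereas a convolution only sees a fixed local window per layer.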
