MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders

Baijiong Lin, Weisen Jiang, Pengguang Chen, Shu Liu, Ying-Cong Chen
2024-08-27
Abstract: Multi-task dense scene understanding, which trains a model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM block, namely F-CTM and S-CTM, to enhance cross-task interaction from feature and semantic perspectives, respectively. Experiments on the NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior performance of MTMamba++ over CNN-based and Transformer-based methods. The code is available at https://github.com/EnVision-Research/MTMamba.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to capture long-range dependencies and enhance cross-task interactions in multi-task dense scene understanding. Multi-task dense scene understanding trains a single model to handle several dense prediction tasks simultaneously, such as semantic segmentation, monocular depth estimation, surface normal estimation, and object boundary detection. Existing methods face two challenges when dealing with these tasks:

1. **Long-range dependencies**: Existing convolutional neural networks (CNNs) mainly capture local features and struggle to model long-range dependencies.
2. **Cross-task interactions**: In multi-task learning, the exchange of information between different tasks is crucial for improving overall performance.

To address these issues, the paper proposes MTMamba++, a new architecture with a Mamba-based decoder. The decoder is built from two core blocks:

- **STM block**: The Self-Task Mamba block uses a state-space model (SSM) to capture the global contextual information of each task.
- **CTM block**: The Cross-Task Mamba block comes in two variants, F-CTM and S-CTM, which enhance cross-task interactions at the feature level and the semantic level, respectively.

With these blocks, MTMamba++ outperforms CNN-based and Transformer-based methods on standard datasets (NYUDv2, PASCAL-Context, and Cityscapes).
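
To make the long-range claim about state-space models concrete, the following is a minimal sketch of a plain discrete linear state-space recurrence on a 1-D sequence. It is not the paper's STM block: the function name `ssm_scan` and the toy parameters `A`, `B`, `C` are hypothetical, and the input-dependent (selective) parameters used by Mamba-style models are omitted. It only illustrates how a single hidden state lets an early input influence every later output.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete linear state-space recurrence over a 1-D sequence.

    h_t = A @ h_{t-1} + B * x_t
    y_t = C @ h_t
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        # The state h summarizes every input seen so far.
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

# Hypothetical toy parameters, chosen for illustration only (not from the paper).
A = np.diag([0.9, 0.5])   # per-channel decay of the hidden state
B = np.array([1.0, 1.0])  # how the input enters the state
C = np.array([1.0, 1.0])  # how the state is read out

# A single impulse at the first position, zeros afterwards.
x = np.zeros(20)
x[0] = 1.0
y = ssm_scan(x, A, B, C)

# The impulse response decays but stays non-zero at the last position,
# so the first input still affects the final output.
print(y[0], y[-1])
```

With these toy values the output at step t equals 0.9^t + 0.5^t, so the contribution of the first input fades gradually instead of being cut off by a fixed receptive field, which is the property the summary contrasts with local CNN features.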