MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection

Tong Ning, Ke Lu, Xirui Jiang, Jian Xue
2024-11-20
Abstract: Utilizing temporal information to improve the performance of 3D detection has made great progress recently in the field of autonomous driving. Traditional transformer-based temporal fusion methods suffer from quadratic computational cost and information decay as the length of the frame sequence increases. In this paper, we propose a novel method called MambaDETR, whose main idea is to implement temporal fusion in the efficient state space. Moreover, we design a Motion Elimination module to remove relatively static objects from temporal fusion. On the standard nuScenes benchmark, our proposed MambaDETR achieves remarkable results in the 3D object detection task, exhibiting state-of-the-art performance among existing temporal fusion methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main problems in multi-view 3D object detection:

1. **Computational cost and information decay in traditional Transformer-based temporal fusion**:
   - The computational cost of existing Transformer-based temporal fusion methods grows quadratically with the frame-sequence length \(N\) (time complexity \(O(N^2)\)), which limits the number of frames they can process.
   - As the frame sequence grows longer, these methods also suffer from information decay: the model focuses on the current frame and neglects long-term historical information.
2. **Redundant static objects reduce the efficiency of temporal fusion**:
   - Many objects are relatively static across frames. Including them in temporal fusion adds unnecessary computational overhead and reduces the model's efficiency.

To solve these problems, the paper proposes a new method, **MambaDETR**. Its main contributions are:

- **An efficient temporal fusion method based on the state space model (SSM)**: By performing temporal fusion in the hidden space, MambaDETR can model long-range information while keeping memory and computational complexity linear.
- **A Motion Elimination module**: This module improves fusion efficiency and reduces computational cost by removing relatively static objects and retaining only moving objects for temporal fusion.

The workflow of MambaDETR is as follows:

1. **2D-priors-based query initialization**: Use a 2D detector to generate high-quality 2D proposals and convert them into 3D queries through 3D projection.
2. **Motion Elimination**: Align the objects from the previous frame using the ego-vehicle transformation, then generate motion masks from the objects' relative motion to remove static objects.
3. **Query Mamba**: Use a Structured State Space Layer to perform query-based temporal fusion, which avoids pairwise comparison and enables long-range modeling.

The experimental results show that MambaDETR performs strongly on the nuScenes benchmark for 3D object detection. Compared with existing temporal fusion methods, it achieves higher performance at lower computational complexity.
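
The Motion Elimination and Query Mamba steps described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a planar ego-motion transform, objects matched across frames by list index, a hand-picked displacement threshold, and a fixed-parameter diagonal linear state space recurrence rather than Mamba's input-dependent (selective) parameters. The function names `align_to_current`, `motion_mask`, and `ssm_scan` are hypothetical.

```python
import math


def align_to_current(center, yaw, translation):
    """Map a 2D object center from the previous ego frame into the current one.

    Assumption: ego motion is a planar rigid transform
    p_cur = R(yaw) @ p_prev + translation.
    """
    x, y = center
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * y + translation[0], s * x + c * y + translation[1])


def motion_mask(prev_centers, cur_centers, yaw, translation, threshold=0.5):
    """Return True for objects that moved more than `threshold` metres.

    Assumption: prev_centers[i] and cur_centers[i] refer to the same object.
    Objects marked False are treated as static and dropped from fusion.
    """
    mask = []
    for prev, cur in zip(prev_centers, cur_centers):
        ax, ay = align_to_current(prev, yaw, translation)
        mask.append(math.hypot(cur[0] - ax, cur[1] - ay) > threshold)
    return mask


def ssm_scan(inputs, a, b, c):
    """Run a diagonal linear state space recurrence over a frame sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = sum(c * h_t)

    Each frame updates the hidden state once, so the cost is linear in the
    sequence length, with no pairwise comparison between frames.
    """
    h = [0.0] * len(a)
    outputs = []
    for x in inputs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, h)))
    return outputs


# The ego vehicle drove 1 m forward, so static points shift by -1 m in x.
mask = motion_mask(
    prev_centers=[(10.0, 0.0), (5.0, 2.0)],
    cur_centers=[(9.0, 0.0), (6.0, 2.0)],
    yaw=0.0,
    translation=(-1.0, 0.0),
)
print(mask)  # [False, True]: only the second object moved

# One scalar feature per frame for a single query, with a 1-dimensional state.
print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))  # [1.0, 0.5, 0.25]
```

In the second example the first frame's contribution persists in the hidden state and decays geometrically, which is how a recurrence of this kind carries history forward without revisiting earlier frames.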