Abstract: Utilizing temporal information to improve the performance of 3D detection has made great progress recently in the field of autonomous driving. Traditional transformer-based temporal fusion methods suffer from quadratic computational cost and information decay as the length of the frame sequence increases. In this paper, we propose a novel method called MambaDETR, whose main idea is to implement temporal fusion in the efficient state space. Moreover, we design a Motion Elimination module to remove the relatively static objects for temporal fusion. On the standard nuScenes benchmark, our proposed MambaDETR achieves remarkable results in the 3D object detection task, exhibiting state-of-the-art performance among existing temporal fusion methods.

What problem does this paper attempt to address?

This paper addresses two main problems in multi-view 3D object detection:

1. **Computational complexity and information decay in traditional Transformer-based temporal fusion**:
   - When existing Transformer-based temporal fusion methods process long frame sequences, their computational cost grows quadratically with the sequence length (i.e., the time complexity is \(O(N^2)\)), which limits the number of frames they can handle.
   - As the frame sequence grows longer, these methods also suffer from information decay: the model focuses more on the current frame and neglects long-term historical information.
2. **The effect of redundant static objects on the efficiency of temporal fusion**:
   - Across multiple frames, many objects are relatively static. Including these static objects in temporal fusion adds unnecessary computational overhead and reduces the efficiency of the model.

To solve these problems, the paper proposes a new method, **MambaDETR**. Its main contributions are:

- **An efficient temporal fusion method based on the state space model (SSM)**: By performing temporal fusion in the hidden space, MambaDETR can model long-range information effectively while keeping memory and computational complexity linear.
- **The Motion Elimination module**: This module improves fusion efficiency and reduces computational cost by removing relatively static objects and retaining only moving objects for temporal fusion.

Specifically, the workflow of MambaDETR is as follows:

1. **2D-priors-based query initialization**: Use a 2D detector to generate high-quality 2D proposals and convert them into 3D queries through 3D projection.
2. **Motion Elimination**: Align the objects from the previous frame using the ego-vehicle transformation, then generate motion masks from the relative motion of the objects to remove static objects.
3. **Query Mamba**: Use the Structured State Space Layer to perform query-based temporal fusion without pairwise comparison, enabling long-range modeling.

The experimental results show that MambaDETR performs strongly on the nuScenes benchmark, especially in the 3D object detection task. Compared with existing temporal fusion methods, it achieves higher performance at lower computational complexity.
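
The Motion Elimination and Query Mamba steps described above can be illustrated with a minimal Python sketch. This is not the paper's implementation, and everything in it is an assumption made for illustration: object centers are reduced to 2D bird's-eye-view points, objects are assumed to be matched by index across the two frames, the 0.5 m threshold and all function names are hypothetical, and the state space layer is simplified to a fixed diagonal linear recurrence (Mamba itself uses input-dependent parameters). The point is only to show ego-motion compensation followed by a mask, and a single-pass recurrence whose cost grows linearly with the number of frames rather than quadratically.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def align_to_current(prev_xy: Point, ego_yaw: float, ego_shift: Point) -> Point:
    """Map a BEV point from the previous ego frame into the current ego frame.

    Assumption: between the two frames the ego vehicle translated by
    `ego_shift` (expressed in the previous frame) and rotated by `ego_yaw`
    radians (counter-clockwise).
    """
    dx = prev_xy[0] - ego_shift[0]
    dy = prev_xy[1] - ego_shift[1]
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    # Apply the inverse rotation so the point is expressed in the new axes.
    return (c * dx + s * dy, -s * dx + c * dy)


def motion_mask(
    prev_centers: Sequence[Point],
    curr_centers: Sequence[Point],
    ego_yaw: float,
    ego_shift: Point,
    threshold: float = 0.5,  # hypothetical value, in metres
) -> List[bool]:
    """Return True for objects that moved more than `threshold` after
    ego-motion compensation. Objects are assumed to be matched by index."""
    mask = []
    for prev, curr in zip(prev_centers, curr_centers):
        ax, ay = align_to_current(prev, ego_yaw, ego_shift)
        mask.append(math.hypot(curr[0] - ax, curr[1] - ay) > threshold)
    return mask


def diagonal_ssm_scan(
    frames: Sequence[Sequence[float]],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[List[float]]:
    """Per-channel linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over T frames: O(T) time and a constant-size state, with no
    pairwise comparison between frames.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in frames:
        state = [ai * hi + bi * xi for ai, hi, bi, xi in zip(a, state, b, x)]
        outputs.append([ci * hi for ci, hi in zip(c, state)])
    return outputs


if __name__ == "__main__":
    # Ego drives 1 m forward; the first object is parked, the second moves.
    print(motion_mask([(5.0, 0.0), (5.0, 2.0)], [(4.0, 0.0), (6.0, 2.0)],
                      ego_yaw=0.0, ego_shift=(1.0, 0.0)))
    # An impulse in the first frame decays through the hidden state.
    print(diagonal_ssm_scan([[1.0], [0.0], [0.0]], a=[0.5], b=[1.0], c=[2.0]))
```

In this sketch, only queries whose mask entry is `True` would be passed on to the recurrence, which is how removing static objects would shrink the amount of data entering temporal fusion.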

MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

Query-based Temporal Fusion with Explicit Motion for 3D Object Detection

MambaBEV: An efficient 3D detection model with Mamba2

Graph-DETR4D: Spatio-Temporal Graph Modeling for Multi-View 3D Object Detection

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

LiDAR-Based 3D Temporal Object Detection via Motion-Aware LiDAR Feature Fusion

TSC-BEV: Temporal-Spatial Feature Consistency 3D Object Detection

Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences

MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving

Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection

Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

STFormer3D: Spatio-Temporal Transformer Based 3D Object Detection for Intelligent Driving

DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention

BEVStereo: Enhancing Depth Estimation in Multi-view 3D Object Detection with Dynamic Temporal Stereo

Transformer-Based Optimized Multimodal Fusion for 3D Object Detection in Autonomous Driving

Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

Learning Temporal Cues by Predicting Objects Move for Multi-camera 3D Object Detection