Abstract: 3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantifying the physical space into a grid map. The widely adopted projection-first deformable attention, efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose our learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, including map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The code is available at https://github.com/ViewFormerOcc/ViewFormer-Occ.
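As a rough illustration of what "quantifying the physical space into a grid map" means, the sketch below maps 3D points to voxel indices and records which voxels are occupied. It is not taken from the ViewFormer code base: the function name `voxelize`, the grid origin, the 0.5 m voxel size, and the grid shape are all assumptions chosen for the example.

```python
import math

def voxelize(points, origin, voxel_size, grid_shape):
    """Return the set of (i, j, k) voxel indices that contain at least one point.

    Hypothetical helper for illustration; all parameters are assumed values.
    """
    occupied = set()
    for x, y, z in points:
        # Quantize each coordinate: offset by the grid origin, divide by the cell size.
        idx = (
            math.floor((x - origin[0]) / voxel_size),
            math.floor((y - origin[1]) / voxel_size),
            math.floor((z - origin[2]) / voxel_size),
        )
        # Discard points that fall outside the grid bounds.
        if all(0 <= i < n for i, n in zip(idx, grid_shape)):
            occupied.add(idx)
    return occupied

# Four sample points in metres; the last one lies far outside the grid.
points = [(0.1, 0.2, 0.3), (0.3, 0.1, 0.2), (1.7, -0.4, 0.9), (50.0, 0.0, 0.0)]
occupied = voxelize(points, origin=(-2.0, -2.0, 0.0), voxel_size=0.5, grid_shape=(8, 8, 4))
print(sorted(occupied))
```

The first two points fall into the same cell, so the grid records two occupied voxels rather than three. A semantic occupancy grid would additionally store a class label per voxel.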

What problem does this paper attempt to address?

The paper primarily addresses 3D spatial perception in autonomous driving scenarios, focusing on multi-view 3D occupancy perception and occupancy flow representation in dynamic scenes. Specifically, it proposes a new framework called ViewFormer to tackle the problems that existing methods encounter when converting multi-view image features to 3D space. The main issues are:

1. **Multi-view feature aggregation**: Existing projection-based methods (projection-first deformable attention) are limited by sensor deployment constraints, making it difficult to effectively aggregate features from different cameras.
2. **Dynamic scene representation**: Dynamic scenes, especially details such as the direction of object movement, need a more refined representation of dynamic changes, i.e., occupancy flow.

To address these issues, the main contributions of the paper are as follows:

1. **Proposing a new learning-first view attention mechanism** for effectively aggregating multi-view features and overcoming the limitations of fixed reference points in traditional projection methods.
2. **Introducing the ViewFormer framework**, a Transformer-based framework that combines the newly proposed view attention and streaming temporal attention to enhance spatiotemporal modeling capabilities.
3. **Creating a high-quality occupancy flow dataset, FlowOcc3D**, which is built upon the existing nuScenes and Occ3D datasets and provides fine-grained occupancy flow annotations to support research on dynamic scene representation.
4. **Demonstrating the superior performance of ViewFormer on multiple benchmarks**, achieving significant improvements compared to previous methods.

Through these contributions, the paper aims to improve how autonomous driving systems understand their surrounding environment, especially in complex dynamic scenes.
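To make the idea of occupancy flow more concrete, the following sketch attaches a velocity vector to each occupied voxel and computes which voxel it would land in after a short time step. This is only an illustration of the general concept: the data layout, the function name `advect_voxels`, the voxel size, and the time step are assumptions, not the FlowOcc3D annotation format or the paper's method.

```python
import math

def advect_voxels(flow_by_voxel, voxel_size, dt):
    """Move each occupied voxel's centre by its flow vector and re-quantize.

    flow_by_voxel maps (i, j, k) -> (vx, vy, vz) in metres per second.
    Hypothetical helper; indices are relative to the grid origin.
    """
    moved = {}
    for (i, j, k), (vx, vy, vz) in flow_by_voxel.items():
        # Centre of the voxel in metres, measured from the grid origin.
        cx, cy, cz = ((n + 0.5) * voxel_size for n in (i, j, k))
        # Displace the centre by velocity * time, then map back to a voxel index.
        moved[(i, j, k)] = (
            math.floor((cx + vx * dt) / voxel_size),
            math.floor((cy + vy * dt) / voxel_size),
            math.floor((cz + vz * dt) / voxel_size),
        )
    return moved

# One voxel moving at 2 m/s along x, and one static voxel.
flow = {(4, 4, 0): (2.0, 0.0, 0.0), (7, 3, 1): (0.0, 0.0, 0.0)}
moved = advect_voxels(flow, voxel_size=0.5, dt=0.5)
print(moved)
```

With 0.5 m voxels and a 0.5 s step, the moving voxel shifts two cells along x while the static one stays in place, which is the kind of per-voxel motion detail that a box-level velocity would not capture.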

ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers
