ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

Jinke Li, Xiao He, Chonghua Zhou, Xiaoqiang Cheng, Yang Wen, Dan Zhang
2024-07-12
Abstract: 3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantifying the physical space into a grid map. The widely adopted projection-first deformable attention, efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose a learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, including map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The code is available at https://github.com/ViewFormerOcc/ViewFormer-Occ.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses 3D spatial perception in autonomous driving scenarios, focusing on multi-view 3D occupancy perception and occupancy flow representation in dynamic scenes. Specifically, it proposes a new framework called ViewFormer to tackle the problems that existing methods encounter when converting multi-view image features to 3D space. The main issues include:

1. **Multi-view feature aggregation**: Existing projection-based methods (projection-first deformable attention) are limited by sensor deployment constraints, making it difficult to effectively aggregate features from different cameras.
2. **Dynamic scene representation**: Dynamic scenes, and in particular details such as the direction of object movement, call for a finer-grained representation of motion, i.e., occupancy flow.

To address these issues, the main contributions of the paper are as follows:

1. **Proposing a new learning-first view attention mechanism** for effectively aggregating multi-view features and overcoming the limitations of fixed reference points in traditional projection methods.
2. **Introducing the ViewFormer framework**, a Transformer-based framework that combines the newly proposed view attention with streaming temporal attention to enhance spatiotemporal modeling capabilities.
3. **Creating a high-quality occupancy flow dataset, FlowOcc3D**, built upon the existing nuScenes and Occ3D datasets and providing fine-grained occupancy flow annotations to support research on dynamic scene representation.
4. **Demonstrating the superior performance of ViewFormer on multiple benchmarks**, achieving significant improvements over previous methods.

With these contributions, the paper aims to improve how autonomous driving systems understand their surroundings, especially in complex dynamic scenes.
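
To make the two representations discussed above more concrete, the following is a minimal sketch of a grid map in which each occupied voxel also carries a flow vector. It is not taken from the ViewFormer code or the FlowOcc3D tooling: the function `voxelize_with_flow`, its parameters, and the choice to average point velocities per voxel are all assumptions made purely for illustration.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize_with_flow(
    points: List[Vec3],
    velocities: List[Vec3],
    origin: Vec3,
    voxel_size: float,
    grid_shape: Voxel,
) -> Dict[Voxel, Vec3]:
    """Hypothetical helper: map each occupied voxel to the mean velocity
    of the points that fall inside it (an illustrative notion of flow)."""
    sums: Dict[Voxel, List[float]] = {}
    counts: Dict[Voxel, int] = {}
    for point, velocity in zip(points, velocities):
        # Quantize the continuous position into integer voxel indices.
        idx = tuple(
            math.floor((p - o) / voxel_size) for p, o in zip(point, origin)
        )
        # Skip points that fall outside the grid bounds.
        if not all(0 <= i < n for i, n in zip(idx, grid_shape)):
            continue
        acc = sums.setdefault(idx, [0.0, 0.0, 0.0])
        for axis in range(3):
            acc[axis] += velocity[axis]
        counts[idx] = counts.get(idx, 0) + 1
    # Assumption: per-voxel flow is the average velocity of its points.
    return {
        idx: tuple(s / counts[idx] for s in acc) for idx, acc in sums.items()
    }


# Toy scene: two moving points share a voxel, one static point sits in
# another voxel, and one point lies outside the grid.
points = [(0.1, 0.1, 0.1), (0.3, 0.2, 0.1), (1.7, 0.2, 0.1), (9.0, 0.0, 0.0)]
velocities = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
flow_grid = voxelize_with_flow(
    points, velocities, origin=(0.0, 0.0, 0.0), voxel_size=0.5, grid_shape=(4, 4, 2)
)
```

Here the keys of `flow_grid` are the occupied voxels (the occupancy part), and the values are the per-voxel motion vectors (the flow part). Static background and moving objects share the same grid, which mirrors the abstract's point that occupancy does not distinguish foreground from background.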