InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction

Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall

2024-04-29

Abstract: This paper introduces InverseMatrixVT3D, an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods for constructing 3D volumes often rely on depth estimation, device-specific operators, or transformer queries, which hinders the widespread adoption of 3D occupancy models. In contrast, our approach leverages two projection matrices to store the static mapping relationships and matrix multiplications to efficiently generate global Bird's Eye View (BEV) features and local 3D feature volumes. Specifically, we achieve this by performing matrix multiplications between multi-view image feature maps and two sparse projection matrices. We introduce a sparse matrix handling technique for the projection matrices to optimize GPU memory usage. Moreover, a global-local attention fusion module is proposed to integrate the global BEV features with the local 3D feature volumes to obtain the final 3D volume. We also employ a multi-scale supervision mechanism to further enhance performance. Extensive experiments performed on the nuScenes and SemanticKITTI datasets reveal that our approach not only stands out for its simplicity and effectiveness but also achieves the top performance in detecting vulnerable road users (VRU), crucial for autonomous driving and road safety. The code has been made available at:

Computer Vision and Pattern Recognition, Robotics

What problem does this paper attempt to address?

This paper addresses the problem of converting multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods usually rely on depth estimation, device-specific operators, or transformer queries to construct 3D volumes, which limits the wide adoption of 3D occupancy models. The proposed method simplifies this process: two projection matrices store the static mapping relationships, and matrix multiplication efficiently generates global bird's-eye-view (BEV) features and local 3D feature volumes. Specifically, the multi-view image feature maps are multiplied by two sparse projection matrices. The authors also introduce a sparse matrix handling technique to optimize GPU memory usage and propose a global-local attention fusion module that integrates the global BEV features with the local 3D feature volumes to obtain the final 3D volume. A multi-scale supervision mechanism is adopted to further improve performance. Experiments on the nuScenes and SemanticKITTI datasets show that the method is simple and effective, and that it achieves top performance in detecting vulnerable road users (VRU), which is crucial for autonomous driving and road safety.
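
The core idea, multiplying flattened image features by a precomputed sparse projection matrix, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (build_projection, sparse_matmul), the toy voxel-to-pixel lookup, and the choice to average the pixels mapped to each voxel are all assumptions made for illustration. Only the non-zero entries of the matrix are stored, which is the general reason a sparse representation saves memory.

```python
from typing import Dict, List, Tuple

# Sparse matrix stored as {row: [(col, weight), ...]}; empty rows are omitted.
SparseRows = Dict[int, List[Tuple[int, float]]]


def build_projection(voxel_to_pixels: Dict[int, List[int]]) -> SparseRows:
    """Turn a static voxel -> pixel-index lookup into row-normalized sparse rows.

    Assumption: each voxel averages the features of the pixels it maps to.
    """
    rows: SparseRows = {}
    for voxel, pixels in voxel_to_pixels.items():
        if pixels:  # voxels seen by no pixel contribute no stored entries
            weight = 1.0 / len(pixels)
            rows[voxel] = [(p, weight) for p in pixels]
    return rows


def sparse_matmul(rows: SparseRows, feats: List[List[float]], n_rows: int) -> List[List[float]]:
    """Compute P @ F, touching only the stored non-zero entries of P."""
    channels = len(feats[0])
    out = [[0.0] * channels for _ in range(n_rows)]
    for r, entries in rows.items():
        for col, weight in entries:
            for c in range(channels):
                out[r][c] += weight * feats[col][c]
    return out


# Toy input: 4 flattened pixels (all cameras concatenated), 2 channels each.
image_feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Hypothetical static mapping for 3 voxels; voxel 2 is visible in no image.
voxel_to_pixels = {0: [0, 1], 1: [3], 2: []}

projection = build_projection(voxel_to_pixels)  # built once, reused per frame
voxel_feats = sparse_matmul(projection, image_feats, n_rows=3)
print(voxel_feats)
```

Because the mapping depends only on the camera geometry and the voxel grid, the projection can be built once and reused for every frame; a second matrix with one row per BEV cell, rather than one per voxel, would produce the global BEV features in the same way.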

ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction

3Dopformer: 3D Occupancy Perception from Multi-Camera Images with Directional and Distance Enhancement

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

OccupancyDETR: Using DETR for Mixed Dense-sparse 3D Occupancy Prediction

Fully Sparse 3D Occupancy Prediction

HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

Lightweight Spatial Embedding for Vision-based 3D Occupancy Prediction

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

AdaptiveOcc: Adaptive Octree-based Network for Multi-Camera 3D Semantic Occupancy Prediction in Autonomous Driving

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

Real-time 3D semantic occupancy prediction for autonomous vehicles using memory-efficient sparse convolution

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation

OccRWKV: Rethinking Efficient 3D Semantic Occupancy Prediction with Linear Complexity