Learning Accurate Monocular 3D Voxel Representation via Bilateral Voxel Transformer

Tianheng Cheng, Haoyi Jiang, Shaoyu Chen, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
DOI: https://doi.org/10.1016/j.imavis.2024.105237
IF: 3.86
2024-01-01
Image and Vision Computing
Abstract: Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images remains challenging because of the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM) to project image features to dense 3D voxels suffer severely from the ambiguous projection problem. In this research, we present Bilateral Voxel Transformer (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT uses a bilateral architecture whose two branches preserve the high-resolution 3D voxel representation while simultaneously aggregating context through the proposed Tri-Axial Transformer. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging SemanticKITTI dataset, considerably outperforming previous works on semantic scene completion with monocular images. The code and models of BVT will be available on GitHub.
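The ambiguous projection problem mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not BVT's implementation. It assumes a plain pinhole camera and nearest-pixel feature lookup, and the intrinsics `K`, the feature map `feat`, and the helper names `project_voxels` and `sample_nearest` are all hypothetical. Under those assumptions, voxel centers lying on the same camera ray project to the same pixel and therefore receive identical image features regardless of depth.

```python
import numpy as np

def project_voxels(centers, K):
    """Project 3D points in the camera frame, shape (N, 3), to pixels, shape (N, 2)."""
    uvw = centers @ K.T                 # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide by depth

def sample_nearest(feat, uv):
    """Look up an (H, W, C) feature map at the nearest pixel to each (u, v)."""
    h, w, _ = feat.shape
    u = np.clip(np.rint(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.rint(uv[:, 1]).astype(int), 0, h - 1)
    return feat[v, u]

# Hypothetical intrinsics and a random feature map (assumptions, not from the paper).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 16.0],
              [0.0, 0.0, 1.0]])
feat = np.random.default_rng(0).normal(size=(32, 64, 8))

# Three voxel centers on one camera ray, at depths of 5, 10 and 20 units.
direction = np.array([0.1, -0.05, 1.0])
depths = np.array([5.0, 10.0, 20.0])
centers = depths[:, None] * direction

uv = project_voxels(centers, K)         # all three rows land on the same pixel
voxel_feat = sample_nearest(feat, uv)   # so all three voxels get the same feature
```

Because depth cancels out in the perspective divide, the sampled features alone cannot tell these three voxels apart. The abstract's position-aware voxel queries and weighted geometry-aware sampling are aimed at this kind of ambiguity.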