Abstract: Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared, context-independent queries across all input images. Because the focal regions of different inputs vary, such queries fail to capture the distinctions among images and may lead to undirected feature aggregation in cross-attention. Additionally, without depth information, points projected onto the image plane may share the same 2D position or fall on similar sampling points in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It uses a context aware query generator to initialize context-dependent queries tailored to each input image, capturing the image's unique characteristics and aggregating information within the region of interest. Furthermore, it extends deformable cross-attention from 2D to 3D pixel space, so that points with similar image coordinates can be differentiated by their depth coordinates. Building upon this module, we introduce a neural network named CGFormer to achieve semantic scene completion. In addition, CGFormer leverages multiple 3D representations (i.e., voxel and TPV) to boost the semantic and geometric representation abilities of the transformed 3D volume from both local and global perspectives. Experimental results demonstrate that CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining mIoU of 16.87 and 20.05 and IoU of 45.99 and 48.07, respectively. Remarkably, CGFormer even outperforms approaches that use temporal images as inputs or much larger image backbone networks.
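The depth ambiguity described above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera; the intrinsics `FX`, `FY`, `CX`, `CY` and the helper `project` are hypothetical values and names chosen for illustration, not taken from CGFormer. Two points on the same camera ray project to the same pixel, and only the depth coordinate separates them, which is the information a 3D reference point of the form (u, v, d) retains.

```python
# Illustrative pinhole camera; the intrinsics below are made-up values.
FX, FY = 700.0, 700.0   # focal lengths in pixels (assumed)
CX, CY = 620.0, 190.0   # principal point in pixels (assumed)


def project(point):
    """Project a 3D camera-frame point (x, y, z) to (u, v, depth)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    u = FX * x / z + CX
    v = FY * y / z + CY
    return u, v, z


# Two voxel centres on the same camera ray: the second is the first scaled by 3.
near = (1.0, 0.5, 5.0)
far = tuple(3.0 * c for c in near)

u_near, v_near, d_near = project(near)
u_far, v_far, d_far = project(far)

# Identical 2D location, so a 2D-only feature lookup cannot separate the points.
same_pixel = abs(u_near - u_far) < 1e-9 and abs(v_near - v_far) < 1e-9
# The depth coordinate differs; a 3D (u, v, d) reference point keeps this cue.
different_depth = abs(d_near - d_far) > 1e-9

print(same_pixel, different_depth)
```

In this toy setting a 2D sampler would read the same feature for both points, whereas a sampler indexed by depth as well could assign them different features.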

Learning Accurate Monocular 3D Voxel Representation via Bilateral Voxel Transformer

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Context and Geometry Aware Voxel Transformer for Semantic Scene Completion

RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

UniVision: A Unified Framework for Vision-Centric 3D Perception

Transformer-Based Self-Supervised Monocular Depth and Visual Odometry

CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

Instance-Aware Monocular 3D Semantic Scene Completion

Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection

MonoOcc: Digging into Monocular Semantic Occupancy Prediction

Large-Scale 3D Semantic Mapping Using Monocular Vision

Few-shot Semantic Learning for Robust Multi-Biome 3D Semantic Mapping in Off-Road Environments

Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?

InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction

VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation

Trans4Map: Revisiting Holistic Bird's-Eye-View Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers