Abstract: Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared, context-independent queries across all input images. Such queries fail to capture the distinctions among inputs, whose focal regions vary, and may result in undirected feature aggregation in cross-attention. Additionally, without depth information, points projected onto the image plane may share the same 2D position or have similar sampling points in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It utilizes a context aware query generator to initialize context-dependent queries tailored to individual input images, effectively capturing their unique characteristics and aggregating information within the region of interest. Furthermore, it extends deformable cross-attention from 2D to 3D pixel space, enabling the differentiation of points with similar image coordinates based on their depth coordinates. Building upon this module, we introduce a neural network named CGFormer to achieve semantic scene completion. In addition, CGFormer leverages multiple 3D representations (i.e., voxel and TPV) to boost the semantic and geometric representation abilities of the transformed 3D volume from both local and global perspectives. Experimental results demonstrate that CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining mIoU scores of 16.87 and 20.05 and IoU scores of 45.99 and 48.07, respectively. Remarkably, CGFormer even outperforms approaches that employ temporal images as inputs or much larger image backbone networks.
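
The depth ambiguity mentioned in the abstract can be illustrated with a toy example. The following is a minimal sketch, not CGFormer's implementation: the pinhole intrinsics `K`, the uniform depth binning in `depth_bin`, the feature shapes, and the nearest-neighbour lookup are all assumptions made for illustration, and the learned sampling offsets of deformable attention are omitted. It shows two 3D points on the same camera ray landing on the same pixel, so a 2D feature lookup returns identical features for both, while a lookup in a (depth, v, u) volume separates them.

```python
import numpy as np

def project(points, K):
    """Project Nx3 camera-frame points to pixel coordinates (u, v) and depth."""
    uvw = points @ K.T
    depth = uvw[:, 2]
    uv = uvw[:, :2] / depth[:, None]
    return uv, depth

def depth_bin(depth, d_min=0.0, bin_size=0.5, n_bins=64):
    """Uniform depth discretisation (an assumed, simplified scheme)."""
    idx = np.floor((depth - d_min) / bin_size).astype(int)
    return np.clip(idx, 0, n_bins - 1)

# Hypothetical intrinsics for a 64x48 feature map.
K = np.array([[50.0, 0.0, 32.0],
              [0.0, 50.0, 24.0],
              [0.0, 0.0, 1.0]])

# Two points on the same camera ray: same pixel, different depth.
near = np.array([1.0, 0.5, 5.0])
far = near * 4.0
uv, depth = project(np.stack([near, far]), K)
bins = depth_bin(depth)

rng = np.random.default_rng(0)
H, W, D, C = 48, 64, 64, 4
feat2d = rng.standard_normal((H, W, C))      # image features indexed by (v, u)
feat3d = rng.standard_normal((D, H, W, C))   # features indexed by (depth bin, v, u)

# Nearest-neighbour lookup stands in for learned deformable sampling.
u, v = np.rint(uv).astype(int).T
sampled_2d = feat2d[v, u]          # identical rows: the 2D lookup cannot tell the points apart
sampled_3d = feat3d[bins, v, u]    # distinct rows: the depth index disambiguates them

print("pixels:", uv.tolist(), "depth bins:", bins.tolist())
print("2D features equal:", np.array_equal(sampled_2d[0], sampled_2d[1]))
print("3D features equal:", np.array_equal(sampled_3d[0], sampled_3d[1]))
```

In this sketch the depth coordinate acts only as an extra index into the feature volume; how the depth dimension is actually constructed and sampled in CGFormer is described in the paper itself.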

LiDAR-Camera Continuous Fusion in Voxelized Grid for Semantic Scene Completion

Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion

Attention-based Multi-modal Fusion Network for Semantic Scene Completion

Voxel- and Bird's-Eye-View-Based Semantic Scene Completion for LiDAR Point Clouds

PVI-Net: Point-Voxel-Image Fusion for Semantic Segmentation of Point Clouds in Large-Scale Autonomous Driving Scenarios

Geometry-semantic Aware for Monocular 3D Semantic Scene Completion

DepthSSC: Depth-Spatial Alignment and Dynamic Voxel Resolution for Monocular 3D Semantic Scene Completion

SLCF-Net: Sequential LiDAR-Camera Fusion for Semantic Scene Completion using a 3D Recurrent U-Net

CasFusionNet: A Cascaded Network for Point Cloud Semantic Scene Completion by Dense Feature Fusion

SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion

Semantic Segmentation-assisted Scene Completion for LiDAR Point Clouds

PointSSC: A Cooperative Vehicle-Infrastructure Point Cloud Benchmark for Semantic Scene Completion

2D Semantic-Guided Semantic Scene Completion

LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment

Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network

Not All Voxels Are Equal: Semantic Scene Completion from the Point-Voxel Perspective

Multi-modal Fusion Architecture Search for Camera-Based Semantic Scene Completion

Context and Geometry Aware Voxel Transformer for Semantic Scene Completion

LiDAR-Based Real-Time Panoptic Segmentation via Spatiotemporal Sequential Data Fusion

PV-SSD: A Multi-Modal Point Cloud Feature Fusion Method for Projection Features and Variable Receptive Field Voxel Features