Abstract: Because bird's-eye-view (BEV) semantic segmentation is simple to visualize and easy to handle, it has been applied in autonomous driving to provide surrounding information to downstream tasks. Inferring BEV semantic segmentation from multi-camera-view images is a popular scheme in the community because it relies on cheap devices and supports real-time processing. Recent work implements this task by learning content and position relationships via a Vision Transformer (ViT). However, its quadratic complexity confines the relationship learning to the latent layer, leaving a scale gap that impedes the representation of fine-grained objects. Moreover, in terms of information absorption, when representing position-related BEV features, the weighted fusion of all view features introduces inconducive features that disturb the fusion of conducive ones. To tackle these issues, we propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for semantic segmentation inference. Specifically, we devise a hierarchical framework to refine the BEV feature representation, where the last feature size is only half the size of the final segmentation. To offset the increase in computation caused by this hierarchical framework, we exploit the cross-scale Transformer to learn feature relationships in a reversed-aligning way, and leverage the residual connection of BEV features to facilitate information transmission between scales. We propose correspondence-augmented attention to distinguish conducive and inconducive correspondences. It is implemented in a simple yet effective way, amplifying attention scores before the Softmax operation, so that position-view-related attention scores are highlighted and position-view-disrelated ones are suppressed. Extensive experiments demonstrate that our method achieves state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
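The effect of amplifying attention scores before the Softmax can be illustrated with a minimal sketch. The abstract does not specify the form of the amplification, so the scalar multiplier `gain`, the function names, and the example scores below are all hypothetical and are not the paper's actual formulation. The sketch only shows the general principle that scaling the scores before normalization raises the weight of high-scoring (related) entries and lowers the weight of low-scoring (disrelated) ones.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def amplified_attention(scores, gain=2.0):
    """Scale raw attention scores by `gain` before the softmax.

    `gain` is an assumed, illustrative amplification factor; the
    paper's actual amplification rule is not given in the abstract.
    """
    return softmax([gain * s for s in scores])


# Hypothetical raw scores of one BEV position against four view features:
# positive values stand for related views, negative for disrelated ones.
scores = [1.0, 0.5, -0.5, -1.0]

plain = softmax(scores)
sharp = amplified_attention(scores, gain=3.0)

print("plain:", [round(p, 3) for p in plain])
print("sharp:", [round(p, 3) for p in sharp])
```

With a gain above 1, the largest score takes a larger share of the attention weight and the smallest score takes a smaller share, while the weights still sum to one.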

CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

Context and Geometry Aware Voxel Transformer for Semantic Scene Completion

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion

Structure-Aware Cross-Modal Transformer for Depth Completion

CASSC: Context‐aware Method for Depth Guided Semantic Scene Completion

2D Semantic-Guided Semantic Scene Completion

VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

Resolution-switchable 3D Semantic Scene Completion

Semantic Scene Completion with Cleaner Self

CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection

CTVSR: Collaborative Spatial-Temporal Transformer for Video Super-Resolution

Multi-view 3D Reconstruction with Transformer

Not All Voxels Are Equal: Semantic Scene Completion from the Point-Voxel Perspective

DepthSSC: Depth-Spatial Alignment and Dynamic Voxel Resolution for Monocular 3D Semantic Scene Completion

3D Sketch-aware Semantic Scene Completion Via Semi-supervised Structure Prior

A Cross-Scale Hierarchical Transformer With Correspondence-Augmented Attention for Inferring Bird’s-Eye-View Semantic Segmentation

Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization