Abstract: Because bird's-eye-view (BEV) semantic segmentation is simple to visualize and easy to handle, it has been applied in autonomous driving to provide surrounding information to downstream tasks. Inferring BEV semantic segmentation conditioned on multi-camera-view images is a popular scheme in the community because it relies on cheap devices and supports real-time processing. Recent work has addressed this task by learning content and position relationships with a Vision Transformer (ViT). However, its quadratic complexity confines relationship learning to the latent layer, leaving a scale gap that impedes the representation of fine-grained objects. Moreover, in terms of information absorption, when position-related BEV features are represented, the weighted fusion of all view features brings in inconducive features that disturb the fusion of conducive ones. To tackle these issues, we propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for semantic segmentation inference. Specifically, we devise a hierarchical framework to refine the BEV feature representation, where the last feature size is only half that of the final segmentation. To offset the additional computation introduced by this hierarchical framework, we exploit the cross-scale Transformer to learn feature relationships in a reversed-aligning way, and leverage the residual connection of BEV features to facilitate information transmission between scales. We further propose correspondence-augmented attention to distinguish conducive from inconducive correspondences. It is implemented in a simple yet effective way: attention scores are amplified before the Softmax operation, so that position-view-related scores are highlighted and position-view-disrelated scores are suppressed. Extensive experiments demonstrate that our method achieves state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
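
The idea of amplifying attention scores before the Softmax can be illustrated with a minimal sketch. The abstract does not give the exact form of the amplification, so the scalar multiplier `gain`, the function name `amplified_attention`, and the toy scores below are all assumptions made for illustration, not the authors' implementation. The sketch only shows the general effect: scaling the scores before normalization raises the weight of the strongest (related) correspondences and lowers the weight of the weak (disrelated) ones.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def amplified_attention(scores, gain=2.0):
    """Hypothetical amplification: multiply raw scores by gain > 1
    before the softmax. The scalar form is an assumption; the abstract
    only states that scores are amplified before Softmax."""
    return softmax([gain * s for s in scores])

# Toy attention scores for one BEV query against four view features
# (illustrative values, not taken from the paper).
scores = [2.0, 0.5, -1.0, -0.5]

plain = softmax(scores)                        # ordinary attention weights
sharp = amplified_attention(scores, gain=2.0)  # amplified before softmax

print("plain:", [round(w, 3) for w in plain])
print("sharp:", [round(w, 3) for w in sharp])
```

In this toy case the weight of the highest-scoring view grows after amplification, while the weights of the negative-scoring views shrink, which matches the highlight-and-suppress behaviour described in the abstract.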

A Bio-Inspired Visual Perception Transformer for Cross-Domain Semantic Segmentation of High-Resolution Remote Sensing Images

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

Representation Separation for Semantic Segmentation with Vision Transformers

A Cross-Scale Hierarchical Transformer With Correspondence-Augmented Attention for Inferring Bird’s-Eye-View Semantic Segmentation

Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Efficient Transformer for Remote Sensing Image Segmentation

Vision Transformers: From Semantic Segmentation to Dense Prediction

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Class-Guided Swin Transformer for Semantic Segmentation of Remote Sensing Imagery

Semantic segmentation using cross-stage feature reweighting and efficient self-attention

A transformer-based approach empowered by a self-attention technique for semantic segmentation in remote sensing

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Vision Transformer with Sparse Scan Prior

General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation

Glance-and-Gaze Vision Transformer