Abstract: Bird's-eye-view (BEV) semantic maps have become a crucial element of urban intelligent traffic management and monitoring, offering visual and data representations that support informed decision making in intelligent cities. Nevertheless, current methods continue to underutilize the temporal information embedded in dynamic frames during the BEV feature transformation. This limitation reduces accuracy when mapping high-speed moving objects, particularly in capturing their shape and dynamic trajectory. To address this challenge, a framework for cross-view semantic segmentation is proposed, using simulated environments as a starting point before applying it to real-life urban intelligent transportation scenarios. The view converter module is designed to collate information from multiple initial-view observations captured from various angles and modes. To preserve useful temporal information during the BEV transformation, this module outputs a top-down-view semantic map characterized by the spatial layout of objects. A novel application is also devised that uses transformer networks to map images and video sequences into top-down, or bird's-eye, views. The approach combines physics-based and constraint-based formulations and is substantiated by ablation studies, which highlight the significance of context above and below a given point in generating these maps. The method is evaluated on the nuScenes dataset, where it yields state-of-the-art instantaneous mapping results, with particular benefits for smaller dynamic categories. The experimental findings include a comparison of axial attention with the state-of-the-art (SOTA) model, demonstrating the performance enhancement associated with temporal awareness.
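The abstract's remark about "context above and below a given point" can be illustrated with attention restricted to one image axis. The following is a minimal NumPy sketch of vertical (column-wise) axial attention, not the authors' implementation: the function name `vertical_axial_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the toy feature-map size are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vertical_axial_attention(feat, w_q, w_k, w_v):
    """Hypothetical helper: self-attention along the height axis only.

    feat: (H, W, C) image feature map.
    Each pixel attends to the pixels above and below it in the same
    column, and to nothing in other columns.
    """
    q, k, v = feat @ w_q, feat @ w_k, feat @ w_v           # each (H, W, D)
    # Treat every column as an independent sequence of length H.
    q, k, v = (np.transpose(t, (1, 0, 2)) for t in (q, k, v))  # (W, H, D)
    scores = q @ np.transpose(k, (0, 2, 1)) / np.sqrt(q.shape[-1])  # (W, H, H)
    weights = softmax(scores, axis=-1)
    out = weights @ v                                       # (W, H, D)
    return np.transpose(out, (1, 0, 2)), weights            # back to (H, W, D)

# Toy input (assumed sizes): 4 rows, 5 columns, 8 channels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 5, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, weights = vertical_axial_attention(feat, w_q, w_k, w_v)
```

Because the attention matrix for each column has shape (H, H) rather than (H·W, H·W), the cost grows with the column height instead of the full image size, which is the usual motivation for axial attention.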
