Abstract: Object detection and semantic segmentation are two fundamental techniques for Intelligent Vehicles (IV) and Advanced Driving Assistance Systems (ADAS). Motivated by recent studies demonstrating that object detection and semantic segmentation are highly correlated tasks, this paper addresses the problem of joint object detection and semantic segmentation in traffic scenes. Existing methods perform the two tasks jointly by sharing the same backbone network, but they typically ignore the connection between the separate detection and segmentation branches that follow the backbone, so the two branches interact too little. To address this, this paper proposes a joint object detection and semantic segmentation model with cross-attention and inner-attention mechanisms. The cross-attention mechanism builds the essential interaction between the detection branch and the segmentation branch so that their correlation is fully exploited. In addition, the inner-attention mechanism strengthens the representations of the feature maps in the model. Given an image, an encoder-decoder network is first used to extract initial feature maps. Then, the inner-attention mechanism is applied to strengthen the initial feature maps and obtain segmentation feature maps. Subsequently, the cross-attention mechanism uses the segmentation feature maps to guide the generation of object detection feature maps. Finally, semantic segmentation is performed on the segmentation feature maps and object detection is performed on the detection feature maps. In the experiments, two well-known public traffic datasets are used to evaluate the model. The model achieves the highest performance in comparison with several recently proposed methods. In addition, ablation studies are conducted to evaluate the proposed inner-attention and cross-attention mechanisms, and the experimental results validate their effectiveness.
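The data flow described in the abstract (initial feature maps, then inner-attention to obtain segmentation features, then cross-attention to obtain detection features) can be illustrated with a toy sketch. The abstract does not specify the form of either attention mechanism, so everything below is an assumption made for illustration: feature maps are flattened to lists of per-position vectors, both mechanisms are modelled as residual scaled dot-product attention, and the names `attend`, `inner_attention`, `cross_attention`, and `joint_features` are hypothetical. This is not the authors' implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention over flattened spatial positions.
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def add(a, b):
    # Element-wise sum of two feature maps (residual connection).
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inner_attention(features):
    # ASSUMPTION: modelled as self-attention; the features attend to themselves.
    return add(features, attend(features, features, features))

def cross_attention(initial, seg_features):
    # ASSUMPTION: initial features act as queries, while the segmentation
    # features act as keys and values, so segmentation guides detection.
    return add(initial, attend(initial, seg_features, seg_features))

def joint_features(initial):
    # Order of operations as stated in the abstract.
    seg = inner_attention(initial)        # segmentation feature maps
    det = cross_attention(initial, seg)   # detection feature maps
    return seg, det

# Toy "initial feature map": 3 spatial positions, 2 channels each.
initial = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seg_feats, det_feats = joint_features(initial)
```

In this sketch the segmentation head would consume `seg_feats` and the detection head `det_feats`; the encoder-decoder that produces `initial` and both task heads are left out.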

Two-Stage Merging Network for Describing Traffic Scenes in Intelligent Vehicle Driving System

A Fusion Method Aiming at Environmental Perception of Autonomous Vehicle Based on Visual Scheme

Automatic Traffic Real-Time Analysis System Based on Video

Spatiotemporal Analysis of Static and Dynamic Traffic Elements from Road Scenes.

AVTMNet: Adaptive Visual-Text Merging Network for Image Captioning

The Traffic Scene Understanding and Prediction Based on Image Captioning

A system of vision sensor based deep neural networks for complex driving scene analysis in support of crash risk assessment and prevention

A Joint Object Detection and Semantic Segmentation Model with Cross-Attention and Inner-Attention Mechanisms

A Semantic Communication Approach for Multiscene Target Detection in Intelligent Vehicle Networks

Multi-Dimensional Traffic Congestion Detection Based on Fusion of Visual Features and Convolutional Neural Network

Remote sensing traffic scene retrieval based on learning control algorithm for robot multimodal sensing information fusion and human-machine interaction and collaboration

A Scene Understanding Network Based on Driving Scene

A Unified Spatio-Temporal Description Model of Environment for Intelligent Vehicles

Spatiotemporal Feature Enhancement Aids the Driving Intention Inference of Intelligent Vehicles

Traffic Light Recognition for Complex Scene with Fusion Detections.

Driving Behavior Recognition Algorithm Combining Attention Mechanism and Lightweight Network

Toward Effective Traffic Sign Detection via Two-Stage Fusion Neural Networks

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

MFCANet: A Road Scene Segmentation Network Based on Multi-Scale Feature Fusion and Context Information Aggregation

Evaluation of Connected Vehicle Identification-Aware Mixed Traffic Freeway Cooperative Merging

Context-Aware Attention Encoder-Decoder Network for Connected Heavy-Duty Vehicle Aggressive Driving Identification under Naturalistic Driving Conditions