Abstract: Object detection and semantic segmentation are two fundamental techniques for Intelligent Vehicles (IV) and Advanced Driver Assistance Systems (ADAS). Motivated by recent studies showing that object detection and semantic segmentation are highly correlated tasks, this paper addresses joint object detection and semantic segmentation in traffic scenes. Existing methods perform the two tasks jointly by sharing a backbone network, but they ignore the interaction between the separate detection and segmentation branches built on top of it, so the two branches exchange too little information. To address this, this paper proposes a joint object detection and semantic segmentation model with cross-attention and inner-attention mechanisms. The cross-attention mechanism establishes an explicit interaction between the detection and segmentation branches so that the model can fully exploit their correlation, while the inner-attention mechanism strengthens the representations of the feature maps within the model. Given an image, an encoder-decoder network first extracts initial feature maps. The inner-attention mechanism is then applied to these initial feature maps to obtain segmentation feature maps. Next, the cross-attention mechanism uses the segmentation feature maps to guide the generation of object detection feature maps. Finally, semantic segmentation is performed on the segmentation feature maps and object detection on the detection feature maps. The model is evaluated on two well-known public traffic datasets, where it achieves the highest performance among several recently proposed methods. Ablation studies of the inner-attention and cross-attention mechanisms further confirm their effectiveness.
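
The data flow described above (initial maps, then inner-attention to get segmentation maps, then cross-attention to get detection maps, then two heads) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the attention formulation, so both mechanisms are assumed here to be non-local-style scaled dot-product attention over spatial positions with a residual connection; the detection branch is assumed to take its queries from the initial feature maps and its keys and values from the segmentation feature maps; both heads are reduced to hypothetical 1x1 convolutions; and a random array stands in for the encoder-decoder output. All function names, parameter names, and shapes are illustrative.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q_feat, kv_feat, wq, wk, wv):
    """Scaled dot-product attention over spatial positions (assumed form).

    q_feat: query map (C, H, W); kv_feat: key/value map (C, H2, W2).
    wq, wk: (C, d) projections; wv: (C, C) projection.
    Returns a (C, H, W) map: every query position is a weighted sum of
    projected values taken from all positions of kv_feat.
    """
    c, h, w = q_feat.shape
    q = q_feat.reshape(c, -1).T @ wq          # (H*W, d)
    k = kv_feat.reshape(c, -1).T @ wk         # (H2*W2, d)
    v = kv_feat.reshape(c, -1).T @ wv         # (H2*W2, C)
    weights = softmax(q @ k.T / np.sqrt(wq.shape[1]), axis=-1)
    return (weights @ v).T.reshape(c, h, w)


def inner_attention(feat, params):
    """Hypothetical inner-attention: self-attention plus a residual."""
    return feat + attention(feat, feat, *params)


def cross_attention(query_feat, seg_feat, params):
    """Hypothetical cross-attention: queries come from query_feat,
    keys/values from the segmentation map, plus a residual."""
    return query_feat + attention(query_feat, seg_feat, *params)


def conv1x1(feat, weight):
    """1x1 convolution with weight of shape (C_out, C_in)."""
    c, h, w = feat.shape
    return (weight @ feat.reshape(c, -1)).reshape(-1, h, w)


def joint_forward(initial_feat, p):
    """Initial maps -> segmentation maps -> detection maps -> two heads."""
    seg_feat = inner_attention(initial_feat, p["inner"])
    det_feat = cross_attention(initial_feat, seg_feat, p["cross"])
    seg_logits = conv1x1(seg_feat, p["seg_head"])  # per-pixel class scores
    det_out = conv1x1(det_feat, p["det_head"])     # per-location detection outputs
    return seg_logits, det_out


def make_params(rng, c, d, num_classes, det_channels):
    """Random (untrained) weights, for shape-checking only."""
    def proj():
        return (rng.standard_normal((c, d)) * 0.1,
                rng.standard_normal((c, d)) * 0.1,
                rng.standard_normal((c, c)) * 0.1)
    return {
        "inner": proj(),
        "cross": proj(),
        "seg_head": rng.standard_normal((num_classes, c)) * 0.1,
        "det_head": rng.standard_normal((det_channels, c)) * 0.1,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((16, 8, 10))  # stand-in for encoder-decoder output
    # det_channels=5 is hypothetical, e.g. 1 objectness score + 4 box offsets
    params = make_params(rng, c=16, d=8, num_classes=5, det_channels=5)
    seg_logits, det_out = joint_forward(feat, params)
    print(seg_logits.shape, det_out.shape)  # (5, 8, 10) (5, 8, 10)
```

In this sketch every detection position attends over all segmentation positions, so each detection feature becomes a weighted summary of segmentation evidence; that is one plausible reading of segmentation "guiding" detection. A trained model would learn these projections and use full detection heads (anchors or keypoints, box regression, non-maximum suppression) rather than a single 1x1 projection.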

A Joint Object Detection and Semantic Segmentation Model with Cross-Attention and Inner-Attention Mechanisms