Two-Stage Merging Network for Describing Traffic Scenes in Intelligent Vehicle Driving System

Heng Song, Junwu Zhu, Yi Jiang
DOI: https://doi.org/10.1109/tits.2021.3083656
IF: 8.5
2022-01-01
IEEE Transactions on Intelligent Transportation Systems
Abstract: Intelligent vehicle driving systems aim to control a vehicle's driving behavior in real time, without human intervention, by perceiving and monitoring the surrounding environment. Automatically describing images of traffic scenes is one of the key problems of intelligent vehicle driving technology and has drawn attention since its inception. In recent years, a variety of automatic image description techniques have been proposed, among which the attention-based encoder-decoder framework has achieved good results. In this paper, we discuss fusing information from multiple aspects of traffic scene images. First, we introduce visual attention, text attention and image topics attention, which generate the weighted visual features, the attentive text information and the global image topics information, respectively. We then propose an adaptive two-stage merging network based on an encoder-decoder framework, which fully integrates the three kinds of information in two stages while automatically calculating the proportions of the information at each time step. Numerous experiments conducted on the COCO2014 and Flickr30K datasets demonstrate the effectiveness and advantages of the proposed method.
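To make the idea of adaptive merging concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. The abstract does not say how the two stages are wired or how the proportions are computed, so everything here is an assumption for illustration: the function names `adaptive_merge` and `two_stage_merge`, the scoring matrices `w1` and `w2`, the softmax gating, and the choice to merge visual and text features first and the image topics second. Random vectors stand in for the outputs of the three attention modules.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def adaptive_merge(hidden, sources, w):
    """Blend k source vectors using proportions that depend on the decoder state.

    hidden:  decoder hidden state at the current time step, shape (d,)
    sources: list of k feature vectors, each of shape (d,)
    w:       hypothetical scoring matrix, shape (d, d)
    Returns the merged vector of shape (d,) and the k proportions.
    """
    stacked = np.stack(sources)        # (k, d)
    scores = stacked @ w @ hidden      # one relevance score per source, (k,)
    proportions = softmax(scores)      # non-negative, sums to 1
    return proportions @ stacked, proportions


def two_stage_merge(hidden, visual, text, topics, w1, w2):
    # Stage 1 (assumed ordering): merge weighted visual features with attentive text.
    local, p1 = adaptive_merge(hidden, [visual, text], w1)
    # Stage 2 (assumed ordering): merge the stage-1 result with global image topics.
    merged, p2 = adaptive_merge(hidden, [local, topics], w2)
    return merged, p1, p2


# Placeholder inputs for a single decoding time step.
rng = np.random.default_rng(0)
d = 8
hidden, visual, text, topics = (rng.standard_normal(d) for _ in range(4))
w1 = rng.standard_normal((d, d))
w2 = rng.standard_normal((d, d))

merged, p1, p2 = two_stage_merge(hidden, visual, text, topics, w1, w2)
```

In a captioning decoder, a function like `two_stage_merge` would be called once per generated word, so the proportions `p1` and `p2` can shift as the hidden state changes from one time step to the next.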