Scene captioning with deep fusion of images and point clouds

Qiang Yu, Chunxia Zhang, Lubin Weng, Shiming Xiang, Chunhong Pan
DOI: https://doi.org/10.1016/j.patrec.2022.04.017
IF: 4.757
2022-06-01
Pattern Recognition Letters
Abstract: The fusion of images and point clouds has recently received considerable attention in fields such as autonomous driving, where its advantage over single-modal vision has been verified. However, it has not been extensively explored for the scene captioning task. In this paper, a novel scene captioning framework is proposed that deeply fuses images and point clouds, based on region correlation and attention, to improve the performance of captioning models. The model uses a symmetrical processing pipeline for point clouds and images. First, 3D and 2D region features are generated through region proposal generation, proposal fusion, and region pooling modules. Then, a feature fusion module integrates these features according to a region correlation rule and an attention mechanism, which increases the interpretability of the fusion process and yields a sequence of fused visual features. Finally, an attention-based caption generation module transforms the fused features into captions. Comprehensive experiments indicate that the model achieves state-of-the-art performance.
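The feature fusion step described in the abstract can be illustrated with a small sketch. The abstract does not define the region correlation rule or the exact attention form, so everything below is an assumption made for illustration: correlation is taken to be 2D IoU between an image region and a 3D region already projected onto the image plane, attention is scaled dot-product attention over the correlated 3D regions, and the names `iou_2d`, `fuse_regions`, and `iou_threshold` are hypothetical. This is not the authors' implementation.

```python
import math


def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_regions(boxes_2d, feats_2d, boxes_3d_proj, feats_3d, iou_threshold=0.5):
    """Fuse each 2D region feature with an attended 3D context vector.

    Assumptions (not taken from the paper): 2D and 3D features share the
    same dimensionality, and two regions are "correlated" when the IoU of
    the 2D box and the projected 3D box is at least iou_threshold.
    """
    dim3 = len(feats_3d[0]) if feats_3d else 0
    fused = []
    for box, f2 in zip(boxes_2d, feats_2d):
        # Region correlation (assumed rule): keep overlapping 3D regions only.
        idx = [j for j, b3 in enumerate(boxes_3d_proj)
               if iou_2d(box, b3) >= iou_threshold]
        if idx:
            # Scaled dot-product attention of the 2D feature over 3D features.
            scale = math.sqrt(len(f2))
            scores = [sum(x * y for x, y in zip(f2, feats_3d[j])) / scale
                      for j in idx]
            weights = softmax(scores)
            ctx = [sum(w * feats_3d[j][k] for w, j in zip(weights, idx))
                   for k in range(dim3)]
        else:
            # No correlated 3D region: fall back to a zero context vector.
            ctx = [0.0] * dim3
        fused.append(list(f2) + ctx)
    return fused
```

Restricting attention to correlated regions is what makes a fusion of this kind inspectable: each fused feature can be traced back to the specific 3D regions that contributed to it and to their attention weights. The resulting list of fused vectors plays the role of the "sequence of fused visual features" that a caption decoder would then attend over.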
computer science, artificial intelligence