Generating Spatial-aware Captions for TextCaps

Yehuan Wang, Lin Shang
DOI: https://doi.org/10.1109/icpr56361.2022.9956709
2022-01-01
Abstract: TextCaps, also known as image captioning with reading comprehension, is the task of automatically describing images that contain both visual objects and scene text. It is more challenging than conventional image captioning because it requires models to read the scene text and include it in the generated captions. Recently, various models have achieved excellent results on this task. However, existing approaches make limited use of the spatial relationships among visual entities (both visual objects and scene text), so their captions can hardly describe the explicit spatial relations between the text and relevant objects. In this paper, we propose a Spatial Relationship Incorporated Multimodal Transformer (SRIMT) to generate spatial-aware captions. First, we construct a spatial graph to fully explore the spatial relationships between all the visual entities. Then, we present a novel spatially aware self-attention layer, the core of our model, to incorporate these spatial relationships. In this layer, attention between two visual entities is computed only when they are connected in the spatial graph. Furthermore, each head in our multi-head attention module is designed to focus on only one type of spatial relation, defined by a specific graph. Compared with fully connected transformer-based architectures, our model learns the spatial relationships of a visual scene more explicitly instead of dispersing attention among all visual entities, and it establishes strong spatial relations between OCR tokens and their corresponding objects. We extensively evaluate our model on the TextCaps dataset and achieve superior results compared with state-of-the-art approaches. Most notably, we improve the CIDEr score from 93.0 to 95.8.
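To make the idea of graph-restricted, per-relation attention heads concrete, here is a minimal illustrative sketch in plain Python. It is not the SRIMT implementation. The relation set (`left_of`, `right_of`, `above`, `below`), the dominant-axis rule for assigning a relation, the always-allowed self-connection, and the absence of learned projections are all assumptions made for illustration; the abstract does not specify how the paper defines its spatial graph.

```python
import math

# Hypothetical relation types; the abstract does not list the paper's actual set.
RELATIONS = ("left_of", "right_of", "above", "below")

def center(box):
    """Center of a bounding box given as (x1, y1, x2, y2), image coordinates (y grows downward)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def relation(box_i, box_j):
    """Relation of entity j relative to entity i, chosen by the dominant axis (an assumption)."""
    (xi, yi), (xj, yj) = center(box_i), center(box_j)
    dx, dy = xj - xi, yj - yi
    if abs(dx) >= abs(dy):
        return "right_of" if dx > 0 else "left_of"
    return "below" if dy > 0 else "above"

def build_masks(boxes):
    """One boolean adjacency matrix per relation type; self-connections are always kept."""
    n = len(boxes)
    masks = {r: [[i == j for j in range(n)] for i in range(n)] for r in RELATIONS}
    for i in range(n):
        for j in range(n):
            if i != j:
                masks[relation(boxes[i], boxes[j])][i][j] = True
    return masks

def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention where pairs not connected in the graph get zero weight."""
    d = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        top = max(scores)  # finite, because the self-connection is always allowed
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(len(values[0]))]
        )
    return outputs, weights

def spatial_multi_head(feats, boxes):
    """One head per relation type; head outputs are concatenated per entity."""
    masks = build_masks(boxes)
    per_head = [masked_attention(feats, feats, feats, masks[r])[0] for r in RELATIONS]
    return [[x for head in per_head for x in head[i]] for i in range(len(feats))]

# Toy scene: entity 0 at top-left, entity 1 to its right, entity 2 below it.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30)]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heads = spatial_multi_head(feats, boxes)
```

In this sketch, the `right_of` head lets entity 0 attend only to itself and entity 1, so entity 2 receives exactly zero weight in that head. A real model would add learned query, key, and value projections per head and would treat OCR tokens and detected objects as the entities.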