Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

Jing Wang,Jinhui Tang,Jiebo Luo
DOI: https://doi.org/10.1145/3394171.3413753
2020-01-01
Abstract:OCR-based image captioning is the task of automatically describing images based on reading and understanding written text contained in images. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties originate from how to make full use of the knowledge contained in the textual entities to facilitate sentence generation and how to predict a text token based on the limited information provided by the image. Such problems are not yet fully investigated in existing research. In this paper, we present a novel design - Multimodal Attention Captioner with OCR Spatial Relationship (dubbed as MMA-SR) architecture, which manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning. Specifically, the representations of text tokens and objects are fed into a three-layer LSTM captioner. Different attention scores for text tokens and objects are exploited through the multimodal attention network. Based on the attended features and the LSTM states, words are selected from the common vocabulary or from the image text by incorporating the learned spatial relationships between text tokens. Extensive experiments conducted on the TextCaps dataset verify the effectiveness of the proposed MMA-SR method. More remarkably, our MMA-SR increases CIDEr-D score from 93.7% to 98.0%.
What problem does this paper attempt to address?