Accurate and Complete Captions for Question-controlled Text-aware Image Captioning

Yehuan Wang, Jian Hu, Lin Shang
DOI: https://doi.org/10.1109/ICME55011.2023.00475
2023-01-01
Abstract: Question-controlled Text-aware Image Captioning (Qc-TextCap) is the task of generating a distinctive, scene-text-aware caption for a given image according to several personalized questions. However, due to the diversity of visual scenes, it is hard for current Optical Character Recognition (OCR) systems to extract complete scene text sentences from images. In addition, existing work makes limited use of question features and visual features. In this paper, we propose a Multimodal Transformer plus Scene text clustering and Cross modal attention (MTSC) to tackle these challenges. We devise scene text clustering to group related scene text pieces that current OCR systems detect as separate results. To better utilize the information in questions and images, we design cross modal attention to enrich the features of both modalities. We extensively evaluate our model on the two Qc-TextCap datasets, where it achieves superior results compared with state-of-the-art approaches.
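To make the idea of grouping fragmented OCR results concrete, here is a minimal sketch in Python. It is not the clustering method of the paper, whose criterion the abstract does not describe. It assumes a simple spatial heuristic: two OCR tokens belong to the same group if their bounding boxes lie within a small gap of each other. The names `OcrToken`, `cluster_tokens`, and `max_gap` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class OcrToken:
    """One OCR detection: recognized text plus an axis-aligned box (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _close(a: OcrToken, b: OcrToken, max_gap: float) -> bool:
    # Gap between the two boxes along each axis (0 when they overlap on that axis).
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return dx <= max_gap and dy <= max_gap


def cluster_tokens(tokens: list[OcrToken], max_gap: float = 10.0) -> list[str]:
    """Group spatially close tokens with union-find and join each group into one string.

    Assumption: proximity is the only grouping signal; a real system would
    likely also use appearance or semantic features.
    """
    parent = list(range(len(tokens)))

    def find(i: int) -> int:
        # Find the representative of i, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of tokens whose boxes are within max_gap of each other.
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if _close(tokens[i], tokens[j], max_gap):
                parent[find(i)] = find(j)

    groups: dict[int, list[OcrToken]] = {}
    for i, tok in enumerate(tokens):
        groups.setdefault(find(i), []).append(tok)

    # Simple reading order: top to bottom, then left to right.
    clusters = [sorted(g, key=lambda t: (t.y0, t.x0)) for g in groups.values()]
    clusters.sort(key=lambda g: (g[0].y0, g[0].x0))
    return [" ".join(t.text for t in g) for g in clusters]
```

With two adjacent tokens "STOP" and "AHEAD" and a distant token "EXIT", this sketch would return the two groups "STOP AHEAD" and "EXIT", so that a downstream captioner sees complete phrases rather than isolated words.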