Beyond visual semantics: Exploring the role of scene text in image understanding

Arka Ujjal Dey, Suman K. Ghosh, Ernest Valveny, Gaurav Harit
DOI: https://doi.org/10.1016/j.patrec.2021.06.011
IF: 4.757
2021-09-01
Pattern Recognition Letters
Abstract: Images with visual and scene text content are ubiquitous in everyday life. However, current image interpretation systems are mostly limited to visual features and neglect the scene text content. In this paper, we propose to jointly use the scene text and visual channels for robust semantic interpretation of images. We not only extract and encode visual and scene text cues but also model their interplay to generate a contextual joint embedding with richer semantics. The contextual embedding thus generated is applied to retrieval and classification tasks on multimedia images with scene text content to demonstrate its effectiveness. In the retrieval framework, we augment the contextual semantic representation with scene text cues to mitigate vocabulary misses that may have occurred during the semantic embedding. To deal with irrelevant or erroneous scene text recognition, we also apply query-based attention to the text channel. We show that our multi-channel approach, involving contextual semantics and scene text, improves upon the absolute accuracy of the current state-of-the-art methods on the Advertisement Images Dataset by 8.9% in the relevant statement retrieval task and by 5% in the topic classification task.
computer science, artificial intelligence
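
The abstract mentions query-based attention over the text channel as a way to down-weight irrelevant or misrecognized scene text. The following is only an illustrative sketch of that general idea, not the authors' implementation: it assumes scaled dot-product attention, and the function name `attend_scene_text` and the toy two-dimensional embeddings are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_scene_text(query_vec, token_vecs):
    """Weight scene-text token embeddings by similarity to a query embedding.

    Assumption: scaled dot-product scoring; the paper may score differently.
    Returns the attention weights and the weighted sum of token embeddings.
    """
    scale = math.sqrt(len(query_vec))
    scores = [dot(query_vec, t) / scale for t in token_vecs]
    weights = softmax(scores)
    dim = len(query_vec)
    pooled = [
        sum(w * t[i] for w, t in zip(weights, token_vecs)) for i in range(dim)
    ]
    return weights, pooled


# Toy data (hypothetical): one query and three recognized scene-text tokens.
query = [1.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, pooled = attend_scene_text(query, tokens)
```

In this toy setup the token aligned with the query receives the largest weight and the token pointing the opposite way receives the smallest, which is the behaviour one would want for suppressing scene text unrelated to the query.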