Cross-region Feature Fusion with Geometrical Relationship for OCR-based Image Captioning

Jinfei Zhou, Cheng Yang, Yaping Zhu, Yana Zhang
DOI: https://doi.org/10.1016/j.neucom.2024.128197
IF: 6
2024-01-01
Neurocomputing
Abstract: Automatically generating a readable sentence that describes an image containing text is a challenging task. Compared to traditional image captioning algorithms, OCR-based image captioning focuses on reading the OCR tokens in an image and understanding them together with the image content to generate descriptions. However, existing research mainly concentrates on improving the quantity and quality of the OCR tokens obtained and on exploring their spatial relationships, while lacking investigation into how to effectively combine OCR tokens with image content. This paper proposes a cross-region feature fusion with geometrical relationship Transformer (CFGR-Transformer) for OCR-based image captioning. The network first establishes associations between the OCR regions and object regions of the image by constructing relative geometric relationships, including width/height difference, distance, IoU (Intersection over Union), inclusion relationships, and angle offsets. It then incorporates intra-region and cross-region features to aggregate entities from different modalities through a multi-head attention mechanism based on these relative relationships. Guided by the relative relationships, visual entities such as OCR tokens and object regions can use multiple relative relationships as attention weights for feature fusion within each subspace. Extensive experiments conducted on the TextCaps dataset demonstrate the effectiveness of the proposed CFGR-Transformer method. In particular, our results on the online testing of TextCaps achieve an improvement in CIDEr score from 93.0% to 98.2%.
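The relative geometric relationships listed in the abstract can be illustrated with a minimal sketch for a single pair of regions. The abstract does not give the paper's exact formulas, so everything here is an assumption made for illustration: the function name `relative_geometry`, the `(x1, y1, x2, y2)` box format, the use of center-to-center Euclidean distance, and the angle taken as the direction from one box center to the other.

```python
import math


def relative_geometry(a, b):
    """Relative geometric features of box b with respect to box a.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    Hypothetical helper; these definitions are illustrative assumptions,
    not the formulas used in the paper.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Widths, heights, and centers of both boxes.
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1
    acx, acy = (ax1 + ax2) / 2, (ay1 + ay2) / 2
    bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
    dx, dy = bcx - acx, bcy - acy

    # Intersection over Union.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter

    return {
        "width_diff": bw - aw,
        "height_diff": bh - ah,
        "distance": math.hypot(dx, dy),          # center-to-center distance
        "iou": inter / union if union > 0 else 0.0,
        "a_inside_b": bx1 <= ax1 and by1 <= ay1 and ax2 <= bx2 and ay2 <= by2,
        "angle": math.atan2(dy, dx),             # direction from a's center to b's
    }


# Example: an OCR token box and an object box (made-up coordinates).
ocr_box = (1.0, 1.0, 2.0, 2.0)
obj_box = (0.0, 0.0, 4.0, 4.0)
features = relative_geometry(ocr_box, obj_box)
```

In a model of the kind the abstract describes, features like these would be computed for every OCR/object region pair and used to bias the attention weights. How they are embedded and combined per attention head is not specified in the abstract and is left out of this sketch.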