Zero-TextCap: Zero-shot Framework for Text-based Image Captioning

Dongsheng Xu, Wenye Zhao, Yi Cai, Qingbao Huang
DOI: https://doi.org/10.1145/3581783.3612571
2023-01-01
Abstract: Text-based image captioning is a vital but under-explored task that aims to automatically describe images with captions containing scene text. Recent studies have made encouraging progress, but they still suffer from two issues. First, current models cannot capture and generate scene text in non-Latin script languages, which severely limits the objectivity and information completeness of the generated captions. Second, current models tend to describe images in a monotonous, templated style, which greatly limits the diversity of the generated captions. Although these issues can be alleviated through carefully designed annotations, that process is laborious and time-consuming. To address them, we propose a Zero-shot Framework for Text-based Image Captioning (Zero-TextCap). Concretely, we introduce a Hybrid-sampling masked language model (H-MLM) that generates candidate sentences starting from the prompt 'Image of' and iteratively refines them to improve the quality and diversity of captions. To read multi-lingual scene text and model the relationships between them, we introduce a robust OCR system. To ensure that the captions generated by H-MLM contain scene text and are highly relevant to the image, we propose a CLIP-based generation guidance module that inserts OCR tokens and filters candidate sentences. Zero-TextCap is capable of generating captions containing multi-lingual scene text and boosting the diversity of captions. Extensive experiments demonstrate the effectiveness of our proposed Zero-TextCap. Our code is available at https://github.com/Gemhuang79/Zero_TextCap.
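The generate-then-filter loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: `toy_mlm_candidates` and `toy_image_text_score` are hypothetical stand-ins for the masked language model and the CLIP similarity score, the hybrid sampling strategy of H-MLM is not reproduced, and OCR tokens are simply added to the candidate vocabulary as an assumption about how they might enter the caption.

```python
import random

PROMPT = ["Image", "of"]   # fixed prompt named in the abstract
MASK = "[MASK]"


def toy_mlm_candidates(position, vocab, k, rng):
    """Hypothetical stand-in for a masked LM: propose k fillers for one slot."""
    return rng.sample(vocab, min(k, len(vocab)))


def toy_image_text_score(tokens, image_keywords):
    """Hypothetical stand-in for CLIP similarity: distinct keywords covered."""
    return len(set(tokens) & set(image_keywords))


def generate_caption(image_keywords, ocr_tokens, vocab,
                     length=4, iterations=3, k=3, seed=0):
    """Fill masked slots after the prompt, then iteratively refine them."""
    rng = random.Random(seed)
    # Assumption: OCR tokens compete with ordinary vocabulary words.
    candidate_vocab = sorted(set(vocab) | set(ocr_tokens))
    body = [MASK] * length
    for _ in range(iterations):
        for pos in range(length):
            # A still-masked slot accepts any proposal; a filled slot is
            # replaced only if the image-text score strictly improves.
            if body[pos] == MASK:
                best_score = -1
            else:
                best_score = toy_image_text_score(PROMPT + body, image_keywords)
            best_tok = body[pos]
            for tok in toy_mlm_candidates(pos, candidate_vocab, k, rng):
                trial = body[:pos] + [tok] + body[pos + 1:]
                score = toy_image_text_score(PROMPT + trial, image_keywords)
                if score > best_score:
                    best_tok, best_score = tok, score
            body[pos] = best_tok
    return PROMPT + body


if __name__ == "__main__":
    caption = generate_caption(
        image_keywords={"dog", "STOP"},   # pretend image content plus scene text
        ocr_tokens=["STOP"],              # pretend OCR output
        vocab=["dog", "cat", "a", "sign"],
        k=10,
    )
    print(" ".join(caption))
```

In a real system the two stand-in functions would be replaced by a pretrained masked language model and an image-text similarity model; the loop structure (propose, score against the image, keep the best) is the only part this sketch is meant to convey.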