Visual Matching is Enough for Scene Text Retrieval.

Lilong Wen,Yingrong Wang,Dongxiang Zhang,Gang Chen
DOI: https://doi.org/10.1145/3539597.3570428
2023-01-01
Abstract:Given a text query, the task of scene text retrieval aims at searching and localizing all the text instances that are contained in an image gallery. The state-of-the-art method learns a cross-modal similarity between the query text and the detected text regions in natural images to facilitate retrieval. However, this cross-modal approach still cannot well bridge the heterogeneity gap between the text and image modalities. In this paper, we propose a new paradigm that converts the task into a single-modality retrieval problem. Unlike previous works that rely on character recognition or embedding, we directly leverage pictorial information by rendering query text into images to learn the glyph feature of each character, which can be utilized to capture the similarity between query and scene text images. With the extracted visual features, we devise a synthetic label image guided feature alignment mechanism that is robust to different scene text styles and layouts. The modules of glyph feature learning, text instance detection, and visual matching are jointly trained in an end-to-end framework. Experimental results show that our proposed paradigm achieves the best performance in multiple benchmark datasets. As a side product, our method can also be easily generalized to support text queries with unseen characters or languages in a zero-shot manner.
What problem does this paper attempt to address?