Visual and semantic guided scene text retrieval

Luo, Hailong
DOI: https://doi.org/10.1007/s11227-024-06268-6
IF: 3.3
2024-06-10
The Journal of Supercomputing
Abstract: In this paper, we introduce a novel end-to-end trainable network for scene text retrieval. Unlike state-of-the-art methods, which match the visual features of individual character images, our network transforms the entire query text into a single query image. By integrating visual and language modules, the network extracts rich visual and semantic features from the query image, facilitating efficient similarity modeling and query matching. This hybrid embedding of visual-semantic features from query images is robust to complex text styles and layouts. Experimental results on multiple benchmark datasets validate the superiority of our framework, especially in multilingual retrieval tasks, where it achieves a 20.15% increase in mAP score over the current state of the art. This significant gain demonstrates the potential of our network for multilingual scene text retrieval.
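The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that visual and semantic embeddings are already available as plain vectors, that they are fused by normalizing and concatenating them, and that candidates are ranked by cosine similarity. The function names (`fuse`, `rank`) and the toy vectors are hypothetical and are not taken from the paper, whose actual fusion and similarity modeling may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, semantic):
    """Hypothetical fusion: normalize each part, concatenate, normalize again."""
    return l2_normalize(l2_normalize(visual) + l2_normalize(semantic))


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def rank(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings standing in for the outputs of the visual and language modules.
query = fuse([1.0, 0.0], [0.0, 1.0])
candidates = {
    "img_a": fuse([0.9, 0.1], [0.1, 0.9]),  # close to the query in both parts
    "img_b": fuse([0.0, 1.0], [1.0, 0.0]),  # orthogonal to the query
}
ranking = rank(query, candidates)
```

In this toy setup `img_a` ranks first because both its visual and semantic parts point in nearly the same direction as the query, while `img_b` scores zero.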
computer science, theory & methods, engineering, electrical & electronic, hardware & architecture