Partial Scene Text Retrieval

Hao Wang, Minghui Liao, Zhouyi Xie, Wenyu Liu, Xiang Bai
2024-11-18
Abstract: The task of partial scene text retrieval involves localizing and searching for text instances that are the same as or similar to a given query text in an image gallery. However, existing methods can only handle text-line instances. Searching for partial patches within these text-line instances remains unsolved because the training data lack patch annotations. To address this issue, we propose a network that can simultaneously retrieve both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, our approach adopts Multiple Instance Learning (MIL) to learn their similarities with the query text without requiring extra annotations. However, constructing bags, a standard step in conventional MIL approaches, can introduce numerous noisy training samples and slow down inference. To address this issue, we propose a Ranking MIL (RankMIL) approach that adaptively filters out those noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that directly searches for the target partial patch within a text-line instance at inference time, without requiring bags. This greatly improves both search efficiency and the performance of retrieving partial patches. The source code and dataset are available at https://github.com/lanfeng4659/PSTR.
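To make the bag-construction step concrete, here is a minimal pure-Python sketch of conventional MIL over partial patches. It is an illustration under stated assumptions, not the authors' implementation. It assumes (hypothetically) that a text-line instance is a sequence of column feature vectors already projected into the shared space, that a partial patch is a contiguous span of columns mean-pooled into one vector, and that the query text is a single vector in the same space. Because the bag holds every span, its size grows as T(T+1)/2 for T columns, which is one way to see why bags add many redundant or noisy candidates and slow inference. `top_k_instances` is a hypothetical ranking filter included only to illustrate the general idea of keeping high-ranked instances; it is not the paper's RankMIL criterion.

```python
import math
from typing import List, Tuple

Vector = List[float]
# Hypothetical instance format: (start column, end column, pooled feature).
Instance = Tuple[int, int, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pool(columns: List[Vector]) -> Vector:
    """Average a non-empty list of column features into one patch feature."""
    dim = len(columns[0])
    return [sum(col[d] for col in columns) / len(columns) for d in range(dim)]


def build_bag(line_feats: List[Vector], min_len: int = 1) -> List[Instance]:
    """Conventional MIL bag: every contiguous span [start, end) is a candidate.

    min_len must be >= 1 so that no span is empty.
    """
    n = len(line_feats)
    bag: List[Instance] = []
    for start in range(n):
        for end in range(start + min_len, n + 1):
            bag.append((start, end, mean_pool(line_feats[start:end])))
    return bag


def bag_similarity(query: Vector, bag: List[Instance]) -> Tuple[float, Tuple[int, int]]:
    """MIL max-pooling: a bag scores as well as its best-matching instance."""
    # max() returns the first instance among ties, i.e. the earliest enumerated span.
    start, end, feat = max(bag, key=lambda inst: cosine(query, inst[2]))
    return cosine(query, feat), (start, end)


def top_k_instances(query: Vector, bag: List[Instance], k: int) -> List[Tuple[int, int]]:
    """Hypothetical ranking filter (not the paper's RankMIL): keep the k best spans."""
    # sorted(..., reverse=True) is stable, so tied spans keep enumeration order.
    ranked = sorted(bag, key=lambda inst: cosine(query, inst[2]), reverse=True)
    return [(start, end) for start, end, _ in ranked[:k]]


if __name__ == "__main__":
    line = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    query = [0.0, 1.0]
    bag = build_bag(line)
    print(len(bag), bag_similarity(query, bag), top_k_instances(query, bag, 3))
```

With the query `[0.0, 1.0]` and the four-column line above, the bag contains 10 spans and the max-pooled score is 1.0, but three overlapping spans, (1, 2), (1, 3) and (2, 3), tie for the top score. This is a small picture of how near-duplicate instances crowd a bag, which is the kind of redundancy a ranking-based filter is meant to reduce.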
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how, in scene text retrieval, to simultaneously localize and search for text instances that are identical or similar to a given query text, as well as partial patches within those instances. Existing methods handle only text-line instances and cannot search for partial patches inside them, mainly because the training data lack patch annotations. The authors therefore propose a network that retrieves text-line instances and their partial patches at the same time. The method embeds the query text and scene text instances into a shared feature space and measures the cross-modal similarity between them. To handle partial patches, it adopts Multiple Instance Learning (MIL), which learns their similarity to the query text without additional annotations. Because the bags built in conventional MIL introduce noisy samples and slow down inference, the paper further proposes Ranking MIL (RankMIL) to adaptively filter out those noisy samples, and a Dynamic Partial Match Algorithm (DPMA) that searches for the target partial patch directly within a text-line instance at inference time, without building bags. Together these greatly improve both search efficiency and retrieval performance. The paper reports significant performance improvements on both English and Chinese datasets.
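The summary says DPMA finds the target partial patch inside a text line at inference time without building bags. The paper's exact algorithm is not reproduced here. As a loosely analogous illustration only, the sketch below swaps in a Smith-Waterman-style local alignment. It assumes (hypothetically) that the query is a sequence of per-character vectors and the text line is a sequence of column vectors in the shared space, and it finds the best-matching contiguous column span in a single dynamic-programming pass of m × n cell updates instead of scoring every candidate span. The function name `local_align` and the parameters `match_bias` and `gap` are illustrative choices, not names or values from the paper.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def local_align(
    query_seq: List[Vector],
    line_cols: List[Vector],
    match_bias: float = 0.5,
    gap: float = 0.3,
) -> Tuple[float, Optional[Tuple[int, int]]]:
    """Smith-Waterman-style local alignment of query elements against line columns.

    Hypothetical stand-in, not the paper's DPMA. Returns the best score and the
    matched column span [start, end), or (0.0, None) when no alignment scores
    above zero. `gap` must be positive.
    """
    m, n = len(query_seq), len(line_cols)
    # H[i][j]: best score of an alignment ending at query element i-1 and column j-1.
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    # S[i][j]: first line column of that alignment (-1 when H[i][j] == 0).
    S = [[-1] * (n + 1) for _ in range(m + 1)]
    best_score: float = 0.0
    best_span: Optional[Tuple[int, int]] = None
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Shifted similarity: weak matches (cosine < match_bias) score negative.
            s = cosine(query_seq[i - 1], line_cols[j - 1]) - match_bias
            diag = H[i - 1][j - 1] + s
            # A diagonal move from an empty cell starts a new span at column j-1.
            diag_start = S[i - 1][j - 1] if H[i - 1][j - 1] > 0 else j - 1
            up = H[i - 1][j] - gap    # skip a query element
            left = H[i][j - 1] - gap  # absorb an extra line column
            score, start = 0.0, -1
            if diag > score:
                score, start = diag, diag_start
            if up > score:
                score, start = up, S[i - 1][j]
            if left > score:
                score, start = left, S[i][j - 1]
            H[i][j], S[i][j] = score, start
            if score > best_score:
                best_score, best_span = score, (start, j)
    return best_score, best_span


if __name__ == "__main__":
    a, b, c = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
    print(local_align([b, c], [a, b, c, a]))
```

With one-hot "characters" a, b and c, the line [a, b, c, a] and the query [b, c] align to columns [1, 3) with score 1.0, while a query with no similar column returns (0.0, None). The design choice here is the usual one for local alignment: the zero floor lets a match start anywhere in the line, so the partial span is found without first enumerating candidate patches.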