Partial Scene Text Retrieval

Hao Wang, Minghui Liao, Zhouyi Xie, Wenyu Liu, Xiang Bai

2024-11-18

Abstract: The task of partial scene text retrieval involves localizing and searching for text instances that are the same as or similar to a given query text in an image gallery. However, existing methods can only handle text-line instances, leaving the problem of searching for partial patches within these text-line instances unsolved due to a lack of patch annotations in the training data. To address this issue, we propose a network that can simultaneously retrieve both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, our method adopts Multiple Instance Learning (MIL) to learn their similarities with the query text, without requiring extra annotations. However, constructing bags, which is a standard step of conventional MIL approaches, can introduce numerous noisy samples into training and slow down inference. To address this issue, we propose a Ranking MIL (RankMIL) approach to adaptively filter those noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that can directly search for the target partial patch in a text-line instance during the inference stage, without requiring bags. This greatly improves both the search efficiency and the performance of retrieving partial patches. The source code and dataset are available at https://github.com/lanfeng4659/PSTR.
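The bag construction step that the abstract attributes to conventional MIL can be illustrated with a minimal sketch. This is not the paper's RankMIL or DPMA: it assumes a text line is already represented as a sequence of per-position feature vectors, forms a bag by mean-pooling every contiguous window, and scores the line by its best-matching window. The function names (`build_bag`, `mil_score`), the mean-pooling choice, and the toy features are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_bag(line_features, min_len=1):
    """Enumerate contiguous windows of a text line (assumed bag construction).

    Each window is mean-pooled into one embedding. For n positions this
    yields O(n^2) candidates, most of which do not match any query.
    """
    bag = []
    n = len(line_features)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            window = line_features[start:end]
            pooled = [sum(col) / len(window) for col in zip(*window)]
            bag.append(((start, end), pooled))
    return bag


def mil_score(query_embedding, line_features):
    """Score a text line by its best-matching window (max over the bag)."""
    bag = build_bag(line_features)
    return max((cosine(query_embedding, emb), span) for span, emb in bag)


# Toy example: three positions, query resembles positions 1 and 2 combined.
features = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
score, span = mil_score([0, 1, 1], features)
print(span, round(score, 3))
```

The quadratic number of windows in `build_bag` is one way to see why bags add noisy candidates and cost at inference time, which is the motivation the abstract gives for filtering samples and for searching without bags.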

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to simultaneously localize and retrieve, in scene text retrieval, both the text instances that are identical or similar to a given query text and the partial patches within them. Existing methods can only handle text-line instances and cannot search for partial patches within those instances, mainly because the training data lacks patch-level annotations. The authors therefore propose a network that retrieves text-line instances and their partial patches at the same time. Specifically, the method embeds the query text and the scene text instances into a shared feature space and measures the cross-modal similarity between them. To handle partial patches, it adopts Multiple Instance Learning (MIL) to learn their similarity to the query text without additional annotations. The paper also proposes Ranking MIL (RankMIL) to adaptively filter the noisy samples that bag construction introduces, and the Dynamic Partial Match Algorithm (DPMA) to search directly for the target partial patch within a text-line instance at inference time, which greatly improves search efficiency and retrieval performance. With this method, the paper reports significant performance improvements on both English and Chinese datasets.
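The retrieval step described above, ranking gallery images by similarity in a shared feature space, can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the embeddings are hand-written toy vectors standing in for network outputs, an image is scored by its best-matching text instance, and `rank_gallery` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_gallery(query_embedding, gallery):
    """Rank images by their most similar text instance.

    gallery maps an image id to a list of instance embeddings that live in
    the same space as the query embedding (assumed to be produced elsewhere).
    Images with no detected text receive a score of 0.0.
    """
    scored = []
    for image_id, instances in gallery.items():
        best = max((cosine(query_embedding, e) for e in instances), default=0.0)
        scored.append((image_id, best))
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy gallery: b.jpg contains an instance aligned with the query.
gallery = {
    "a.jpg": [[1.0, 0.0], [0.9, 0.1]],
    "b.jpg": [[0.0, 1.0]],
    "c.jpg": [],
}
for image_id, score in rank_gallery([0.0, 1.0], gallery):
    print(image_id, round(score, 3))
```

Scoring an image by the maximum over its instances means a single matching text instance is enough to surface the image, which fits a retrieval setting where most text in an image is irrelevant to the query.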

