Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt

Xingtao Lin, Heqian Qiu, Lanxiao Wang, Ruihang Wang, Linfeng Xu, Hongliang Li
2024-09-20
Abstract: Recent advances in prompt tuning have successfully adapted large-scale models such as Contrastive Language-Image Pre-training (CLIP) to downstream tasks such as scene text detection. Typically, a text prompt complements the text encoder's input and focuses on global features while neglecting fine-grained details, so fine-grained text is overlooked in scene text detection. In this paper, we propose region prompt tuning (RPT) for fine-grained scene text detection, in which the proposed region text prompt helps the model focus on fine-grained features. RPT decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows each character to match the local features of its token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a shared position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target "text". To refine information at the fine-grained level, we implement character-token level interactions before and after encoding. Our method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that balances global and local features and is fed into DBNet to detect text. Experiments on benchmarks such as ICDAR2015, TotalText, and CTW1500 demonstrate RPT's impressive performance, underscoring its effectiveness for scene text detection.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a weakness of existing prompt tuning methods for scene text detection: they concentrate on global features and overlook fine-grained text. Current methods typically use a text prompt to supplement the text encoder's input, which captures global features but misses fine-grained details. This limits detection accuracy, especially for small text or text on complex backgrounds.

To overcome this problem, the authors propose Region Prompt Tuning (RPT), designed specifically for fine-grained scene text detection. RPT decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens, establishing a one-to-one correspondence between characters and tokens. Each character is matched against the local features of its corresponding region visual token, so detailed features and fine-grained text are not omitted. RPT also introduces a shared position embedding and a bidirectional distance loss to improve the alignment between characters and tokens and, in turn, detection performance.

The main contributions of the paper are:

1. **Region text prompts**: A region score map is generated through the shared position embedding and character-token interaction, strengthening local features and preventing fine-grained text from being ignored at the token level.
2. **Bidirectional distance loss**: Region text prompts are made to focus on the detection target "text", while general text prompts are made to focus on fine-grained features.
3. **Dual-matching method**: The general score map and the region score map are combined, balancing global and local features through feature enhancement and fusion.

Experimental results show that RPT performs strongly on multiple benchmark datasets (ICDAR2015, TotalText, and CTW1500), supporting its effectiveness for scene text detection.
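
The dual-matching idea described above (a region score map from one-to-one character-token matching, fused with a general score map from image-text matching) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sigmoid over temperature-scaled cosine similarity, the `temperature` value, and the weighted-average fusion with `alpha` are all assumptions made for illustration, and the embeddings are plain Python lists rather than CLIP features.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def region_score_map(char_embs, token_embs, pos_embs, temperature=0.07):
    """Score token i against character i only (one-to-one matching).

    Assumption: the shared position embedding is modeled by adding the
    same vector pos_embs[i] to both character i and token i.
    """
    assert len(char_embs) == len(token_embs) == len(pos_embs)
    scores = []
    for c, t, p in zip(char_embs, token_embs, pos_embs):
        c_pos = [x + y for x, y in zip(c, p)]
        t_pos = [x + y for x, y in zip(t, p)]
        scores.append(sigmoid(cosine(c_pos, t_pos) / temperature))
    return scores


def general_score_map(prompt_emb, token_embs, temperature=0.07):
    """Score every token against a single global text-prompt embedding."""
    return [sigmoid(cosine(prompt_emb, t) / temperature) for t in token_embs]


def fuse(general, region, alpha=0.5):
    """Assumed fusion rule: per-token weighted average of the two maps."""
    return [alpha * g + (1.0 - alpha) * r for g, r in zip(general, region)]


if __name__ == "__main__":
    random.seed(0)
    dim, num_tokens = 8, 4  # e.g. a 2x2 feature map flattened to 4 tokens

    def rand_vec():
        return [random.uniform(-1.0, 1.0) for _ in range(dim)]

    tokens = [rand_vec() for _ in range(num_tokens)]
    chars = [rand_vec() for _ in range(num_tokens)]   # one character per token
    positions = [rand_vec() for _ in range(num_tokens)]
    prompt = rand_vec()

    final = fuse(general_score_map(prompt, tokens),
                 region_score_map(chars, tokens, positions))
    print(final)  # one fused score per token, each in [0, 1]
```

In the paper the fused score map is passed to DBNet for detection; the sketch stops at the fused map and leaves out the detection head, the bidirectional distance loss, and the character-token interactions before and after encoding.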