Abstract: Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-training (CLIP) for downstream tasks such as scene text detection. Typically, a text prompt complements the text encoder's input, focusing on global features while neglecting fine-grained details, so fine-grained text is ignored in scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, in which the proposed region text prompt helps focus on fine-grained features. RPT decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows each character to match the local features of its token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a shared position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target "text". To refine the information at the fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that balances global and local features and is fed into DBNet to detect the text. Experiments on benchmarks such as ICDAR2015, TotalText, and CTW1500 demonstrate RPT's impressive performance, underscoring its effectiveness for scene text detection.
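
The character-token matching and score-map fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the additive position embedding, and the fusion weight `alpha` are all assumptions made for illustration, and the encoders are replaced by random features.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def general_score_map(visual_tokens, text_embedding):
    """Global matching: every token is compared with one text embedding.

    visual_tokens: (H, W, D), text_embedding: (D,) -> scores of shape (H, W).
    """
    return l2_normalize(visual_tokens) @ l2_normalize(text_embedding)

def region_score_map(visual_tokens, char_embeddings, position_embedding):
    """One-to-one matching: token (i, j) is compared only with character (i, j).

    The same position embedding is added to both sides (an assumed way of
    "sharing" it) so that each character is tied to its own token.
    All inputs have shape (H, W, D); the result has shape (H, W).
    """
    v = l2_normalize(visual_tokens + position_embedding)
    c = l2_normalize(char_embeddings + position_embedding)
    return np.sum(v * c, axis=-1)

def fuse_score_maps(general, region, alpha=0.5):
    """Assumed fusion rule: a convex combination of the two score maps."""
    return alpha * general + (1.0 - alpha) * region

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
tokens = rng.normal(size=(H, W, D))   # region visual tokens
chars = rng.normal(size=(H, W, D))    # one character embedding per token
pos = rng.normal(size=(H, W, D))      # shared position embedding
text = rng.normal(size=(D,))          # embedding of the general text prompt

final_map = fuse_score_maps(
    general_score_map(tokens, text),
    region_score_map(tokens, chars, pos),
)
print(final_map.shape)
```

In the described pipeline the fused map would then be passed to DBNet; that step is omitted here.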

What problem does this paper attempt to address?

This paper addresses a weakness of existing prompt tuning methods for scene text detection: they concentrate on global features and overlook fine-grained text. Current methods typically use text prompts to supplement the input of the text encoder, which captures global features but neglects fine-grained details. This limits detection accuracy, especially for text in small fonts or against complex backgrounds.

To overcome this problem, the authors propose Region Prompt Tuning (RPT), designed specifically for fine-grained scene text detection. RPT decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens, establishing a one-to-one correspondence between characters and tokens. Each character is then matched against the local features of its corresponding token, so detailed features and fine-grained text are not omitted. RPT also introduces a shared position embedding and a bidirectional distance loss to improve the alignment between characters and tokens.

The main contributions of the paper include:

1. **Region text prompts**: Generate a region score map through the shared position embedding and character-token interaction, strengthening local features and avoiding the omission of fine-grained text at the token level.
2. **Bidirectional distance loss**: Make the region text prompt focus on the detection target "text" while making the general text prompt attend to fine-grained features.
3. **Dual-matching method**: Combine the general score map and the region score map, balancing global and local features through feature enhancement and fusion.

Experimental results on multiple benchmark datasets (ICDAR2015, TotalText, and CTW1500) show that RPT performs strongly for scene text detection.
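
The summary does not give the formula for the bidirectional distance loss, so the following is only one plausible reading, sketched under stated assumptions: cosine distance is used as the metric, one term pulls each region-prompt character toward the embedding of the target word "text", and a second term pulls the target embedding toward the centroid of the characters. The function names and both terms are hypothetical and may differ from the paper's definition.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity along the last axis (broadcasts a against b)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(a * b, axis=-1)

def bidirectional_distance(char_embeddings, target_embedding):
    """Assumed two-term distance between characters and the target "text".

    char_embeddings: (N, D), one row per region-prompt character.
    target_embedding: (D,), embedding of the general prompt's target word.
    """
    # Direction 1: each character is compared with the target.
    char_to_target = cosine_distance(char_embeddings, target_embedding).mean()
    # Direction 2: the target is compared with the centroid of the characters.
    target_to_chars = cosine_distance(target_embedding, char_embeddings.mean(axis=0))
    return char_to_target + target_to_chars

# Toy usage with random stand-ins for the learned embeddings.
rng = np.random.default_rng(0)
chars = rng.normal(size=(16, 8))
target = rng.normal(size=(8,))
print(float(bidirectional_distance(chars, target)))
```

In a training setup this value would be minimized with an autodiff framework so that gradients reach both the region prompt and the general prompt; NumPy is used here only to show the arithmetic.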

Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt

Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

R-Tuning: Regularized Prompt Tuning in Open-Set Scenarios

Efficient Prompt Tuning by Multi-Space Projection and Prompt Fusion

CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task

Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Positional Prompt Tuning for Efficient 3D Representation Learning

DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

LIPT: Improving Prompt Tuning with Late Inception Reparameterization

Prompt Tuning with Soft Context Sharing for Vision-Language Models

LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning

Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models

APrompt: Attention Prompt Tuning for Efficient Adaptation of Pre-trained Language Models

Pro-tuning: Unified Prompt Tuning for Vision Tasks

Towards Unified Prompt Tuning for Few-shot Text Classification

Understanding Prompt Tuning for V-L Models Through the Lens of Neural Collapse

PPT: Pre-trained Prompt Tuning for Few-shot Learning