A Text Detector Based on the Specific Text Prompt

Xingtao Lin, Chuanyang Gong, Lanxiao Wang, Heqian Qiu, Shengyu Tong, Hongliang Li
DOI: https://doi.org/10.1109/icip51287.2024.10647439
2024-01-01
Abstract: Prompt tuning has emerged as a new paradigm for adapting the large-scale Contrastive Language-Image Pre-training (CLIP) model to downstream tasks such as text detection. However, the learnable prompts adopted by existing prompt-tuning methods carry blurry, abstract meanings rather than fine-grained text features. In this paper, we propose a powerful and robust text detector, called STP-TD, which uses a specific text prompt and a learnable visual mask to fully apply the prior knowledge of the CLIP model to text detection. STP-TD splits the prompt into characters and uses the Transformer mechanism to make each prompt character focus on an ordered image token. Because each prompt character is expected to represent the fine-grained text feature of an image token, a distance loss is added to rectify the prompt during optimization. Additionally, STP-TD is the first to propose a learnable visual mask that refines the text region in advance. A synergetic framework is also introduced as a bridge between the visual branch and the text branch. We further adopt a pixel-text matching process to align every pixel of the image feature with the text feature. Experiments on the ICDAR2015 and TotalText datasets show that STP-TD outperforms state-of-the-art methods.
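The abstract does not spell out how pixel-text matching is computed. As a rough illustration only, the sketch below assumes one common formulation: each pixel embedding is compared with a single text embedding by cosine similarity, and the result is passed through a sigmoid to give a per-pixel text score. The function names, the nested-list layout, and the `temperature` value of 0.07 are hypothetical choices for this example, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returns zeros for a zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else [0.0] * len(vec)


def pixel_text_score_map(pixel_features, text_feature, temperature=0.07):
    """Score every pixel embedding against one text embedding.

    pixel_features: H x W grid (nested lists) of C-dimensional embeddings.
    text_feature:   one C-dimensional text embedding.
    temperature:    assumed scaling factor, not a value from the paper.
    Returns an H x W grid of scores in (0, 1).
    """
    text = l2_normalize(text_feature)
    score_map = []
    for row in pixel_features:
        scores = []
        for pixel in row:
            # Cosine similarity between the unit pixel and unit text vectors.
            cos = sum(p * t for p, t in zip(l2_normalize(pixel), text))
            # Sigmoid of the temperature-scaled similarity.
            scores.append(1.0 / (1.0 + math.exp(-cos / temperature)))
        score_map.append(scores)
    return score_map
```

Under these assumptions, a pixel whose embedding points in the same direction as the text embedding scores close to 1, an orthogonal one scores exactly 0.5, and an opposite one scores close to 0. A real detector would compute this on batched tensors rather than nested lists.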