Abstract: A growing number of end-to-end text spotting methods based on the Transformer architecture have demonstrated superior performance. These methods use a bipartite graph matching algorithm to perform one-to-one optimal matching between predicted objects and ground-truth objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, thereby degrading the training performance of the model. Existing literature applies denoising training to address the instability of bipartite graph matching in object detection tasks. Unfortunately, this denoising training method cannot be directly applied to text spotting, because text spotting requires detecting irregular shapes and performing text recognition, which is more complex than classification. To address this issue, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text spotting. Specifically, we decompose the queries of the denoising part into noised positional queries and noised content queries. We use the four Bezier control points of the Bezier center curve to generate the noised positional queries. For the noised content queries, considering that outputting the text in a fixed positional order is not conducive to aligning position with content, we employ a masked character sliding method to initialize the noised content queries, thereby assisting in the alignment of text content and position. To improve the model's perception of the background, we further use an additional loss function for background character classification in the denoising training part. Although DNTextSpotter is conceptually simple, it outperforms the state-of-the-art methods on four benchmarks (Total-Text, SCUT-CTW1500, ICDAR15, and Inverse-Text), notably yielding an improvement of 11.3% over the best approach on the Inverse-Text dataset.
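The matching instability mentioned in the abstract can be illustrated with a minimal sketch. It is not the DNTextSpotter implementation: the function name `optimal_matching` and both cost matrices are hypothetical, and a brute-force search stands in for the Hungarian algorithm that such methods typically use. The point is only that a small change in matching costs between two training steps can flip the one-to-one assignment, so the same prediction is supervised by a different target.

```python
from itertools import permutations


def optimal_matching(cost):
    """Brute-force minimum-cost one-to-one assignment.

    cost[i][j] is the (hypothetical) cost of matching prediction i
    to ground-truth target j. Returns (assignment, total_cost), where
    assignment[i] is the target index matched to prediction i.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


# Illustrative (made-up) costs at two nearby training steps.
# The entries differ by only 0.01 to 0.02, yet the optimal assignment flips.
cost_step_a = [[0.50, 0.52],
               [0.51, 0.50]]
cost_step_b = [[0.52, 0.50],
               [0.50, 0.51]]

match_a, _ = optimal_matching(cost_step_a)
match_b, _ = optimal_matching(cost_step_b)
print("step A assignment:", match_a)
print("step B assignment:", match_b)
```

Denoising training sidesteps this for the denoising queries, since each noised query is derived from a known ground-truth instance and therefore needs no matching step.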

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

Transferring General Multimodal Pretrained Models to Text Recognition

Exploring the Capacity of an Orderless Box Discretization Network for Multi-orientation Scene Text Detection

OPMP: An Omnidirectional Pyramid Mask Proposal Network for Arbitrary-Shape Scene Text Detection

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

Decoder Pre-Training with only Text for Scene Text Recognition

Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning

DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training

PP-OCR: A Practical Ultra Lightweight OCR System

MOST: A Multi-Oriented Scene Text Detector with Localization Refinement

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes

Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System

TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

Advance One-Shot Multispectral Instance Detection With Text's Supervision

Turning a CLIP Model into a Scene Text Detector

MorphText: Deep Morphology Regularized Arbitrary-shape Scene Text Detection

OTE: Exploring Accurate Scene Text Recognition Using One Token