Abstract: Interest in detecting scene text of arbitrary shape is growing. Text detection has evolved from handling only horizontal text to handling multi-oriented and arbitrarily shaped text. However, scene text detection remains challenging because text instances vary widely in size, aspect ratio, shape, and orientation, and because annotations are often coarse. Regression-based methods, which are inspired by object detection, have difficulty fitting the edges of arbitrarily shaped text. Segmentation-based methods, on the other hand, predict at the pixel level and can therefore fit arbitrarily shaped text better. However, inaccurate text annotations and the distribution characteristics of text pixels, which include a large number of background pixels and misclassified pixels, degrade the performance of segmentation-based text detection methods to some extent. Whether a pixel belongs to a text region usually depends heavily on the strength of its semantic information and on its position within the text area. Based on these two observations, we propose a robust scene text detection method that combines position and semantic information. First, we add position information to the images using a position encoding module (PosEM) to help the model learn the implicit feature relationships associated with position. Second, we use a semantic enhancement module (SEM) to strengthen the model's focus on the semantic information in the image during feature extraction. Then, to minimize the effect of noise caused by inaccurate text annotations and the distribution characteristics of text pixels, we convert the detection results into a probability map that represents the text distribution more reasonably. Finally, we reconstruct and filter the text instances using a post-processing algorithm to reduce false positives. The experimental results show that our model improves significantly on the Total-Text, MSRA-TD500, and CTW1500 datasets, outperforming most previous advanced algorithms.
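
To make two of the steps in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that position information can be illustrated by appending normalized coordinate channels to the image, and that the probability map can be illustrated by a Gaussian falloff from the center of a text region. The abstract specifies neither choice; the function names `add_position_channels` and `soft_probability_map` and the parameter `sigma` are hypothetical.

```python
import numpy as np


def add_position_channels(image):
    """Append normalized y and x coordinate channels to an H x W x C image.

    Assumption: a stand-in for a position encoding step, not the paper's PosEM.
    """
    h, w = image.shape[:2]
    ys = np.linspace(0.0, 1.0, h, dtype=np.float32)
    xs = np.linspace(0.0, 1.0, w, dtype=np.float32)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.concatenate(
        [image.astype(np.float32), yy[..., None], xx[..., None]], axis=-1
    )


def soft_probability_map(mask, sigma=0.5):
    """Turn a binary H x W text mask into a soft map that peaks at the region center.

    Assumption: a Gaussian falloff scaled by the region's extent, so pixels near
    the (possibly inaccurate) annotation boundary get lower confidence.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return np.zeros(mask.shape, dtype=np.float32)
    cy, cx = ys.mean(), xs.mean()
    # Scale the spread by the region's height and width.
    sy = (ys.max() - ys.min() + 1) * sigma
    sx = (xs.max() - xs.min() + 1) * sigma
    yy, xx = np.indices(mask.shape)
    gauss = np.exp(-(((yy - cy) / sy) ** 2 + ((xx - cx) / sx) ** 2) / 2.0)
    # Keep values only inside the annotated region.
    return (gauss * (mask > 0)).astype(np.float32)
```

In this sketch, a pixel at the center of a text region receives probability 1 and pixels toward the boundary receive progressively lower values, which is one way a soft map can down-weight pixels that coarse annotations may have mislabeled.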

Text Position-Aware Pixel Aggregation Network with Adaptive Gaussian Threshold: Detecting Text in the Wild

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Learning Pixel Affinity Pyramid for Arbitrary-Shaped Text Detection

Adaptive Segmentation Network for Scene Text Detection

HPNet: Text Detection Network with Hybrid Attention and Pixel Aggregation for Irregularly-Shaped Nearby Texts

MGPAN: Mask Guided Pixel Aggregation Network

ContourNet: Taking a Further Step Toward Accurate Arbitrary-shaped Scene Text Detection

A Direct Regression Scene Text Detector with Position-Sensitive Segmentation

PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text

Tk-Text: Multi-Shaped Scene Text Detection Via Instance Segmentation

Arbitrary-Shaped Text Detection with Adaptive Text Region Representation

ADNet: Rethinking the Shrunk Polygon-Based Approach in Scene Text Detection

Text proposals with location-awareness-attention network for arbitrarily shaped scene text detection and recognition

What's Wrong with the Bottom-up Methods in Arbitrary-shape Scene Text Detection

An Accurate Threshold Insensitive Kernel Detector for Arbitrary Shaped Text

Arbitrary-shaped scene text detection by predicting distance map

OPMP: An Omnidirectional Pyramid Mask Proposal Network for Arbitrary-Shape Scene Text Detection

CentripetalText: an Efficient Text Instance Representation for Scene Text Detection

A Robust Method: Arbitrary Shape Text Detection Combining Semantic and Position Information

Shape Robust Text Detection with Progressive Scale Expansion Network

Accurate Scene Text Detection Via Scale-Aware Data Augmentation and Shape Similarity Constraint