Towards Unified Multi-granularity Text Detection with Interactive Attention

Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

2024-05-30

Abstract: Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. This design enables DAT to efficiently manage text instances at different granularities, including *word*, *line*, *paragraph* and *page*. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances at varying granularities by correlating structural information across different text queries. As a result, it enables the model to achieve mutually beneficial detection performance across multiple text granularities. Additionally, a prompt-based segmentation module refines detection outcomes for texts of arbitrary curvature and complex layouts, thereby improving DAT's accuracy and expanding its real-world applicability. Experimental results demonstrate that DAT achieves state-of-the-art performance across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses a limitation of existing OCR engines and document image analysis systems: they train separate models for text detection in each scenario and at each granularity, which incurs significant computational complexity and resource demands. To solve this, the paper proposes "Detect Any Text" (DAT), a paradigm that unifies scene text detection, layout analysis, and document page detection in a single end-to-end model.

The key innovation of DAT is its across-granularity interactive attention module, which correlates structural information across text queries of different granularities. This strengthens the representation learning of text instances from both bottom-up and top-down perspectives, so that detection at each granularity benefits from the others. A prompt-based segmentation module then refines the detection results for text of arbitrary shape and complex layout, improving accuracy and broadening real-world applicability. The paper also designs a multi-granularity detection framework with a mixed-granularity training strategy, which allows datasets with incomplete granularity annotations to be trained on in parallel.

With these components, DAT achieves state-of-the-art performance on a range of text-related benchmarks, including multi-oriented and arbitrarily-shaped scene text detection, document layout analysis, and page detection. In summary, the paper aims to build a single, efficient model that handles text detection at different granularities, reduces computational complexity, and improves the understanding of complex text and layouts.
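To make the idea of correlating text queries across granularities more concrete, the following is a minimal sketch in plain Python. It is not the DAT implementation: it assumes that queries from all granularities are simply concatenated and mixed with one round of scaled dot-product self-attention plus a residual connection, and it omits the learned projection matrices. The names `GRANULARITIES`, `softmax`, and `cross_granularity_attention` are hypothetical and chosen only for illustration.

```python
import math

# Granularity names taken from the abstract; the ordering is an assumption.
GRANULARITIES = ("word", "line", "paragraph", "page")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_granularity_attention(queries):
    """Let every text query attend to the queries of all granularities.

    `queries` maps a granularity name to a list of query vectors
    (lists of floats sharing one dimension). Learned query/key/value
    projections are replaced by the identity, a simplification made
    purely for illustration.
    """
    # Flatten all queries into one sequence, remembering their origin.
    owners, tokens = [], []
    for name in GRANULARITIES:
        for vec in queries.get(name, []):
            owners.append(name)
            tokens.append(vec)

    updated = {name: [] for name in GRANULARITIES}
    if not tokens:
        return updated

    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    for name, q in zip(owners, tokens):
        # Scaled dot-product scores against every query of any granularity.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Attention-weighted sum of all queries, then a residual connection.
        mixed = [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)]
        updated[name].append([a + b for a, b in zip(q, mixed)])
    return updated


# Toy input: two word queries, one line query, one page query, no paragraphs.
queries = {
    "word": [[1.0, 0.0], [0.0, 1.0]],
    "line": [[1.0, 1.0]],
    "page": [[0.5, 0.5]],
}
refined = cross_granularity_attention(queries)
```

In this sketch a word query can draw on line and page queries and vice versa, which is one simple way to read the bottom-up and top-down interaction described above. A granularity with no queries, such as `paragraph` in the toy input, contributes nothing and receives an empty list back.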

MOST: A Multi-Oriented Scene Text Detector with Localization Refinement

Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection

High-speed Scene Text Detection with Attention and Multi-scale Label Generation

TextDCT: Arbitrary-Shaped Text Detection Via Discrete Cosine Transform Mask

Text Position-Aware Pixel Aggregation Network with Adaptive Gaussian Threshold: Detecting Text in the Wild

Using of Attention for Scene Text Detection

Efficient Scene Text Detection with Textual Attention Tower

Multi-Orientation Scene Text Detection with Adaptive Clustering

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation

What's Wrong with the Bottom-up Methods in Arbitrary-shape Scene Text Detection

DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning

Learning Pixel Affinity Pyramid for Arbitrary-Shaped Text Detection

A Multi-Scale Natural Scene Text Detection Method Based on Attention Feature Extraction and Cascade Feature Fusion

Real-Time Scene Text Detection With Differentiable Binarization and Adaptive Scale Fusion

WordSup: Exploiting Word Annotations for Character based Text Detection

Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition

Attention-based Feature Decomposition-Reconstruction Network for Scene Text Detection

Multi-oriented Text Detection from Natural Scene Images Based on a CNN and Pruning Non-Adjacent Graph Edges

Aggregated Text Transformer for Scene Text Detection