Towards Unified Multi-granularity Text Detection with Interactive Attention

Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang
2024-05-30
Abstract: Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. This design enables DAT to efficiently manage text instances at different granularities, including *word*, *line*, *paragraph* and *page*. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances at varying granularities by correlating structural information across different text queries. As a result, it enables the model to achieve mutually beneficial detection performances across multiple text granularities. Additionally, a prompt-based segmentation module refines detection outcomes for texts of arbitrary curvature and complex layouts, thereby improving DAT's accuracy and expanding its real-world applicability. Experimental results demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a limitation of existing OCR engines and document image analysis systems: they train a separate model for each text detection scenario and granularity, which incurs significant computational cost and resource demands. As a remedy, the paper proposes "Detect Any Text" (DAT), a paradigm that unifies scene text detection, layout analysis, and document page detection in a single end-to-end model covering the word, line, paragraph, and page granularities. The key innovation is the cross-granularity interactive attention module, which correlates structural information across the text queries of different granularities. This strengthens the representation of text instances from both bottom-up and top-down perspectives, so that detection at each granularity benefits from the others. A prompt-based segmentation module then refines the detection results for text of arbitrary shape and complex layout, improving accuracy and broadening real-world applicability. The paper also designs a multi-granularity detection framework with a mixed-granularity training strategy, which allows joint training on datasets whose granularity annotations are incomplete. With these components, DAT achieves state-of-the-art performance on a range of text-related benchmarks, including multi-oriented and arbitrarily shaped scene text detection, document layout analysis, and page detection. In summary, the paper aims to build a single, efficient model that handles text detection at every granularity, reducing computational complexity while improving the understanding of complex text and layouts.
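To make the idea of correlating text queries across granularities concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the function name `cross_granularity_attention`, the use of identity projections in place of learned query/key/value weights, and the single-head scaled dot-product form with a residual connection are all assumptions chosen for illustration. Each query attends over the queries of every granularity, so a word query can draw on line, paragraph, and page context and vice versa.

```python
import math

# Granularities named in the abstract.
GRANULARITIES = ("word", "line", "paragraph", "page")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_granularity_attention(queries):
    """Update each query with context pooled from all granularities.

    `queries` maps a granularity name to a list of equal-length vectors.
    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    # Flatten every granularity into one key/value sequence.
    flat = [q for g in GRANULARITIES for q in queries.get(g, [])]
    if not flat:
        return {g: [] for g in GRANULARITIES}

    dim = len(flat[0])
    scale = math.sqrt(dim)
    updated = {}
    for g in GRANULARITIES:
        updated[g] = []
        for q in queries.get(g, []):
            # Scaled dot-product score against every query, at every granularity.
            scores = [sum(a * b for a, b in zip(q, k)) / scale for k in flat]
            weights = softmax(scores)
            # Attention-weighted sum of the value vectors (here, the keys themselves).
            context = [
                sum(w * k[i] for w, k in zip(weights, flat)) for i in range(dim)
            ]
            # Residual connection keeps the original query content.
            updated[g].append([qi + ci for qi, ci in zip(q, context)])
    return updated


if __name__ == "__main__":
    toy_queries = {
        "word": [[1.0, 0.0], [0.0, 1.0]],
        "line": [[1.0, 1.0]],
        "paragraph": [],
        "page": [[0.5, 0.5]],
    }
    for name, vectors in cross_granularity_attention(toy_queries).items():
        print(name, vectors)
```

In a trained model the projections would be learned and the attention would typically be multi-head; the sketch only shows the information flow, in which the number of queries per granularity is preserved while each query is enriched with structure from the others.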