Abstract: Text spotting has seen tremendous progress in recent years, yielding performant techniques which can extract text at the character, word or line level. However, extracting blocks of text from images (block-level text spotting) is relatively unexplored. Blocks contain more context than individual lines, words or characters, so block-level text spotting would enhance downstream applications, such as translation, which benefit from added context. We propose a novel method, BTS-LLM (Block-level Text Spotting with LLMs), to identify text at the block level. BTS-LLM has three parts: 1) detecting and recognizing text at the line level, 2) grouping lines into blocks and 3) finding the best order of lines within a block using a large language model (LLM). We aim to exploit the strong semantic knowledge in LLMs for accurate block-level text spotting. Consequently, if the spotted text is semantically meaningful but has been corrupted during text recognition, the LLM is also able to rectify mistakes in the text and produce a reconstruction of it.

What problem does this paper attempt to address?

The paper addresses **block-level text spotting**. Existing text spotting techniques mainly operate at the character, word, or line level, while block-level text spotting (recognizing and understanding multiple lines of text in an image as a whole) remains relatively unexplored. A block carries more context than a single line, word, or character, so spotting text at the block level can improve downstream applications such as translation.

### Main Problems and Challenges

1. **Complexity of Block-level Text Spotting**:
   - Block-level text spotting requires not only recognizing the text content but also recovering the order in which the text should be read, which combines visual and language-understanding tasks.
   - Scene text has a free structure (there is no fixed pattern for the position and arrangement of text in an image), which makes the correct reading order harder to determine.
2. **Combination of Semantic Understanding and Spatial Arrangement**:
   - The correct order of the lines within a block depends not only on their spatial position (the arrangement of their bounding boxes) but also on the semantic meaning of the text.
   - For example, when the spatial arrangement is ambiguous or misleading, the order that makes the most semantic sense may be the correct one.
3. **Limitations of Existing Methods**:
   - Current line-level text spotting methods cannot be applied directly to block-level text because they do not treat multiple lines of text as a whole.
   - Existing block-level text detection methods usually focus only on detecting bounding boxes and ignore the actual content of the text and its semantics.
### Solutions

To solve the above problems, the authors propose a new method, **BTS-LLM (Block-level Text Spotting with LLMs)**, which performs block-level text spotting in three steps:

1. **Line-level Text Detection and Recognition**: Use an existing high-performance model (such as Unified Detector) to detect text at the line level and recognize the content of each line.
2. **Combining Lines into Blocks**: Combine adjacent lines into blocks according to their positional relationships.
3. **Using a Large Language Model (LLM) to Determine the Line Order within Blocks**: Use the semantic understanding of an LLM to determine the correct order of the lines within a block. If several orders are possible, the LLM selects the most reasonable one according to semantics; if there is no obvious semantic order, the order is determined by the spatial arrangement.

### Main Contributions

- Propose a pipeline-based method for block-level text spotting.
- For the first time, introduce large language models (LLMs) into the text spotting task, using their semantic understanding to improve spotting accuracy.
- Show experimentally that the method performs well on block-level text spotting, especially on high-density text.

### Summary

The main goal of this paper is to fill the gap in block-level text spotting. By combining the strengths of vision and language models, it provides a more accurate and comprehensive text spotting method. The method not only improves the accuracy of text recognition but also improves the understanding of text content, providing better support for downstream applications.
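As an illustration of steps 2 and 3 of the pipeline described above, the following Python sketch groups line boxes into blocks and then picks a line order for each block. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Line` type, the `near` proximity rule, the `max_gap` threshold, and the `toy_score` function (a stand-in for an LLM-based plausibility score) are all hypothetical names introduced here, and the exhaustive permutation search is only practical for blocks with a handful of lines.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Line:
    """A recognized text line with its axis-aligned bounding box (hypothetical type)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def near(a: Line, b: Line, max_gap: float) -> bool:
    """Assumed proximity rule: boxes are close if both gaps fit within a margin."""
    margin = max_gap * min(a.h, b.h)
    h_gap = max(a.x, b.x) - min(a.x + a.w, b.x + b.w)  # negative when overlapping
    v_gap = max(a.y, b.y) - min(a.y + a.h, b.y + b.h)
    return h_gap <= margin and v_gap <= margin


def group_into_blocks(lines: Sequence[Line], max_gap: float = 1.0) -> List[List[Line]]:
    """Merge lines into blocks: a line joins (and fuses) every block it is near."""
    blocks: List[List[Line]] = []
    for line in lines:
        touching, rest = [], []
        for block in blocks:
            is_near = any(near(line, other, max_gap) for other in block)
            (touching if is_near else rest).append(block)
        merged = [line] + [l for block in touching for l in block]
        blocks = rest + [merged]
    return blocks


def order_block(block: Sequence[Line],
                score: Callable[[List[str]], float]) -> List[Line]:
    """Pick the highest-scoring order; ties keep the spatial (top-down) order."""
    spatial = sorted(block, key=lambda l: (l.y, l.x))
    best, best_score = spatial, score([l.text for l in spatial])
    for perm in permutations(spatial):
        s = score([l.text for l in perm])
        if s > best_score:  # strict comparison implements the spatial fallback
            best, best_score = list(perm), s
    return best


def toy_score(texts: List[str]) -> float:
    """Stand-in for an LLM score: counts known phrases in the joined text."""
    joined = " ".join(texts).lower()
    return sum(phrase in joined for phrase in ("farm fresh", "fresh eggs"))


# A slightly tilted sign: "fresh" sits a few pixels higher than "Farm".
lines = [
    Line("fresh", 120, 48, 80, 20),
    Line("Farm", 10, 52, 100, 20),
    Line("eggs", 60, 80, 80, 20),
    Line("OPEN", 400, 300, 90, 30),
]
blocks = group_into_blocks(lines)
ordered = [[l.text for l in order_block(b, toy_score)] for b in blocks]
print(ordered)
```

In this example a purely spatial sort would read the sign as "fresh Farm eggs", because "fresh" is detected slightly higher in the image; the scoring function prefers "Farm fresh eggs". When the scorer expresses no preference, `order_block` returns the spatial order, mirroring the fallback described in step 3.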
