Block-level Text Spotting with LLMs

Ganesh Bannur, Bharadwaj Amrutur
2024-06-19
Abstract: Text spotting has seen tremendous progress in recent years, yielding performant techniques that can extract text at the character, word or line level. However, extracting blocks of text from images (block-level text spotting) is relatively unexplored. Blocks contain more context than individual lines, words or characters, so block-level text spotting would enhance downstream applications, such as translation, which benefit from added context. We propose a novel method, BTS-LLM (Block-level Text Spotting with LLMs), to identify text at the block level. BTS-LLM has three parts: 1) detecting and recognizing text at the line level, 2) grouping lines into blocks and 3) finding the best order of lines within a block using a large language model (LLM). We aim to exploit the strong semantic knowledge in LLMs for accurate block-level text spotting. Consequently, if the spotted text is semantically meaningful but has been corrupted during text recognition, the LLM can also rectify mistakes in the text and produce a reconstruction of it.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses **block-level text spotting**. Existing text spotting techniques mainly work at the character, word or line level, while block-level text spotting (recognizing and understanding multiple lines of text in an image as a single unit) has been explored far less. A block carries more context than a single line, word or character, so spotting text at the block level can improve downstream applications such as translation.

### Main Problems and Challenges

1. **Complexity of Block-level Text Spotting**:
   - Block-level text spotting requires not only recognizing the text content but also recovering the order in which the text should be read, which combines visual and language-understanding tasks.
   - Scene text is freely structured (there is no fixed pattern for the position and arrangement of text in an image), which makes the correct reading order harder to determine.
2. **Combination of Semantic Understanding and Spatial Arrangement**:
   - The correct order of the lines within a block depends not only on their spatial position (the arrangement of their bounding boxes) but also on the semantic meaning of the text.
   - For example, when the spatial arrangement is ambiguous or misleading, the order that makes the most semantic sense may be the correct one.
3. **Limitations of Existing Methods**:
   - Current line-level text spotting methods cannot be applied directly to block-level text because they do not treat multiple lines of text as a whole.
   - Existing block-level text detection methods usually focus only on detecting bounding boxes and ignore the actual content of the text and its semantics.
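
The second challenge above can be illustrated with a minimal sketch. It is not taken from the paper: the `Line` class, the coordinates and the `spatial_order` baseline are all hypothetical, chosen only to show how a purely spatial rule (top-to-bottom, then left-to-right) can produce a reading order that a human would reject.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """A recognized text line with the top-left corner of its bounding box.

    Hypothetical structure for illustration; not the paper's data format.
    """
    text: str
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels


def spatial_order(lines: List[Line]) -> List[Line]:
    """Baseline reading order: top-to-bottom, then left-to-right."""
    return sorted(lines, key=lambda line: (line.y, line.x))


# Assumed example: on a slightly tilted sign, the word on the right
# sits a few pixels higher than the word on the left.
lines = [
    Line("ENTRY", x=120, y=8),
    Line("NO", x=10, y=12),
]

ordered = [line.text for line in spatial_order(lines)]
print(ordered)  # the spatial rule puts "ENTRY" before "NO"
```

Here the spatial rule reads the sign as "ENTRY NO", while the semantically sensible reading is "NO ENTRY". Cases like this are why the summary argues that bounding-box positions alone are not enough.
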
### Solutions

To address these problems, the authors propose a new method, **BTS-LLM (Block-level Text Spotting with LLMs)**, which performs block-level text spotting in three steps:

1. **Line-level Text Detection and Recognition**: Use an existing high-performance model (such as Unified Detector) to detect text at the line level and recognize the content of each line.
2. **Combining Lines into Blocks**: Group adjacent lines into blocks according to their positional relationship.
3. **Using a Large Language Model (LLM) to Determine the Line Order within Blocks**: Use the semantic understanding of an LLM to determine the correct order of the lines within a block. If several orders are possible, the LLM selects the most semantically reasonable one; if there is no clear semantic order, the order is determined by the spatial arrangement.

### Main Contributions

- Propose a pipeline-based method for block-level text spotting.
- Introduce large language models (LLMs) into the text spotting task for the first time, using their semantic understanding to improve spotting accuracy.
- Show experimentally that the method performs well on block-level text spotting, especially on high-density text.

### Summary

The main goal of the paper is to fill the gap in block-level text spotting. By combining the strengths of vision and language models, it provides a more accurate and comprehensive text spotting method. The method not only improves the accuracy of text recognition but also improves understanding of the text content, giving better support to downstream applications.
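
Steps 2 and 3 listed under Solutions can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: line detection and recognition (step 1) is assumed to have already produced the `Line` objects, the grouping rule (a vertical-gap threshold) is a simple stand-in for whatever grouping the paper uses, and `toy_score` is a hypothetical placeholder for the LLM, which in practice would be queried to judge which ordering reads best.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Sequence


@dataclass
class Line:
    """A recognized line: text plus top-left corner and height of its box."""
    text: str
    x: float
    y: float
    h: float


def group_into_blocks(lines: List[Line], max_gap_ratio: float = 1.0) -> List[List[Line]]:
    """Greedy grouping (assumed rule): a line joins the current block when
    the vertical gap to the previous line is at most max_gap_ratio times
    that line's height."""
    blocks: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: (l.y, l.x)):
        if blocks:
            prev = blocks[-1][-1]
            gap = line.y - (prev.y + prev.h)
            if gap <= max_gap_ratio * prev.h:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


def best_order(block: List[Line], score: Callable[[Sequence[str]], float]) -> List[Line]:
    """Score every permutation and keep the best one.

    permutations() yields the input (spatial) order first and max() returns
    the first maximum, so ties fall back to the spatial arrangement.
    Factorial cost: only sensible for blocks with a handful of lines.
    """
    return list(max(permutations(block), key=lambda p: score([l.text for l in p])))


def toy_score(texts: Sequence[str]) -> float:
    """Hypothetical stand-in for an LLM: counts known adjacent word pairs."""
    known = {("NO", "ENTRY"), ("KEEP", "OUT")}
    return sum((a, b) in known for a, b in zip(texts, texts[1:]))


lines = [
    Line("ENTRY", x=120, y=8, h=20),
    Line("NO", x=10, y=12, h=20),
    Line("OPEN 24 HOURS", x=10, y=200, h=20),
]

result = [[l.text for l in best_order(block, toy_score)]
          for block in group_into_blocks(lines)]
print(result)
```

With these assumed inputs the two nearby lines form one block and are reordered to "NO", "ENTRY", while the distant line forms its own block. Falling back to the spatial order on ties mirrors the behavior described in step 3, where the spatial arrangement decides when no semantic order stands out.
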