Large Language Models for Simultaneous Named Entity Extraction and Spelling Correction

Edward Whittaker, Ikuo Kitagishi
2024-03-01
Abstract: Language Models (LMs) such as BERT have been shown to perform well on the task of identifying Named Entities (NE) in text. A BERT LM is typically used as a classifier to classify individual tokens in the input text, or to classify spans of tokens, as belonging to one of a set of possible NE categories.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the problem of simultaneously extracting named entities (NEs) from text and correcting spelling errors. Specifically, the authors extract named entities from images of Japanese store receipts that have been processed with Optical Character Recognition (OCR), and automatically correct potential spelling errors during extraction.

### Background and Motivation

1. **Business Application Needs**: Many business applications require the extraction of named entities, such as names, addresses, and dates, from text data. One such application is the automatic extraction of information from printed store receipts, which usually requires an additional OCR preprocessing step.
2. **Errors Introduced by OCR**: The OCR step converts digital images of paper receipts into machine-readable text, but it can introduce errors, especially where visually similar characters (such as O and 0, or Y and ¥) are easily confused.
3. **Limitations of Existing Methods**: Existing named entity extraction pipelines can tolerate OCR errors, but they cannot recover the correct surface forms of the entities.

### Solution

1. **Hypothesis**: The authors hypothesize that decoder-only large language models (LLMs) can extract named entities generatively and, in doing so, automatically correct spelling errors in the input text.
2. **Experimental Design**:
   - **Baseline Models**: The authors use two BERT models as baselines.
   - **Experimental Models**: The authors also experiment with eight open-source LLMs.
   - **Dataset**: The experimental dataset includes 968 images of store receipts, each manually annotated with up to 344 different named entity categories.
   - **Training Data Variants**:
     - “truth”: The original training data, containing the manually annotated characters of the text in each receipt image.
     - “ocr1”: The “truth” dataset with synthetic OCR errors randomly introduced based on a confusion matrix.
     - “ocr10”: The same procedure as “ocr1”, repeated 10 times with a different random seed each time to generate 10 different training datasets.
3. **Evaluation Metrics**: The authors use Precision, Recall, and weighted F-measure to evaluate the performance of the models.

### Main Findings

1. **Best Model**: The best fine-tuned LLM (rinna/youri-7b) achieved a final performance of \( F_{\text{final}}^{(\text{test})} = 85.6 \) on the test set, slightly outperforming the best BERT model (bert-base-multilingual-cased, \( F_{\text{final}}^{(\text{test})} = 84.6 \)).
2. **OCR Error Distribution**: The system's OCR character error rate on the test set is about 1%, but the errors are unevenly distributed. In particular, store names are often printed in hard-to-recognize fonts, which leads to a higher error rate for that category.
3. **Classifier vs. Generator**: The advantage of the BERT model as a classifier is that it implicitly determines the positions of the extracted named entities, which is very useful for downstream processing. LLMs, however, attempt to correct the surface forms of named entities as they generate them, a capability that existing methods lack.

### Conclusion

The authors demonstrate that fine-tuned LLMs can effectively extract named entities and correct spelling errors in text that contains OCR errors, although performance varies across categories (e.g., addresses). This approach provides new insights into handling OCR errors in practical applications.
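
To make the “ocr1” and “ocr10” training variants described under Experimental Design more concrete, the following is a minimal sketch of how synthetic OCR errors might be injected from a confusion table. It is not the authors' implementation: the `CONFUSIONS` table, the function names `add_ocr_noise` and `make_variants`, the sample receipt lines, and the 1% substitution rate are all assumptions chosen for illustration, and a real confusion matrix would be estimated from actual OCR output.

```python
import random

# Hypothetical confusion table: for each true character, the characters an OCR
# engine might output instead, with relative weights. Illustrative values only.
CONFUSIONS = {
    "O": {"0": 1.0},
    "0": {"O": 1.0},
    "Y": {"¥": 1.0},
    "¥": {"Y": 1.0},
}


def add_ocr_noise(text, error_rate, rng):
    """Replace each confusable character with probability `error_rate`.

    Characters that have no entry in CONFUSIONS are always left unchanged.
    """
    out = []
    for ch in text:
        alternatives = CONFUSIONS.get(ch)
        if alternatives and rng.random() < error_rate:
            chars = list(alternatives)
            weights = list(alternatives.values())
            ch = rng.choices(chars, weights=weights, k=1)[0]
        out.append(ch)
    return "".join(out)


def make_variants(truth, n_copies, error_rate=0.01, base_seed=0):
    """Return `n_copies` noisy copies of `truth`, each from a different seed."""
    variants = []
    for i in range(n_copies):
        rng = random.Random(base_seed + i)  # one seed per generated dataset
        variants.append([add_ocr_noise(line, error_rate, rng) for line in truth])
    return variants


# Made-up receipt lines standing in for the "truth" data.
truth = ["TOTAL ¥1,000", "YOKOHAMA STORE"]
ocr1 = make_variants(truth, n_copies=1)[0]   # one noisy copy
ocr10 = make_variants(truth, n_copies=10)    # ten copies, ten seeds
```

Seeding a separate `random.Random` instance per copy keeps each generated dataset reproducible while still differing from the others, which mirrors the "different random seed" description of “ocr10”.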