Abstract: This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Based on two open access large language models (Llama2 and Mistral), experiments demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models' performance on noisy text.

What problem does this paper attempt to address?

The paper attempts to address the impact of Optical Character Recognition (OCR) noise on Named Entity Recognition (NER) performance when dealing with historical documents. Specifically, the authors investigate whether there are neurons within the Transformer architecture that are sensitive to OCR noise and explore how these neurons affect the model's NER performance on historical documents. By analyzing the activation patterns of neurons under clean and noisy text inputs, the authors aim to identify and neutralize these OCR-sensitive neurons to improve the model's performance.

### Main Objectives of the Paper:

1. **Evaluate whether model components (particularly layers and neurons) are sensitive to OCR noise**: Identify components that consistently react to noisy text by measuring the activation differences of model components in response to clean and noisy inputs.
2. **Determine whether these components can be controlled to reduce their negative impact on NER performance**: Observe the effect on named entity detection performance in historical documents by attempting to neutralize OCR-sensitive neurons.

### Experimental Methods:

- **Experimental Setup**: Using two pre-trained large language models (Llama2 and Mistral), a token dataset with varying levels of OCR noise was constructed.
- **Detecting OCR Noise-Sensitive Layer Regions**: Compare the activation of each layer in response to correct and noisy tokens using the Centered Kernel Alignment (CKA) similarity index.
- **Identifying OCR Noise-Sensitive Neurons**: Identify significantly "deviant" neurons by analyzing the activation differences of neurons in response to correct and noisy tokens.
- **Neuron Ablation Experiments**: Observe the impact on NER performance by adjusting the activation values of specific neurons.
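The layer-level comparison described above can be illustrated with a small sketch of the CKA similarity index. It assumes the linear-kernel variant of CKA (the summary does not say which kernel the authors use), and the function names and toy activation values are hypothetical. Each matrix holds one row per token and one column per hidden unit; a score close to 1 means a layer represents clean and noisy tokens almost identically, while a lower score marks a layer that reacts to the noise.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def gram(rows):
    """Gram matrix of pairwise dot products between examples (rows)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def frobenius_inner(a, b):
    """Sum of element-wise products of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def linear_cka(x, y):
    """Linear CKA between two activation matrices with one row per token.

    Assumption: linear kernel; other kernels (e.g. RBF) are also used with CKA.
    """
    k_x = gram(center_columns(x))
    k_y = gram(center_columns(y))
    norm = math.sqrt(frobenius_inner(k_x, k_x) * frobenius_inner(k_y, k_y))
    return frobenius_inner(k_x, k_y) / norm

# Toy activations for one layer: 4 tokens, 2 hidden units (illustrative values).
clean = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
noisy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

similarity = linear_cka(clean, noisy)   # value in [0, 1]
self_similarity = linear_cka(clean, clean)  # identical inputs give 1 (up to rounding)
```

In practice the activations would come from a model's hidden states and would be handled with an array library; plain lists are used here only to keep the sketch dependency-free.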
### Key Findings:

- **Certain layers and neurons are highly sensitive to OCR noise**: Particularly in the middle layers (2-11 and 13-23), these layers exhibit significant activation differences in response to noisy inputs.
- **Neutralizing OCR-sensitive neurons can improve NER performance**: By neutralizing a specific number of neurons, the F1 score can be significantly improved on certain datasets.

### Conclusion:

The study provides a preliminary evaluation of OCR-sensitive layers and neurons in Transformer models, revealing their impact on model performance. Optimizing these sensitive layers, especially the final layers, can improve NER performance. However, the varying improvement effects across different datasets suggest the need for specific adaptation to different noise levels. Future work will focus on specific types of OCR errors and their distribution in training data, as well as extending to other models and datasets to verify the generality and performance improvement of sensitive neurons.
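To make the neuron-level procedure concrete, the following is a minimal sketch of how "deviant" neurons might be flagged and then neutralized. The selection rule (a z-score threshold on the mean absolute activation gap) and the neutralization rule (overwriting the activation with zero) are assumptions made for illustration; the summary does not state the authors' exact criterion or replacement value, and all names and numbers below are hypothetical.

```python
from statistics import mean, pstdev

def deviant_neurons(clean, noisy, z_threshold=2.0):
    """Return indices of neurons whose mean |clean - noisy| gap is an outlier.

    Assumption: a neuron counts as deviant when its gap lies more than
    `z_threshold` standard deviations above the mean gap over all neurons.
    """
    gaps = [
        mean(abs(c - n) for c, n in zip(col_clean, col_noisy))
        for col_clean, col_noisy in zip(zip(*clean), zip(*noisy))
    ]
    mu, sigma = mean(gaps), pstdev(gaps)
    if sigma == 0:
        return []  # every neuron reacts identically, nothing stands out
    return [i for i, g in enumerate(gaps) if (g - mu) / sigma > z_threshold]

def neutralize(activations, neuron_ids, value=0.0):
    """Return a copy of the activations with the selected neurons set to `value`."""
    ids = set(neuron_ids)
    return [[value if i in ids else a for i, a in enumerate(row)]
            for row in activations]

# Toy activations: 2 tokens, 6 neurons; only neuron 3 reacts to the noise.
clean = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
         [0.2, 0.1, 0.4, 0.3, 0.6, 0.5]]
noisy = [[0.1, 0.2, 0.3, 2.4, 0.5, 0.6],
         [0.2, 0.1, 0.4, 2.3, 0.6, 0.5]]

sensitive = deviant_neurons(clean, noisy)
patched = neutralize(noisy, sensitive)
```

In a real model the same overwrite would be applied to hidden activations during the forward pass, and the NER F1 score would be compared before and after the intervention.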

Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents
