Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents

Emanuela Boros, Maud Ehrmann
2024-09-26
Abstract: This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Based on two open access large language models (Llama2 and Mistral), experiments demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models' performance on noisy text.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the impact of Optical Character Recognition (OCR) noise on Named Entity Recognition (NER) performance in historical documents. Specifically, the authors investigate whether the Transformer architecture contains neurons that are sensitive to OCR noise, and how these neurons affect the model's NER performance on historical documents. By analyzing neuron activation patterns under clean and noisy text inputs, the authors aim to identify and neutralize these OCR-sensitive neurons to improve the model's performance.

### Main Objectives of the Paper:

1. **Evaluate whether model components (particularly layers and neurons) are sensitive to OCR noise**: Identify components that consistently react to noisy text by measuring how their activations differ between clean and noisy inputs.
2. **Determine whether these components can be controlled to reduce their negative impact on NER performance**: Neutralize OCR-sensitive neurons and observe the effect on named entity detection performance in historical documents.

### Experimental Methods:

- **Experimental Setup**: The authors use two pre-trained large language models (Llama2 and Mistral) and construct a token dataset with varying levels of OCR noise.
- **Detecting OCR Noise-Sensitive Layer Regions**: Compare the activations of each layer in response to correct and noisy tokens using the Centered Kernel Alignment (CKA) similarity index.
- **Identifying OCR Noise-Sensitive Neurons**: Identify significantly "deviant" neurons by analyzing the differences in neuron activations between correct and noisy tokens.
- **Neuron Ablation Experiments**: Adjust the activation values of specific neurons and observe the impact on NER performance.

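
The layer-level comparison described above can be illustrated with a minimal sketch of the CKA similarity index. The summary does not say which CKA variant the authors use, so the linear variant is assumed here. The function name `linear_cka` and the synthetic activation matrices are hypothetical and are not taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    Both inputs have shape (n_tokens, n_features); rows must refer to the
    same tokens (e.g. clean token i and its noisy counterpart i).
    Assumption: the linear variant of CKA; the paper may use another kernel.
    """
    # Centre every feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for one layer's activations (not real model data).
    clean_acts = rng.normal(size=(100, 16))
    noisy_acts = clean_acts + 0.5 * rng.normal(size=(100, 16))
    print("CKA(clean, clean):", linear_cka(clean_acts, clean_acts))
    print("CKA(clean, noisy):", linear_cka(clean_acts, noisy_acts))
```

A score near 1 means the layer represents clean and noisy tokens in nearly the same way. Lower scores mark layers whose representations shift under OCR noise, which is the kind of signal the layer-region analysis relies on.
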
### Key Findings:

- **Certain layers and neurons are highly sensitive to OCR noise**: The middle layers in particular (2-11 and 13-23) exhibit significant activation differences in response to noisy inputs.
- **Neutralizing OCR-sensitive neurons can improve NER performance**: Neutralizing a specific number of neurons significantly improves the F1 score on certain datasets.

### Conclusion:

The study provides a preliminary evaluation of OCR-sensitive layers and neurons in Transformer models, revealing their impact on model performance. Optimizing these sensitive layers, especially the final layers, can improve NER performance. However, the improvements vary across datasets, which suggests that the approach needs to be adapted to different noise levels. Future work will focus on specific types of OCR errors and their distribution in training data, and on extending the analysis to other models and datasets to verify how well the sensitive neurons and the resulting performance gains generalize.
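
The identify-then-neutralize procedure summarized above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the z-score criterion, the threshold of 3.0, zeroing as the neutralization value, and the function names `find_deviant_neurons` and `neutralise` are all hypothetical choices made for the example.

```python
import numpy as np


def find_deviant_neurons(
    clean: np.ndarray, noisy: np.ndarray, z_threshold: float = 3.0
) -> np.ndarray:
    """Return indices of neurons whose activations differ most under noise.

    clean, noisy: shape (n_tokens, n_neurons), row i of each matrix being the
    same token in its clean and OCR-noisy form.
    Assumption: "deviant" means a z-score above z_threshold on the mean
    absolute activation difference; the paper's criterion may differ.
    """
    diff = np.abs(noisy - clean).mean(axis=0)          # one value per neuron
    z = (diff - diff.mean()) / (diff.std() + 1e-12)    # standardise across neurons
    return np.flatnonzero(z > z_threshold)


def neutralise(
    activations: np.ndarray, neuron_ids: np.ndarray, value: float = 0.0
) -> np.ndarray:
    """Return a copy of the activations with the selected neurons set to `value`.

    Assumption: neutralisation is modelled as overwriting with a constant
    (zero by default); other replacement values are equally possible.
    """
    out = activations.copy()
    out[:, neuron_ids] = value
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic activations: neurons 5 and 40 are made to react to noise.
    clean_acts = rng.normal(size=(200, 64))
    noisy_acts = clean_acts + 0.01 * rng.normal(size=(200, 64))
    noisy_acts[:, [5, 40]] += 5.0

    ids = find_deviant_neurons(clean_acts, noisy_acts)
    patched = neutralise(noisy_acts, ids)
    print("deviant neurons:", ids.tolist())
    print("patched shape:", patched.shape)
```

In a real model the same overwrite would be applied inside the forward pass (for instance through a hook on the relevant layer), and the NER F1 score would be compared with and without it.
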