Handwriting Recognition in Historical Documents with Multimodal LLM

Lucian Li
2024-10-31
Abstract: There is an immense quantity of historical and cultural documentation that exists only as handwritten manuscripts. At the same time, performing OCR across scripts and different handwriting styles has proven to be an enormously difficult problem relative to the process of digitizing print. While recent Transformer-based models have achieved relatively strong performance, they rely heavily on manually transcribed training data and have difficulty generalizing across writers. Multimodal LLMs, such as GPT-4v and Gemini, have demonstrated effectiveness in performing OCR and computer vision tasks with few-shot prompting. In this paper, I evaluate the accuracy of handwritten document transcriptions generated by Gemini against the current state-of-the-art Transformer-based methods.
Keywords: Optical Character Recognition, Multimodal Language Models, Cultural Preservation, Mass Digitization, Handwriting Recognition
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the problem of recognizing and transcribing handwritten historical documents. Specifically, there are a large number of documents in historical literature that exist only in manuscript form, which cannot be processed by existing computational text analysis methods, leading to low research efficiency. Traditional Optical Character Recognition (OCR) technology faces significant challenges in handling handwritten texts, especially in generalizing across different writing styles and languages. Therefore, the paper aims to evaluate the performance of multimodal large language models (such as Gemini) in the task of transcribing handwritten documents and compare them with the current state-of-the-art Transformer-based methods.

### Main Issues:
1. **Digitization of Handwritten Documents**: A large number of historical documents exist only in manuscript form, making effective computational text analysis impossible.
2. **Limitations of OCR Technology**: Existing OCR technology performs poorly in handling handwritten texts, especially in terms of generalization across different writing styles and languages.
3. **Need for Data Annotation**: Current deep learning models require a large amount of manually annotated data, which is a significant obstacle in practical applications.

### Solution:
The paper explores the performance of multimodal large language models (such as Gemini) in the task of transcribing handwritten documents, examining their performance across different languages and writing styles, and comparing them with the current state-of-the-art Transformer-based methods. The specific methods include:
- **Dataset**: Using multiple multilingual corpora as evaluation sets, including 17th-century Dutch documents, 17th-century German documents, etc.
- **Model**: Re-implementing CNN-BiLSTM and TrOCR architectures, and conducting zero-shot and few-shot prompting experiments using the Gemini model.
- **Evaluation**: Evaluating the performance of each model using metrics such as Character Error Rate (CER).

### Objectives:
- Evaluate the performance of multimodal large language models in the task of transcribing handwritten documents.
- Explore the generalization ability of multimodal large language models across different languages and writing styles.
- Provide new technical means for the digitization and cultural preservation of historical documents.
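
The Character Error Rate named under Evaluation is conventionally defined as the character-level edit distance between a model transcription and the ground-truth transcription, divided by the length of the ground truth. The sketch below illustrates that conventional definition in Python. It is not taken from the paper: the function names `levenshtein` and `character_error_rate` are hypothetical, and the paper may apply text normalization (case folding, whitespace handling) that is not shown here.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference.

    Assumption: no normalization is applied before comparison.
    """
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a lower CER means a more accurate transcription, and the value can exceed 1.0 when the model output is much longer than the ground truth. For example, `character_error_rate("abcd", "abxd")` yields 0.25, since one of four reference characters is substituted.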