Abstract: There is an immense quantity of historical and cultural documentation that exists only as handwritten manuscripts. At the same time, performing OCR across scripts and handwriting styles has proven far more difficult than digitizing print. While recent Transformer-based models have achieved relatively strong performance, they rely heavily on manually transcribed training data and have difficulty generalizing across writers. Multimodal LLMs, such as GPT-4v and Gemini, have demonstrated effectiveness in performing OCR and computer vision tasks with few-shot prompting. In this paper, I evaluate the accuracy of handwritten document transcriptions generated by Gemini against the current state-of-the-art Transformer-based methods.

Keywords: Optical Character Recognition, Multimodal Language Models, Cultural Preservation, Mass Digitization, Handwriting Recognition

What problem does this paper attempt to address?

The paper attempts to address the problem of recognizing and transcribing handwritten historical documents. Specifically, there are a large number of documents in historical literature that exist only in manuscript form, which cannot be processed by existing computational text analysis methods, leading to low research efficiency. Traditional Optical Character Recognition (OCR) technology faces significant challenges in handling handwritten texts, especially in generalizing across different writing styles and languages. Therefore, the paper aims to evaluate the performance of multimodal large language models (such as Gemini) in the task of transcribing handwritten documents and compare them with the current state-of-the-art Transformer-based methods.

### Main Issues:

1. **Digitization of Handwritten Documents**: A large number of historical documents exist only in manuscript form, making effective computational text analysis impossible.
2. **Limitations of OCR Technology**: Existing OCR technology performs poorly in handling handwritten texts, especially in terms of generalization across different writing styles and languages.
3. **Need for Data Annotation**: Current deep learning models require a large amount of manually annotated data, which is a significant obstacle in practical applications.

### Solution:

The paper explores the performance of multimodal large language models (such as Gemini) in the task of transcribing handwritten documents, examining their performance across different languages and writing styles, and comparing them with the current state-of-the-art Transformer-based methods. The specific methods include:

- **Dataset**: Using multiple multilingual corpora as evaluation sets, including 17th-century Dutch documents, 17th-century German documents, etc.
- **Model**: Re-implementing CNN-BiLSTM and TrOCR architectures, and conducting zero-shot and few-shot prompting experiments using the Gemini model.
- **Evaluation**: Evaluating the performance of each model using metrics such as Character Error Rate (CER).

### Objectives:

- Evaluate the performance of multimodal large language models in the task of transcribing handwritten documents.
- Explore the generalization ability of multimodal large language models across different languages and writing styles.
- Provide new technical means for the digitization and cultural preservation of historical documents.
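
To make the Character Error Rate (CER) metric concrete, here is a minimal sketch in Python. It uses the common definition of CER: the character-level edit distance (insertions, deletions, substitutions) between a model transcription and the reference, divided by the reference length. The summary does not say which normalization or text preprocessing the paper applies, so the function names `levenshtein` and `cer` and the lack of any whitespace or case normalization are assumptions made for illustration.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character edits turning ref into hyp."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length.

    Assumption: no normalization (case, whitespace, Unicode) is applied;
    a real evaluation would need to state and apply its own.
    """
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

For example, `cer("kitten", "sitting")` returns `0.5`, since three edits are needed against a six-character reference. Because the denominator is the reference length, CER can exceed 1.0 when a model produces far more text than the reference contains.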

Handwriting Recognition in Historical Documents with Multimodal LLM
