Abstract:Handwriting Verification is a critical in document forensics. Deep learning based approaches often face skepticism from forensic document examiners due to their lack of explainability and reliance on extensive training data and handcrafted features. This paper explores using Vision Language Models (VLMs), such as OpenAI's GPT-4o and Google's PaliGemma, to address these challenges. By leveraging their Visual Question Answering capabilities and 0-shot Chain-of-Thought (CoT) reasoning, our goal is to provide clear, human-understandable explanations for model decisions. Our experiments on the CEDAR handwriting dataset demonstrate that VLMs offer enhanced interpretability, reduce the need for large training datasets, and adapt better to diverse handwriting styles. However, results show that the CNN-based ResNet-18 architecture outperforms the 0-shot CoT prompt engineering approach with GPT-4o (Accuracy: 70%) and supervised fine-tuned PaliGemma (Accuracy: 71%), achieving an accuracy of 84% on the CEDAR AND dataset. These findings highlight the potential of VLMs in generating human-interpretable decisions while underscoring the need for further advancements to match the performance of specialized deep learning models.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the interpretability and data - dependence issues of **Handwriting Verification in document forensics**. Specifically, although the existing deep - learning - based handwriting verification methods have good performance, they face the following challenges: 1. **Lack of interpretability**: Deep - learning models (such as Convolutional Neural Networks, CNN) are usually regarded as black - box models and are difficult to provide clear, human - understandable decision - making explanations, which makes Forensic Document Examiners (FDE) skeptical of their results. 2. **Dependence on a large amount of labeled data**: These methods require a large number of labeled handwriting samples for training, and collecting and labeling these data are both expensive and time - consuming. To solve these problems, this paper proposes to use **Vision - Language Models (VLMs)**, especially OpenAI's GPT - 4o and Google's PaliGemma. VLMs can generate clear, human - understandable explanations through their **Visual Question Answering (VQA) capabilities and 0 - shot Chain - of - Thought (CoT) reasoning**, thereby increasing the transparency and credibility of model decisions. In addition, VLMs can adapt to different handwriting styles and perform well in zero - shot or few - shot situations. ### Main objectives 1. **Improve the interpretability of the model**: Generate natural - language explanations through VLMs to enable FDE to better understand and trust the model's decisions. 2. **Reduce the dependence on large - scale training data**: Utilize the transfer - learning capabilities of VLMs to perform handwriting verification without a large amount of labeled data. 3. **Adapt to diverse handwriting styles**: VLMs can better handle handwriting samples of different writing styles and improve the generalization ability of the model. ### Experimental results Although VLMs perform well in terms of interpretability and adaptability, they still lag behind CNN models (such as ResNet - 18) specifically fine - tuned for the handwriting verification task in terms of performance. Specifically: - The accuracy rate of **ResNet - 18** on the CEDAR AND dataset is 84%. - The accuracy rate of the zero - shot CoT prompt engineering method of **GPT - 4o** is 70%. - The accuracy rate of the supervised fine - tuning method of **PaliGemma** is 71%. These results indicate that although VLMs have great potential in generating human - interpretable decisions, their fine - tuning strategies still need to be further improved to enhance their performance and reliability in specific tasks. ### Conclusion This research shows the application potential of VLMs in handwriting verification, especially in improving the interpretability and adaptability of the model. However, in order to make them more competitive in actual forensic forensics, future research needs to further optimize the fine - tuning methods of VLMs to narrow the performance gap between them and specially designed deep - learning models.

Vision-Language Model Based Handwriting Verification

Representing Online Handwriting for Recognition in Large Vision-Language Models

Self-Supervised Learning Based Handwriting Verification

Multi-dilation Convolutional Neural Network for Automatic Handwritten Signature Verification

GPT Sonograpy: Hand Gesture Decoding from Forearm Ultrasound Images via VLM

Explainable offline automatic signature verifier to support forensic handwriting examiners

Are VLMs Really Blind

Handwriting Recognition in Historical Documents with Multimodal LLM

Improving Accuracy and Explainability of Online Handwriting Recognition

AVN: an Adversarial Variation Network Model for Handwritten Signature Verification.

Vision language models are blind

Vision Graph Convolutional Network for Writer-Independent Offline Signature Verification

A Conformable Moments-Based Deep Learning System for Forged Handwriting Detection

Fully Convolutional Networks for Handwriting Recognition

Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis

Robust Handwriting Recognition with Limited and Noisy Data

From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design

Handwritten Signature Verification using Deep Learning

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Handwritten English word recognition using a deep learning based object detection architecture

Leveraging Expert Models for Training Deep Neural Networks in Scarce Data Domains: Application to Offline Handwritten Signature Verification