Hespi: A pipeline for automatically detecting information from herbarium specimen sheets

Robert Turnbull, Emily Fitzgerald, Karen Thompson, Joanne L. Birch
2024-10-11
Abstract: Specimen-associated biodiversity data are sought after for biological, environmental, climate, and conservation sciences. A rate shift is required for the extraction of data from specimen images to eliminate the bottleneck that the reliance on human-mediated transcription of these data represents. We applied advanced computer vision techniques to develop 'Hespi' (HErbarium Specimen sheet PIpeline), which extracts a pre-catalogue subset of collection data on the institutional labels on herbarium specimens from their digital images. The pipeline integrates two object detection models: the first detects bounding boxes around text-based labels, and the second detects bounding boxes around text-based data fields on the primary institutional label. The pipeline classifies text-based institutional labels as printed, typed, handwritten, or a combination, and applies Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for data extraction. The recognized text is then corrected against authoritative databases of taxon names. The extracted text is also corrected with the aid of a multimodal Large Language Model (LLM). Hespi accurately detects and extracts text for test datasets including specimen sheet images from international herbaria. The components of the pipeline are modular, and users can train their own models with their own data and use them in place of the models provided.
Computer Vision and Pattern Recognition, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The main problem this paper addresses is the slow extraction of biodiversity data from herbarium specimen images, which currently depends on manual transcription. The paper introduces an automated pipeline named Hespi (HErbarium Specimen sheet PIpeline) that uses advanced computer vision techniques to detect and extract text from specimen labels, accelerating the acquisition of data needed in the biological, environmental, climate, and conservation sciences.

### Specific Problem Description

1. **Data Extraction Bottleneck**: Digitizing specimen data currently depends on manual transcription, so data extraction cannot keep pace with demand. The paper points out that there are approximately 3,500 active herbaria globally, containing about 398 million physical specimens. Although the number of high-resolution specimen images has increased significantly in recent years, the rate at which the associated biodiversity data are transcribed has not.
2. **Requirement for Automated Data Extraction**: Eliminating this bottleneck requires a method that extracts text data from specimen images quickly and accurately. The Hespi pipeline automates the detection and extraction of text on specimen labels by combining deep learning, optical character recognition (OCR), handwritten text recognition (HTR), and a multimodal large language model (LLM).

### Solution Overview

- **Sheet-Component Model**: Detects the components of a specimen sheet image, such as institutional labels, tax numbers, and annotation labels, and outputs a bounding box for each.
- **Label-Field Model**: Detects specific fields on the institutional label, such as taxonomic information, collector numbers, and geographical locations.
- **Label Classifier**: Distinguishes the type of text on the label (printed, typed, handwritten, or mixed).
- **Text Recognition**: Applies OCR and HTR engines to recognize the text, which is then corrected against authoritative databases of taxon names.
- **Multimodal LLM Correction**: Applies a multimodal large language model as a final correction step on the extracted text to improve accuracy.

### Main Contributions

- **Improving Data Extraction Efficiency**: Automating the process greatly reduces the time and cost of manual transcription.
- **Modular Design**: Each component of the Hespi pipeline is modular, and users can replace or fine-tune the models as needed to adapt to different data distributions.
- **Multilingual Support**: The pipeline can handle label data in different languages and formats, which broadens its applicability.

In conclusion, the Hespi pipeline provides an efficient, accurate, and flexible way to relieve the bottleneck in specimen data extraction, which helps accelerate the digitization of biodiversity data.
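The stages listed under Solution Overview can be pictured as a chain of swappable callables. The following is a minimal sketch in Python and is not Hespi's actual code or API: the names `Detection`, `run_pipeline`, `correct_against_reference`, the field names, and the stub stages are all hypothetical, and the standard-library `difflib` fuzzy matcher merely stands in for correction against an authoritative taxon-name database. The label classifier and the LLM correction step are omitted for brevity.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    name: str  # hypothetical class name, e.g. "institutional_label" or "genus"
    box: Box


# Hypothetical stage signatures: any callable with these shapes can be swapped in.
SheetDetector = Callable[[str], List[Detection]]       # image path -> sheet components
FieldDetector = Callable[[str, Box], List[Detection]]  # image path + label box -> fields
Recognizer = Callable[[str, Box], str]                 # image path + field box -> raw text


def correct_against_reference(text: str, reference: List[str], cutoff: float = 0.8) -> str:
    """Return the closest reference name, or the text unchanged if none is similar enough."""
    matches = get_close_matches(text, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else text


def run_pipeline(
    image_path: str,
    detect_components: SheetDetector,
    detect_fields: FieldDetector,
    recognize: Recognizer,
    reference_names: Dict[str, List[str]],
) -> Dict[str, str]:
    """Detect the institutional label, then its fields, then recognize and correct text."""
    fields: Dict[str, str] = {}
    for component in detect_components(image_path):
        if component.name != "institutional_label":
            continue  # other components are detected but not transcribed in this sketch
        for field in detect_fields(image_path, component.box):
            raw = recognize(image_path, field.box).strip()
            # Fields without a reference list pass through uncorrected.
            names = reference_names.get(field.name, [])
            fields[field.name] = correct_against_reference(raw, names)
    return fields


# Stub stages with made-up outputs, standing in for trained models and OCR/HTR engines.
def stub_components(image_path: str) -> List[Detection]:
    return [
        Detection("institutional_label", (0, 0, 400, 200)),
        Detection("annotation_label", (0, 300, 200, 380)),
    ]


def stub_fields(image_path: str, label_box: Box) -> List[Detection]:
    return [Detection("genus", (10, 10, 200, 40)), Detection("collector", (10, 50, 200, 80))]


def stub_recognize(image_path: str, box: Box) -> str:
    return {(10, 10, 200, 40): "Eucalyptis ", (10, 50, 200, 80): "J. Smith"}[box]


demo = run_pipeline(
    "sheet.jpg",
    stub_components,
    stub_fields,
    stub_recognize,
    reference_names={"genus": ["Eucalyptus", "Acacia"]},
)
print(demo)
```

Because each stage is passed in as a plain callable, replacing a detector or recognizer with a model trained on other data only requires matching the call signature, which mirrors the modularity the paper describes. The similarity `cutoff` of 0.8 is an arbitrary choice for this sketch.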