Abstract: As the urgency to address the climate crisis intensifies, accurate and comprehensive biodiversity data have become crucial for informing climate change studies, tracking key environmental indicators, and building global biodiversity monitoring platforms. The Biodiversity Heritage Library (BHL) plays a vital role in the core biodiversity infrastructure, housing over 60 million pages of digitized literature about life on Earth. Recognizing the value of over 500 years of data in BHL, a global network of BHL staff is working to establish a scalable data pipeline that provides actionable occurrence data from BHL's vast and diverse collections. However, transforming textual content into FAIR (findable, accessible, interoperable, reusable) data poses challenges because descriptive metadata are often missing and the unstructured outputs of commercial text engines are error-ridden (Fig. 1). Although the wealth of knowledge in BHL is now available to global audiences, the biodiversity and climate data contained in its textual corpus remain underutilized, which hinders scientific research, hampers informed decision-making for conservation efforts, and limits our understanding of the biodiversity patterns crucial for addressing the climate crisis. By leveraging recent advances in text recognition engines, along with AI (artificial intelligence) models such as OpenAI's CLIP (Contrastive Language-Image Pre-Training) and nascent features in transcription platforms, BHL staff are beginning to process vast amounts of textual and image data and to transform centuries' worth of data from BHL collections into computationally usable formats. These recent technological breakthroughs offer an opportunity to empower the global biodiversity community with insights from our shared past and to facilitate the integration of historical knowledge into climate action initiatives.
To bridge gaps in the historical record and unlock the potential of BHL, a multi-pronged effort using cross-disciplinary approaches is being piloted. These technical approaches were selected for their efficiency and their ability to generate rapid results that can be applied across the diverse range of materials in BHL (Fig. 2). Piloting a data pipeline that scales to 60 million pages requires considerable investigation, experimentation, and resources, but it will have an appreciable impact on global conservation efforts by establishing historic baselines that reach deeper into time. This presentation will focus on the identification, extraction, and transformation of OCR (optical character recognition) text into structured data outputs in BHL. Approaches include: upgrading legacy OCR text using the Tesseract OCR engine to improve data quality by 20% and openly publish 40 GB of textual data as FAIR data; evaluating handwritten text recognition (HTR) engines (Microsoft Azure Computer Vision, Google Cloud Vision API (Application Programming Interface), and Amazon Textract) to improve scientific name-finding in BHL's handwritten archival materials using algorithms developed by Global Names Architecture; extracting data from collecting events by loading HTR coordinate outputs into DataFrames with the Python library pandas to create structured data; classifying BHL page-level images with OpenAI's CLIP, a neural network model, to accurately identify the handwritten sub-corpus of primary source materials in BHL; and running an A/B test to evaluate the efficiency and accuracy of human-keyed transcription data extraction, in order to provide high-quality, human-vetted datasets that can be deposited with data aggregators.
The ongoing development of a scalable data pipeline for BHL's biodiversity and climate-related datasets requires sustained support from, and partnership with, the biodiversity community. Initial results demonstrate that liberating data from archival and handwritten field notes is arduous but feasible. Extending these methodologies to the broader scientific literature presents new research opportunities. Extracting and normalizing data from unstructured textual sources can significantly advance biodiversity research and inform environmental policy. BHL staff are committed to building multiple scalable data pipelines, with the ultimate goal of constructing a global biodiversity knowledge graph, rich in interconnected data and semantic meaning, that enables informed decisions for the preservation and sustainable management of Earth's biodiversity.
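To make the coordinate-based extraction step more concrete, the following is a minimal sketch of how word-level HTR output might be turned into tabular records. It is not the BHL pipeline itself. The `Word` structure, the `COLUMNS` layout, the `y_tolerance` value, and the sample words are all assumptions invented for illustration; real HTR engines return richer bounding-box structures that would first need to be reduced to this form. The sketch uses only the Python standard library; the resulting list of dictionaries could then be passed to `pandas.DataFrame(records)`.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with a simplified position (hypothetical format)."""
    text: str
    x: float  # left edge of the word's bounding box, in pixels
    y: float  # vertical centre of the bounding box, in pixels


# Assumed column layout for a ledger-style field notebook page:
# (field name, left bound inclusive, right bound exclusive), in pixels.
COLUMNS = [
    ("date", 0, 200),
    ("locality", 200, 600),
    ("scientific_name", 600, 1000),
]


def group_into_lines(words, y_tolerance=12.0):
    """Cluster words into text lines by vertical proximity.

    A word joins the current line when its y value is within
    y_tolerance of the first word placed on that line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(word.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return lines


def line_to_record(line):
    """Assign each word of a line to a column by its x position."""
    parts = {name: [] for name, _, _ in COLUMNS}
    for word in sorted(line, key=lambda w: w.x):
        for name, left, right in COLUMNS:
            if left <= word.x < right:
                parts[name].append(word.text)
                break
    return {name: " ".join(tokens) for name, tokens in parts.items()}


def words_to_records(words):
    """Convert positioned words into one dictionary per text line."""
    return [line_to_record(line) for line in group_into_lines(words)]


# Invented sample data, standing in for HTR output from one page.
SAMPLE_WORDS = [
    Word("12", 20, 100), Word("May", 60, 102), Word("1902", 110, 99),
    Word("Rock", 220, 101), Word("Creek", 290, 100),
    Word("Quercus", 620, 103), Word("alba", 720, 101),
    Word("14", 22, 141), Word("May", 62, 139), Word("1902", 112, 140),
    Word("Potomac", 225, 142), Word("Flats", 320, 140),
    Word("Acer", 622, 139), Word("rubrum", 700, 141),
]

if __name__ == "__main__":
    for record in words_to_records(SAMPLE_WORDS):
        print(record)
```

Fixed column bounds are the simplest possible design and only suit pages with a regular, ledger-like layout. Free-form field notes would likely need a different strategy, such as locating dates and scientific names by pattern or name-finding before assigning fields.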
