Abstract: Recently, Large Language Models (LLMs) have transformed information retrieval and have been widely adopted across domains because they can process extensive textual data and generate diverse insights. Biodiversity literature, with its broad range of topics, is no exception to this trend (Boyko et al. 2023, Castro et al. 2024). LLMs can help with information extraction and synthesis, text annotation and classification, and many other natural language processing tasks. We leverage LLMs to automate information retrieval from biodiversity publications, building on data from our previous work (Ahmed et al. 2023, Ahmed et al. 2024), in which we assessed the reproducibility of deep learning (DL) methods used in biodiversity research. There, we developed a manual pipeline to extract key information on DL pipelines from 61 publications (Ahmed et al. 2024): dataset, source code, open-source frameworks, model architecture, hyperparameters, software and hardware specifications, randomness, averaging of results, and evaluation metrics. While this enabled the analysis, it required extensive manual effort by domain experts, which limited scalability.

To address this, we propose an automatic information extraction pipeline using LLMs with the Retrieval-Augmented Generation (RAG) technique. RAG combines the retrieval of relevant documents with the generative capabilities of LLMs to enhance the quality and relevance of the extracted information. In our pipeline (Fig. 1), we employed an open-source LLM, the Hugging Face implementation of Mixtral 8x7B (Jiang et al. 2024), a mixture-of-experts model, and adapted the RAG pipeline from earlier work (Kommineni et al. 2024). The pipeline was run on a single NVIDIA A100 40GB graphics processing unit with 4-bit quantization. To evaluate our pipeline, we compared the expert-assisted manual approach with the LLM-assisted automatic approach.
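
To make the retrieve-then-generate idea behind RAG concrete, the following is a minimal sketch rather than the pipeline used in this work. It assumes a toy bag-of-words cosine similarity in place of a real embedding model, and it stops at prompt assembly instead of calling an LLM such as Mixtral 8x7B. The function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), the example chunks and the question are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two token-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    # Rank text chunks by similarity to the question and keep the top k.
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    # Assemble the prompt that would be passed to the generative model.
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical chunks from a publication (assumption, not real data).
chunks = [
    "The model was implemented in the PyTorch framework.",
    "Images of insects were collected at three field sites.",
    "The learning rate was 0.001 and the batch size was 32.",
]
question = "Which framework was the model implemented in?"

top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real pipeline the retriever would typically use dense embeddings and a vector index, and the assembled prompt would be sent to the quantized LLM; both are left out here to keep the sketch self-contained.
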
We measured the consistency of the two approaches using inter-annotator agreement (IAA), quantified with Cohen's kappa score (Pedregosa et al. 2011), where a higher score indicates more reliable and aligned outputs (1: perfect agreement; 0: agreement no better than chance; negative values: agreement worse than chance). The kappa score between the human experts (annotators 1 and 2) was 0.54 (moderate agreement), while the scores comparing each human expert with the LLM were 0.16 and 0.12 (slight agreement). The difference is partly due to the human annotators having access to more information (including code, datasets, figures, tables and supplementary materials) than the LLM, which was restricted to the text itself. Given these restrictions, the results are promising, and they suggest room for improvement by adding further modalities to the LLM inputs.

Future work will involve several key improvements to our LLM-assisted information retrieval pipeline:

- Incorporating multimodal data (e.g., figures, tables and code) as input to the LLM, alongside text, to enhance the accuracy and comprehensiveness of the information retrieved from publications.
- Optimizing the retrieval component of the RAG framework with advanced techniques such as semantic search, hybrid search or relevance feedback to improve the quality of the outputs.
- Expanding the evaluation to a larger corpus of biodiversity literature to gain a more comprehensive understanding of the pipeline's capabilities and to guide its optimization.
- Adopting a human-in-the-loop approach that evaluates the LLM-generated outputs against the ground-truth values from the respective publications, to increase the quality of the overall pipeline.
- Employing evaluation metrics beyond Cohen's kappa score to better understand the LLM-assisted outputs.

Leveraging LLMs to automate information retrieval from biodiversity publications is a notable advance toward scalable and efficient analysis of biodiversity literature. Initial results show promise, yet there is substantial potential for enhancement through the integration of multimodal data, optimized retrieval mechanisms, and comprehensive evaluation. By addressing these areas, we aim to improve the accuracy and utility of our pipeline, ultimately enabling broader and more in-depth analysis of biodiversity literature.
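
For readers unfamiliar with the agreement measure used in the evaluation, the sketch below computes Cohen's kappa for two annotators from its definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. It is an illustrative pure-Python version, not the scikit-learn implementation cited above, and the yes/no labels are made up for the example rather than taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    # Cohen's kappa for two annotators who labeled the same items.
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical answers to "is this item reported in the paper?" (assumption).
human = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
llm = ["yes", "no", "no", "yes", "yes", "no", "no", "yes"]

print(cohen_kappa(human, llm))  # 0.25
```

Here the two annotators agree on 5 of 8 items (p_o = 0.625), while chance agreement is p_e = 0.5, which gives a kappa of 0.25. Raw agreement alone would overstate how well the outputs align.
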

Automating Information Retrieval from Biodiversity Literature Using Large Language Models: A Case Study
