biorecap: an R package for summarizing bioRxiv preprints with a local LLM

Stephen D. Turner

2024-08-21

Abstract: The establishment of bioRxiv facilitated the rapid adoption of preprints in the life sciences, accelerating the dissemination of new research findings. However, the sheer volume of preprints published daily can be overwhelming, making it challenging for researchers to stay updated on the latest developments. Here, I introduce biorecap, an R package that retrieves and summarizes bioRxiv preprints using a large language model (LLM) running locally on nearly any commodity laptop. biorecap leverages the ollamar package to interface with the Ollama server and API endpoints, allowing users to prompt any local LLM available through Ollama. The package follows tidyverse conventions, enabling users to pipe the output of one function as input to another. Additionally, biorecap provides a single wrapper function that generates a timestamped CSV file and HTML report containing short summaries of recent preprints published in user-configurable subject areas. By combining the strengths of LLMs with the flexibility and security of local execution, biorecap represents an advancement in the tools available for managing the information overload in modern scientific research. The biorecap R package is available on GitHub at https://github.com/stephenturner/biorecap under an open-source (MIT) license.

Other Quantitative Biology

What problem does this paper attempt to address?

This paper addresses information overload in modern life-science research: how to help researchers quickly find and understand the large number of bioRxiv preprints published every day. Specifically:

1. **Information overload**: Preprint servers such as bioRxiv have greatly accelerated the release of research in the life sciences. The volume of preprints published each day makes it hard for researchers to keep up with new work, and harder still to pick out the most relevant and important studies.
2. **More efficient information acquisition**: To meet this challenge, the author developed an R package, `biorecap`, that automatically retrieves and summarizes bioRxiv preprints using a large language model (LLM) running locally. This makes information acquisition more efficient and also protects data security and privacy, because all processing happens locally without relying on cloud-based services.
3. **Flexibility and security**: `biorecap` combines the text-processing capabilities of LLMs with the flexibility and security of local execution. Researchers can run these models on almost any ordinary laptop, which reduces cost and increases control over their data.

In summary, the paper's main objective is to help researchers manage and understand a large body of preprint literature more efficiently with the `biorecap` tool, so that they can better keep up with progress in the life sciences.
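The workflow the abstract describes (turn each preprint into a prompt, send it to a local model through the Ollama server, and save the summaries under a timestamped file name) can be sketched in a few lines. This is an illustration in Python, not biorecap's R implementation: the function names, the prompt wording, the file-naming scheme, and the `llama3.1` model name are all assumptions made for the example. The only external interface used is Ollama's `/api/generate` endpoint at its default local address.

```python
import csv
import io
import json
import urllib.request
from datetime import datetime

# Default address of a local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(title, abstract, n_sentences=2):
    """Turn one preprint record into a summarization prompt (hypothetical wording)."""
    return (
        f"Summarize the following preprint in {n_sentences} sentences. "
        f"Do not add commentary.\n\nTitle: {title}\nAbstract: {abstract}"
    )


def summarize(prompt, model="llama3.1", url=OLLAMA_URL):
    """Send a prompt to a local Ollama server and return the generated text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def recap(preprints, summarizer=summarize):
    """Add a summary to each preprint record.

    The summarizer is injectable so the pipeline can be exercised offline.
    """
    rows = []
    for p in preprints:
        prompt = build_prompt(p["title"], p["abstract"])
        rows.append(
            {"subject": p["subject"], "title": p["title"], "summary": summarizer(prompt)}
        )
    return rows


def report_filename(subject, when, ext="csv"):
    """Build a timestamped file name (the naming scheme is hypothetical)."""
    return f"recap-{subject}-{when:%Y-%m-%d-%H%M%S}.{ext}"


def to_csv(rows):
    """Serialize summary rows (a list of dicts) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["subject", "title", "summary"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With an Ollama server running and a model pulled, `recap(preprints)` would return one summary per record, and `to_csv(...)` plus `report_filename(...)` would give the contents and name of the output file. Passing a stub as `summarizer` lets the same pipeline run without a model, which is convenient for checking the surrounding logic.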
