biorecap: an R package for summarizing bioRxiv preprints with a local LLM

Stephen D. Turner
2024-08-21
Abstract: The establishment of bioRxiv facilitated the rapid adoption of preprints in the life sciences, accelerating the dissemination of new research findings. However, the sheer volume of preprints published daily can be overwhelming, making it challenging for researchers to stay updated on the latest developments. Here, I introduce biorecap, an R package that retrieves and summarizes bioRxiv preprints using a large language model (LLM) running locally on nearly any commodity laptop. biorecap leverages the ollamar package to interface with the Ollama server and API endpoints, allowing users to prompt any local LLM available through Ollama. The package follows tidyverse conventions, enabling users to pipe the output of one function as input to another. Additionally, biorecap provides a single wrapper function that generates a timestamped CSV file and HTML report containing short summaries of recent preprints published in user-configurable subject areas. By combining the strengths of LLMs with the flexibility and security of local execution, biorecap represents an advancement in the tools available for managing the information overload in modern scientific research. The biorecap R package is available on GitHub at https://github.com/stephenturner/biorecap under an open-source (MIT) license.
Other Quantitative Biology
What problem does this paper attempt to address?
This paper addresses information overload in modern life science research: how to help researchers quickly find and understand the large number of bioRxiv preprints published every day. Specifically:

1. **Information overload**: With the rise of preprint servers such as bioRxiv, research in the life sciences is released much faster than before. The number of preprints published each day makes it hard for researchers to keep up with new work and to pick out the most relevant and important studies.
2. **More efficient information retrieval**: To meet this challenge, the author developed an R package named `biorecap`, which automatically retrieves bioRxiv preprints and summarizes them using a large language model (LLM) running locally. Because all processing happens on the user's own machine rather than through a cloud-based service, the data stays private and under the user's control.
3. **Flexibility and security**: `biorecap` combines the text-processing capabilities of LLMs with the flexibility and security of local execution. Researchers can run these models on almost any ordinary laptop, which reduces cost and increases control over their data.

In summary, the main objective of the paper is to help researchers manage and understand large numbers of preprints more efficiently through the `biorecap` tool, so that they can better keep up with the latest progress in the life sciences.
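The workflow described above (build a prompt for each preprint, send it to a locally running model, and save the summaries to a timestamped CSV file) can be sketched in a few lines. This is not biorecap's code: biorecap is an R package that talks to Ollama through ollamar, and its function names are not given in this text. The Python below is only an illustration of the same idea. All function names, the CSV file name pattern, and the default model name `llama3.1` are assumptions made for the example; the only external interface it relies on is Ollama's `/api/generate` HTTP endpoint on its default port 11434.

```python
import csv
import json
import urllib.request
from datetime import datetime
from pathlib import Path


def build_prompt(title, abstract, n_sentences=2):
    """Compose a summarization prompt for one preprint (wording is illustrative)."""
    return (
        f"Summarize the following preprint in {n_sentences} sentences.\n"
        f"Title: {title}\nAbstract: {abstract}"
    )


def ollama_generate(prompt, model="llama3.1", host="http://localhost:11434"):
    """Send one prompt to a local Ollama server and return the generated text.

    The model name is an assumption; use any model already pulled into Ollama.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def add_summaries(preprints, generate=ollama_generate):
    """Return copies of the records with 'prompt' and 'summary' fields added.

    `generate` is any callable mapping a prompt string to a summary string,
    so the model call can be swapped out (for example, with a stub in tests).
    """
    rows = []
    for record in preprints:
        prompt = build_prompt(record["title"], record["abstract"])
        rows.append({**record, "prompt": prompt, "summary": generate(prompt).strip()})
    return rows


def write_csv(rows, out_dir, now=None):
    """Write the rows to a timestamped CSV file and return its path."""
    if not rows:
        raise ValueError("no rows to write")
    stamp = (now or datetime.now()).strftime("%Y-%m-%d-%H%M%S")
    path = Path(out_dir) / f"preprint-summaries-{stamp}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

With an Ollama server running locally, `write_csv(add_summaries(records), "reports")` would summarize a list of `{"subject", "title", "abstract"}` dictionaries and save the result. Passing the model call in as the `generate` argument keeps the network dependency in one place, which mirrors the point made above that everything runs on the user's own machine.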