Language Model Powered Digital Biology with BRAD

Joshua Pickard, Ram Prakash, Marc Andrew Choi, Natalie Oliven, Cooper Stansbury, Jillian Cwycyshyn, Alex Gorodetsky, Alvaro Velasquez, Indika Rajapakse
2024-12-08
Abstract: Recent advancements in Large Language Models (LLMs) are transforming biology, computer science, engineering, and everyday life. However, integrating the wide array of computational tools, databases, and scientific literature continues to pose a challenge to biological research. LLMs are well-suited for unstructured integration, efficient information retrieval, and automating standard workflows and actions from these diverse resources. To harness these capabilities in bioinformatics, we present a prototype Bioinformatics Retrieval Augmented Digital assistant (BRAD). BRAD is a chatbot and agentic system that integrates a variety of bioinformatics tools. The Python package implements an AI `Agent` that is powered by LLMs and connects to a local file system, online databases, and a user's software. The `Agent` is highly configurable, enabling tasks such as Retrieval-Augmented Generation, searches across bioinformatics databases, and the execution of software pipelines. BRAD's coordinated integration of bioinformatics tools delivers a context-aware and semi-autonomous system that extends beyond the capabilities of conventional LLM-based chatbots. A graphical user interface (GUI) provides an intuitive interface to the system.
Artificial Intelligence, Information Retrieval, Software Engineering
What problem does this paper attempt to address?
This paper addresses the challenge of integrating large language models (LLMs) with bioinformatics tools to improve the efficiency and accuracy of biological research. It proposes a prototype system, BRAD (Bioinformatics Retrieval Augmented Digital assistant), which approaches the problem in four ways:

1. **Integrating diverse computational tools, databases, and scientific literature**: Bioinformatics researchers currently have to work across a large number of separate tools, databases, and publications. BRAD brings these resources together behind a single platform.
2. **Automating standard workflows and operations**: BRAD uses LLMs to integrate unstructured data, retrieve information efficiently, and automate standard workflows and operations. This improves efficiency and reduces the opportunity for human error.
3. **Improving the accuracy and transparency of information**: With Retrieval-Augmented Generation (RAG), BRAD draws current, verifiable information from external data sources when generating a response. This reduces LLM "hallucination", that is, the generation of inaccurate or fictional information.
4. **Providing a user-friendly interface**: In addition to a command-line interface, BRAD provides a graphical user interface (GUI) through which users can interact with the system more intuitively, upload documents, perform searches, manage sessions, and so on.

### Specific problem summary

- **How to simplify the deployment of LLMs in bioinformatics?**
  - BRAD's modular design and flexible tool interfaces let researchers configure and extend the system to fit different research needs.
- **How to ensure that the generated answers are accurate and verifiable?**
  - BRAD uses RAG to combine external data sources and literature, so that answers are grounded in reliable data and users can trace each piece of information back to its source.
- **How to improve the usability and user experience of the system?**
  - BRAD offers multiple interaction methods, including a chat interface and a GUI, and supports multiple LLMs to meet the needs of different users.

In conclusion, the BRAD system targets the challenges of tool integration, information retrieval, and workflow automation in bioinformatics research, with the aim of improving research efficiency and accuracy.
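
The retrieval-augmented generation idea described above can be illustrated with a minimal sketch. This is not BRAD's API: the function names (`retrieve`, `build_prompt`), the toy document store, and the bag-of-words cosine similarity are all assumptions chosen to keep the example self-contained, and a real system would more likely use learned embeddings and a vector database. The sketch shows only the two steps that make answers traceable: selecting relevant passages, then placing them in the prompt together with their source labels.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return up to k source names whose text overlaps the query, best first."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(text))), source)
        for source, text in documents.items()
    ]
    # Highest score first; ties broken by source name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [source for score, source in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Assemble an LLM prompt that carries the retrieved text and its sources."""
    sources = retrieve(query, documents, k)
    context = "\n".join(f"[{s}] {documents[s]}" for s in sources)
    prompt = (
        "Answer using only the context below and cite sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return prompt, sources


# Hypothetical document store: file names and contents are illustrative only.
DOCS = {
    "notes_a.txt": "MYOD1 is a transcription factor that drives muscle cell differentiation.",
    "notes_b.txt": "Hi-C measures three dimensional chromatin contacts across the genome.",
    "notes_c.txt": "Pandas is a Python library for tabular data analysis.",
}
QUERY = "Which transcription factor drives muscle differentiation?"

prompt, used_sources = build_prompt(QUERY, DOCS)
```

The resulting `prompt` string would then be sent to whichever LLM the system is configured to use. Because `used_sources` is returned alongside the prompt, an interface can show the user which documents informed the answer.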