Abstract:

Introduction: Large language models (LLMs) have gained popularity because of their natural language generation and interpretation capabilities. Integrating these models into medicine enables tasks such as summarizing medical histories, synthesizing literature, and suggesting diagnoses. Models such as ChatGPT, GPT-4, and Med-PaLM 2 (Singhal et al., 2023) have demonstrated their proficiency by achieving high scores on medical tests such as the United States Medical Licensing Examination (USMLE) (Kung et al., 2023). However, LLMs can be inaccurate and may provide unverified or erroneous information. In this study, we investigate the potential uses of LLMs in hematology, assessing their knowledge through hematology questions from the USMLE. Additionally, we propose augmenting LLMs with retrieval capabilities for medical guidelines in order to reduce incorrect information. By extracting relevant information from specified medical documents, this approach has the potential to streamline decision-making processes.

Methods: For comparative purposes, all experiments were conducted using both the GPT-3.5-turbo and GPT-4 models. In a first step, we evaluated the general hematology knowledge and performance of the LLMs by testing them on a collected dataset of 127 question-answer pairs, covering various aspects of the field, from the hematology section of the USMLE. In a second step, we evaluated the proposed information retrieval framework using a set of 120 multiple-choice questions focused specifically on the 4th revision of the World Health Organization classification of myeloid neoplasms and acute leukemia (subsequently called WHO 2017). By testing the framework on this domain-specific dataset, we aimed to assess its ability to extract specific clinical context and relevant information from complex clinical guidelines.

Each question from the WHO 2017 guideline dataset was evaluated using two techniques. First, the questions were assessed using a zero-shot approach, in which the question and its answer options are posed directly to the model, to assess the LLM's ability to respond from its own knowledge. Second, we applied our proposed information retrieval approach, in which the system searches the external document (the WHO 2017 guideline) for extracts relevant and similar to each question. The system then answered based on the retrieved context, with the aim of producing more accurate and contextually informed responses. To achieve this, we created an embedding space containing the document's content and ran a cosine-similarity search between a given question and all content extracts from the document. The three extracts most similar to the question were used as context for the LLM.

Results: In the evaluation of 127 hematology questions from the USMLE, GPT-3.5 in zero-shot mode achieved 63% accuracy, while GPT-4 achieved a higher accuracy of 82%. On the WHO 2017 question dataset, the zero-shot approach achieved accuracies of 51% for GPT-3.5 and 71% for GPT-4. Adding information retrieval, with the three most relevant extracts from the guideline supplied as context, substantially improved performance: GPT-3.5 achieved 86% accuracy and GPT-4 achieved 97% accuracy.

Conclusions: LLMs have great potential, and current models already show substantial knowledge of hematology. However, ensuring the consistency and safety of their responses is critical for reliable application in medical settings (Thirunavukarasu et al., 2023). To address this, we demonstrated the benefits of information retrieval for question answering in hematology: supplying retrieved guideline extracts substantially improved response accuracy and reliability by allowing the LLMs to give more informed and contextually appropriate answers. The concept was validated using the WHO 2017 guideline and can be readily adapted to answer questions based on any set of hematology-related documents. Leveraging LLMs has the potential to enhance the efficiency and effectiveness of clinical, educational, and research work in hematology.
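The retrieval step described in the Methods (embed the guideline extracts, rank them by cosine similarity to the question, pass the top three to the model as context) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not name the embedding model, the chunking strategy, or the prompt wording, so the bag-of-words `embed` function is a toy stand-in for a real embedding model. The function names, the prompt text, and the placeholder extracts are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_extracts(question, extracts, k=3):
    """Return the k extracts most similar to the question (k=3 as in the abstract)."""
    q_vec = embed(question)
    ranked = sorted(
        extracts,
        key=lambda e: cosine_similarity(q_vec, embed(e)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, options, context=None):
    """Zero-shot prompt when context is None, retrieval-augmented prompt otherwise.

    The prompt wording is hypothetical; the abstract does not specify it.
    """
    lines = []
    if context:
        lines.append("Answer using only the context below.")
        lines.extend(f"Context {i}: {c}" for i, c in enumerate(context, 1))
    lines.append(f"Question: {question}")
    lines.extend(f"{label}. {text}" for label, text in options.items())
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)


# Placeholder extracts and question; they carry no clinical content.
extracts = [
    "placeholder extract on diagnostic criteria for acute myeloid leukemia",
    "placeholder extract on chronic myeloid leukemia",
    "placeholder extract on myelodysplastic syndromes",
    "placeholder extract on mastocytosis",
]
question = "Which diagnostic criteria apply to acute myeloid leukemia?"
options = {"A": "placeholder option A", "B": "placeholder option B"}

zero_shot_prompt = build_prompt(question, options)
rag_prompt = build_prompt(question, options, top_k_extracts(question, extracts))
```

In a real pipeline, `embed` would call an embedding model and the two prompts would be sent to the LLM. Comparing the model's answers against the answer key for each prompt type would then give the zero-shot and retrieval accuracies being compared.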

Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework

Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications

Critical Care Studies Using Large Language Models Based on Electronic Healthcare Records: A Technical Note

Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model

Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes

Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study

Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard

Systematic review: The use of large language models as medical chatbots in digestive diseases

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Development of a Liver Disease-Specific Large Language Model Chat Interface using Retrieval Augmented Generation

Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates

Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study

Assessment of Artificial Intelligence Language Models and Information Retrieval Strategies for QA in Hematology

Evaluating and Enhancing Large Language Models' Performance in Domain-Specific Medicine: Development and Usability Study With DocOA

Enhancing Large Language Models for Clinical Decision Support by Incorporating Clinical Practice Guidelines

Based on Medicine, The Now and Future of Large Language Models

Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models

GastroBot: a Chinese gastrointestinal disease chatbot based on the retrieval-augmented generation