Abstract: Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA have hinted at their potential to revolutionize medical applications, yet their use in clinical settings often reveals limitations caused by a lack of specialized training on medical data. In response to this challenge, this study introduces Me-LLaMA, a novel medical LLM family that includes the foundation models Me-LLaMA 13/70B and their chat-enhanced versions Me-LLaMA 13/70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 on large medical datasets. Our methodology leverages a comprehensive domain-specific data suite, including a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) covering six critical medical tasks with 12 datasets. Our extensive evaluation using MIBE shows that Me-LLaMA models achieve better overall performance than existing open-source medical LLMs in zero-shot, few-shot, and supervised learning settings. With task-specific instruction tuning, Me-LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me-LLaMA models outperform other open-source medical LLMs in mitigating this issue. Me-LLaMA is one of the largest open-source medical foundation LLMs that use both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other open-source medical LLMs, making it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at:

What problem does this paper attempt to address?

The paper addresses the following problems:

1. **Limitations of large language models in the medical field**: Although existing large language models (such as ChatGPT and GPT-4) perform well on general tasks, they have limitations in medical applications, mainly because they are not trained specifically on medical data. As a result, they perform poorly on medical tasks, especially those that require precise, professional medical knowledge.
2. **Development of open-source large language models for medicine**: Some open-source large language models (such as LLaMA) perform well in the general domain but also lack professional medical knowledge. Developing open-source medical large language models that make effective use of medical data has therefore become an important research direction.
3. **Effects of continual pre-training and instruction tuning**: The paper explores improving the performance of large language models on medical tasks through continual pre-training and instruction tuning. Specifically, it introduces Me-LLaMA, a family of medical large language models based on LLaMA2. The models are continually pre-trained on large-scale medical data, and instruction tuning then improves their performance on specific tasks.
4. **The problem of catastrophic forgetting**: During training, a model may forget previously learned knowledge, especially when new data is introduced. The paper studies how to mitigate this problem by optimizing the training method so that the model retains old knowledge while learning new knowledge.
5. **Model performance evaluation**: The paper proposes a comprehensive data suite and evaluation benchmark (MIBE) to evaluate medical large language models on multiple tasks in zero-shot, few-shot, and supervised learning settings. Comparison with existing models is used to verify the performance of the Me-LLaMA models.

In summary, the main objective of the paper is to overcome the limitations of existing large language models in medical applications by developing the Me-LLaMA models, to improve performance on medical tasks, and to provide a comprehensive evaluation framework for verifying the models' effectiveness.
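One common way to make the catastrophic forgetting point above concrete is to compare a model's benchmark scores before and after domain adaptation. The following is a minimal Python sketch under that assumption; it is not the paper's evaluation code. The function name `forgetting_report`, the task names, and all scores are hypothetical placeholders chosen for illustration, not results reported in the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ForgettingReport:
    per_task_delta: dict[str, float]  # adapted score minus base score
    mean_delta: float                 # average change over shared tasks
    degraded_tasks: list[str]         # tasks where the adapted model got worse


def forgetting_report(
    base_scores: dict[str, float],
    adapted_scores: dict[str, float],
    tolerance: float = 0.0,
) -> ForgettingReport:
    """Compare scores of a base model and its domain-adapted version.

    A task counts as degraded when its score drops by more than `tolerance`.
    Only tasks present in both score dictionaries are compared.
    """
    shared = sorted(set(base_scores) & set(adapted_scores))
    if not shared:
        raise ValueError("no tasks in common between the two score sets")
    deltas = {task: adapted_scores[task] - base_scores[task] for task in shared}
    mean_delta = sum(deltas.values()) / len(deltas)
    degraded = [task for task in shared if deltas[task] < -tolerance]
    return ForgettingReport(deltas, mean_delta, degraded)


# Hypothetical placeholder scores (NOT results from the paper).
base = {"general_qa": 0.60, "reading_comp": 0.70, "medical_qa": 0.40}
adapted = {"general_qa": 0.55, "reading_comp": 0.70, "medical_qa": 0.50}

report = forgetting_report(base, adapted)
print(report.degraded_tasks)
```

In this sketch, a general-domain task whose score falls after adaptation is flagged as degraded, while a gain on a medical task raises the mean change. A non-zero `tolerance` can be used to ignore small drops that fall within evaluation noise.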

Me LLaMA: Foundation Large Language Models for Medical Applications
