Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent proprietary models such as GPT-4 and Gemini have made significant advances in biomedicine, but their use also raises privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques such as supervised fine-tuning, reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading techniques (e.g., preference learning) remain significantly limited in the open-source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. Using these datasets, we fine-tune a suite of specialized medical models based on the Llama-3 series, which show strong capabilities across various medical benchmarks. Moreover, we develop reward models that perform well on biomedical and general reward benchmarks, supporting further online preference learning within the biomedical LLM community. Datasets and models are available at https://github.com/TsinghuaC3I/UltraMedical

What problem does this paper attempt to address?

The main problem this paper attempts to address is enhancing the performance of open-source large language models (LLMs) in the biomedical field, making them comparable to proprietary models, while also addressing the privacy and security challenges that proprietary models may face. Specifically, the paper improves the performance of models based on the Llama-3 series on various medical benchmarks by constructing high-quality manual and synthetic datasets (UltraMedical) and fine-tuning these models using these datasets. Additionally, the paper develops robust reward models to enhance online preference learning capabilities.

### Main Contributions:

1. **Construction of High-Quality Datasets**: The UltraMedical dataset, containing approximately 410,000 medical instructions, was constructed. These instructions combine manual and synthetic prompts, with around 100,000 instructions annotated with completion preferences from advanced medical and general models, used for fine-tuning, reward modeling, and preference learning.
2. **Model Fine-Tuning**: By fine-tuning the Llama-3 series models with a multi-step optimization strategy, competitive performance on open-source medical benchmarks was achieved, narrowing the gap between open-source and proprietary models.
3. **Reward Model Training**: Based on UltraMedical preference data, medical reward benchmarks were annotated, and for the first time, reward models were trained in the biomedical field, achieving advanced performance on annotated medical and general reward benchmarks.
4. **Public Resources**: The datasets and models were released on GitHub and Huggingface, aiming to foster collaboration and accelerate the development of generative AI in the biomedical field.

### Problems Addressed:

- **Performance Gap**: By using high-quality datasets and advanced fine-tuning techniques, the performance gap between open-source and proprietary models on medical tasks is narrowed.
- **Privacy and Security**: Provides an alternative to using proprietary models that may involve privacy and security risks.
- **Preference Learning**: Explores and applies preference learning techniques in the biomedical field to improve the reasoning ability and practicality of the models.

In summary, this paper aims to enhance the performance and practicality of open-source large language models in the biomedical field through the construction of high-quality datasets and advanced training methods, while also addressing privacy and security issues.
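The summary above names reward modeling and direct preference optimization (DPO) but gives no formulas. As a rough illustration only, the sketch below shows the standard pairwise objectives these techniques usually rely on: a Bradley-Terry loss for a reward model and the DPO loss for a policy. It is not the authors' code; the function names, the scalar inputs, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are scalar rewards (hypothetical inputs) that a
    reward model assigns to the preferred and dispreferred completions.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the summary does not state one
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a completion under the
    trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # A reward model that ranks the chosen answer higher gets a lower loss.
    print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
    # A policy identical to the reference sits at log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

Both losses fall as the preferred completion is scored further above the dispreferred one, which is the property that preference annotations like those described for UltraMedical are meant to supply.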

UltraMedical: Building Specialized Generalists in Biomedicine

Towards Evaluating and Building Versatile Large Language Models for Medicine

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

PMC-LLaMA: Towards Building Open-source Language Models for Medicine

PMC-LLaMA: toward building open-source language models for medicine

From Beginner to Expert: Modeling Medical Knowledge into General LLMs

Me LLaMA: Foundation Large Language Models for Medical Applications

Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model

Aquila-Med LLM: Pioneering Full-Process Open-Source Medical Language Models

A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing

SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation

Large Language Model Benchmarks in Medical Tasks

Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

CMB: A Comprehensive Medical Benchmark in Chinese

Towards Building Multilingual Language Model for Medicine

Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data

A Survey for Large Language Models in Biomedicine