UltraMedical: Building Specialized Generalists in Biomedicine

Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, Bowen Zhou
2024-10-29
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent advanced proprietary models such as GPT-4 and Gemini have achieved significant advancements in biomedicine, which have also raised privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques like supervised fine-tuning, reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading technologies (e.g., preference learning) are still significantly limited in the open-source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on the Llama-3 series, demonstrating breathtaking capabilities across various medical benchmarks. Moreover, we develop powerful reward models skilled in biomedical and general reward benchmarks, enhancing further online preference learning within the biomedical LLM community. Datasets and models are available at https://github.com/TsinghuaC3I/UltraMedical
Computation and Language
What problem does this paper attempt to address?
The main problem this paper attempts to address is enhancing the performance of open-source large language models (LLMs) in the biomedical field, making them comparable to proprietary models, while also addressing the privacy and security challenges that proprietary models may face. Specifically, the paper improves the performance of models based on the Llama-3 series on various medical benchmarks by constructing high-quality manual and synthetic datasets (UltraMedical) and fine-tuning these models using these datasets. Additionally, the paper develops robust reward models to enhance online preference learning capabilities.

### Main Contributions:

1. **Construction of High-Quality Datasets**: The UltraMedical dataset, containing approximately 410,000 medical instructions, was constructed. These instructions combine manual and synthetic prompts, with around 100,000 instructions annotated with completion preferences from advanced medical and general models, used for fine-tuning, reward modeling, and preference learning.
2. **Model Fine-Tuning**: By fine-tuning the Llama-3 series models with a multi-step optimization strategy, competitive performance on open-source medical benchmarks was achieved, narrowing the gap between open-source and proprietary models.
3. **Reward Model Training**: Based on UltraMedical preference data, medical reward benchmarks were annotated, and for the first time, reward models were trained in the biomedical field, achieving advanced performance on annotated medical and general reward benchmarks.
4. **Public Resources**: The datasets and models were released on GitHub and Huggingface, aiming to foster collaboration and accelerate the development of generative AI in the biomedical field.

### Problems Addressed:

- **Performance Gap**: By using high-quality datasets and advanced fine-tuning techniques, the performance gap between open-source and proprietary models on medical tasks is narrowed.
- **Privacy and Security**: Provides an alternative to using proprietary models that may involve privacy and security risks.
- **Preference Learning**: Explores and applies preference learning techniques in the biomedical field to improve the reasoning ability and practicality of the models.

In summary, this paper aims to enhance the performance and practicality of open-source large language models in the biomedical field through the construction of high-quality datasets and advanced training methods, while also addressing privacy and security issues.
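
To make the preference learning and reward modeling mentioned above more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes the standard direct preference optimization (DPO) objective for a single chosen/rejected pair and the usual pairwise-accuracy metric for scoring a reward model on a preference benchmark. The function names, the `beta` default, and all inputs are hypothetical and chosen only for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability that the policy or the
    frozen reference model assigns to the chosen or rejected completion.
    `beta` is an assumed default, not a value taken from the paper.
    """
    # How much more the policy favors the chosen answer than the reference does.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the reward model scores the chosen answer higher."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals `log(2)`. The loss falls below that value as the policy shifts probability toward the preferred completion relative to the reference.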