Abstract

Objective: To optimize the training strategy of large language models for medical applications, focusing on creating clinically relevant systems that efficiently integrate into healthcare settings, while ensuring high standards of accuracy and reliability.

Materials and Methods: We curated a comprehensive collection of high-quality, domain-specific data and used it to train several models, each with different subsets of this data. These models were rigorously evaluated against standard medical benchmarks, such as the USMLE, to measure their performance. Furthermore, for a thorough effectiveness assessment, they were compared with other state-of-the-art medical models of comparable size.

Results: The models trained with a mix of high-quality, domain-specific, and general data showed superior performance over those trained on larger, less clinically relevant datasets (P < .001). Our 7-billion-parameter model Med5 scores 60.5% on MedQA, outperforming the previous best of 49.3% from comparable models, and becomes the first of its size to achieve a passing score on the USMLE. Additionally, this model retained its proficiency in general domain tasks, comparable to state-of-the-art general domain models of similar size.

Discussion: Our findings underscore the importance of integrating high-quality, domain-specific data in training large language models for medical purposes. The balanced approach between specialized and general data significantly enhances the model’s clinical relevance and performance.

Conclusion: This study sets a new standard in medical language models, proving that a strategically trained, smaller model can outperform larger ones in clinical relevance and general proficiency, highlighting the importance of data quality and expert curation in generative artificial intelligence for healthcare applications.
