Abstract:

Purpose: Large language models (LLMs), deep learning models designed to comprehend and generate meaningful responses, have gained public attention in recent years. This study evaluates and compares the performance of LLMs in answering questions about breast cancer in the Chinese context.

Material and methods: ChatGPT, ERNIE Bot, and ChatGLM were chosen to answer 60 questions related to breast cancer posed by two oncologists. Responses were scored as comprehensive, correct but inadequate, mixed with correct and incorrect data, completely incorrect, or unanswered. The accuracy, length, and readability of the answers were compared across models using statistical software.

Results: ChatGPT answered all 60 questions, with 40 (66.7%) comprehensive answers and six (10.0%) correct but inadequate answers. ERNIE Bot answered all 60 questions, with 34 (56.7%) comprehensive answers and seven (11.7%) correct but inadequate answers. ChatGLM answered all 60 questions, with 35 (58.3%) comprehensive answers and six (10.0%) correct but inadequate answers. Differences in the chosen accuracy metrics among the three LLMs did not reach statistical significance, although only ChatGPT demonstrated a sense of human compassion. Accuracy was lowest for questions about breast cancer treatment, averaging 44.4% across the three models. ERNIE Bot's responses were significantly shorter than those of ChatGPT and ChatGLM (p < .001 for both). Readability scores did not differ significantly among the three models.

Conclusions: In the Chinese context, ChatGPT, ERNIE Bot, and ChatGLM currently show similar capabilities in answering breast cancer-related questions. These three LLMs may serve as adjunct informational tools for breast cancer patients in the Chinese context, offering guidance for general inquiries. However, for highly specialized issues, particularly breast cancer treatment, LLMs cannot deliver reliable performance, and they should be used under the supervision of healthcare professionals.
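As a rough illustration of how the reported proportions can be checked, the sketch below recomputes the percentages from the counts given in the abstract and runs a Pearson chi-square test on the comprehensive-answer counts. The abstract does not name the statistical test or software that was used, so the choice of test, the function names `percent` and `chi_square_three_groups`, and the 3 x 2 table layout are all assumptions made for this example, not the authors' method.

```python
import math

# Counts reported in the abstract (60 questions per model).
TOTAL_QUESTIONS = 60
comprehensive = {"ChatGPT": 40, "ERNIE Bot": 34, "ChatGLM": 35}
correct_but_inadequate = {"ChatGPT": 6, "ERNIE Bot": 7, "ChatGLM": 6}


def percent(count, total=TOTAL_QUESTIONS):
    """Share of answers in a category, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


def chi_square_three_groups(successes, total=TOTAL_QUESTIONS):
    """Pearson chi-square test on a 3 x 2 table (in category vs. not, per model).

    Assumption: this test is used for illustration only; the abstract does
    not state which test was applied. Three groups and two outcomes give
    2 degrees of freedom, and for df = 2 the p-value is exactly exp(-chi2 / 2).
    """
    if len(successes) != 3:
        raise ValueError("closed-form p-value below assumes exactly 3 groups")
    grand_total = total * len(successes)
    in_category = sum(successes)
    column_totals = [in_category, grand_total - in_category]
    chi2 = 0.0
    for s in successes:
        observed_row = [s, total - s]
        for observed, column_total in zip(observed_row, column_totals):
            expected = total * column_total / grand_total
            chi2 += (observed - expected) ** 2 / expected
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value


if __name__ == "__main__":
    for model, count in comprehensive.items():
        print(model, percent(count), percent(correct_but_inadequate[model]))
    chi2, p_value = chi_square_three_groups(list(comprehensive.values()))
    print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Under these assumptions the percentages match those reported, and the p-value for the comprehensive-answer counts is well above .05, which is in line with the reported absence of a significant difference in accuracy. The study's actual analysis may have used a different test or a different grouping of the score categories.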

DoctorGPT: A Large Language Model with Chinese Medical Question-Answering Capabilities

TCMChat: A Generative Large Language Model for Traditional Chinese Medicine

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation

MedGo: A Chinese Medical Large Language Model

IvyGPT: InteractiVe Chinese pathwaY language model in medical domain

Continuous Training and Fine-tuning for Domain-Specific Language Models in Medical Question Answering

Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge

Integrating UMLS Knowledge into Large Language Models for Medical Question Answering

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge

Enhancing Healthcare through Large Language Models: A Study on Medical Question Answering

ZhongJing: A Locally Deployed Large Language Model for Traditional Chinese Medicine and Corresponding Evaluation Methodology (a large language model for data fine-tuning in the field of Traditional Chinese Medicine and a new evaluation method called TCMEval are proposed)

DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task

Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context

Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering

Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue

Development and evaluation of a large language model of ophthalmology in Chinese

MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine