Abstract: Background: The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios. Objective: This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field. Methods: We used the PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns. Results: Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% of papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability.
There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and there were 58 (89%) papers expressing concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and in providing general medical knowledge to patients with relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research. Conclusions: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in health care.

Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other large language models in scholarly peer review

ChatGPT and large language models in academia: opportunities and challenges

Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

Can large language models replace humans in the systematic review process? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

Human-in-the-Loop AI Reviewing: Feasibility, Opportunities, and Risks

The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review

A Critical Examination of the Ethics of AI-Mediated Peer Review

Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

The ethics of ChatGPT in medicine and healthcare: a systematic review on Large Language Models (LLMs)

AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews

Is ChatGPT a “Fire of Prometheus” for Non-Native English-Speaking Researchers in Academic Writing?

Can large language models provide useful feedback on research papers? A large-scale empirical analysis

Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review

ChatGPT and the Future of Journal Reviews: A Feasibility Study

Ten simple rules for using large language models in science, version 1.0

Perils and opportunities in using large language models in psychological research

Automatic Large Language Model Evaluation Via Peer Review

Delving into ChatGPT usage in academic writing through excess vocabulary

Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity