Abstract:

Background: Recent enhancements in Large Language Models (LLMs) such as ChatGPT have driven a rapid increase in user adoption. These models are accessible on mobile devices and support multimodal interactions, including conversations, code generation, and patient image uploads, which broadens their utility in giving healthcare professionals real-time support for clinical decision-making. Nevertheless, many authors have highlighted serious risks that may arise from the adoption of LLMs, principally related to safety and alignment with ethical guidelines.

Objective: To address these challenges, we introduce a novel methodological approach for assessing the feasibility of adopting LLMs within a specific healthcare area, with a focus on clinical nursing, by evaluating their performance and thereby guiding the choice of model. Emphasizing LLMs' adherence to scientific advancements, this approach prioritizes safety and care personalization, in line with the Organization for Economic Co-operation and Development (OECD) frameworks for responsible AI. Moreover, its dynamic nature is designed to adapt to future evolutions of LLMs.

Method: By integrating advanced multidisciplinary knowledge, including Nursing Informatics, and aided by a prospective literature review, seven key domains and specific evaluation items were identified as follows:
1. State of the Art Alignment & Safety.
2. Focus, Accuracy & Management of Prompt Ambiguity.
3. Data Integrity, Data Security, Ethics & Sustainability, in accordance with OECD Recommendations for Responsible AI.
4. Temporal Variability of Responses (Consistency).
5. Adaptation to Specific Standardized Terminology and Classifications for Healthcare Professionals.
6. General Capabilities: Post User Feedback Self-Evolution Capability and Organization in Chapters.
7. Ability to Drive Evolution in Healthcare.
A peer review by experts in nursing and AI was performed to ensure scientific rigor and breadth of insight, yielding an essential, reproducible, and coherent methodological approach. Thresholds on a 7-point Likert scale are defined to classify LLMs as "unusable", "usable with high caution", or "recommended". Nine state-of-the-art LLMs were evaluated using this methodology in clinical oncology nursing decision-making, producing preliminary results. Gemini Advanced, Anthropic Claude 3, and ChatGPT 4 reached the minimum score in the State of the Art Alignment & Safety domain required for classification as "recommended" and were also endorsed across all domains. LLAMA 3 70B and ChatGPT 3.5 were classified as "usable with high caution". The remaining models were classified as "unusable" in this domain.

Conclusion: The identification of a recommended LLM for a specific healthcare area, combined with its critical, prudent, and integrative use, can support healthcare professionals in decision-making processes.
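
The threshold-based classification described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not report the actual cut-off values or how item ratings are aggregated within a domain, so the thresholds `CAUTION_MIN` and `RECOMMENDED_MIN`, the use of the arithmetic mean, and the function name `classify_domain` are all hypothetical.

```python
from statistics import mean

# Hypothetical cut-offs on the 7-point Likert scale.
# The abstract does not state the real thresholds.
CAUTION_MIN = 3.5
RECOMMENDED_MIN = 5.5


def classify_domain(item_ratings):
    """Map the 7-point Likert item ratings of one domain to a usability label."""
    if not item_ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 7 for r in item_ratings):
        raise ValueError("ratings must lie on the 1-7 Likert scale")

    # Assumed aggregation: arithmetic mean of the item ratings.
    score = mean(item_ratings)

    if score >= RECOMMENDED_MIN:
        return "recommended"
    if score >= CAUTION_MIN:
        return "usable with high caution"
    return "unusable"


if __name__ == "__main__":
    # Made-up ratings for illustration only, not results from the study.
    for ratings in ([6, 7, 6], [4, 5, 4], [2, 3, 1]):
        print(ratings, "->", classify_domain(ratings))
```

Applying such a function to each of the seven domains in turn would give one label per domain, which matches the abstract's description of models being classified within a domain and then checked across all domains.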

Viability of Open Large Language Models for Clinical Documentation in German Health Care: Real-World Model Evaluation Study

[Clinical application of large language models : Does ChatGPT replace medical report formulation? An experience report]

Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models

Critical Care Studies Using Large Language Models Based on Electronic Healthcare Records: A Technical Note

From RAGs to riches: Using large language models to write documents for clinical trials

[Large Language Models for Rapid Simplification of Quality Assurance Data Input: Field Trial with Real Data in the Context of Tumour Documentation in Urology]

A study of generative large language model for medical research and healthcare

From Text to Tables: A Local Privacy Preserving Large Language Model for Structured Information Retrieval from Medical Documents

The future landscape of large language models in medicine

Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges

Using large language models for safety-related table summarization in clinical study reports

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes

Practical Applications of Large Language Models for Health Care Professionals and Scientists

A Survey of Clinicians’ Views of the Utility of Large Language Models

Evaluating and Mitigating Limitations of Large Language Models in Clinical Decision Making

Application of generative language models to orthopaedic practice

Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making

Distilling large language models for matching patients to clinical trials