Abstract:

Background: Recent enhancements in Large Language Models (LLMs) such as ChatGPT have driven a rapid increase in user adoption. These models are accessible on mobile devices and support multimodal interactions, including conversations, code generation, and patient image uploads, broadening their utility in providing healthcare professionals with real-time support for clinical decision-making. Nevertheless, many authors have highlighted serious risks that may arise from the adoption of LLMs, principally related to safety and alignment with ethical guidelines.

Objective: To address these challenges, we introduce a novel methodological approach designed to assess the feasibility of adopting LLMs in a specific healthcare area, with a focus on clinical nursing, by evaluating their performance and thereby guiding their selection. Emphasizing LLMs' adherence to scientific advancements, this approach prioritizes safety and care personalization, in line with the Organization for Economic Co-operation and Development (OECD) frameworks for responsible AI. Moreover, its dynamic nature is designed to adapt to future evolutions of LLMs.

Method: By integrating advanced multidisciplinary knowledge, including Nursing Informatics, and aided by a prospective literature review, seven key domains and specific evaluation items were identified as follows:
1. State of the Art Alignment & Safety.
2. Focus, Accuracy & Management of Prompt Ambiguity.
3. Data Integrity, Data Security, Ethics & Sustainability, in accordance with OECD Recommendations for Responsible AI.
4. Temporal Variability of Responses (Consistency).
5. Adaptation to Specific Standardized Terminology and Classifications for healthcare professionals.
6. General Capabilities: Post User Feedback Self-Evolution Capability and Organization in Chapters.
7. Ability to Drive Evolution in Healthcare.

A peer review by experts in nursing and AI was performed, ensuring scientific rigor and breadth of insights for an essential, reproducible, and coherent methodological approach. Using a 7-point Likert scale, thresholds are defined to classify LLMs as "unusable", "usable with high caution", or "recommended". Nine state-of-the-art LLMs were evaluated using this methodology in clinical oncology nursing decision-making, producing preliminary results. Gemini Advanced, Anthropic Claude 3, and ChatGPT 4 achieved the minimum score in the State of the Art Alignment & Safety domain required for classification as "recommended" and were also endorsed across all domains. LLAMA 3 70B and ChatGPT 3.5 were classified as "usable with high caution". The remaining models were classified as "unusable" in this domain.

Conclusion: The identification of a recommended LLM for a specific healthcare area, combined with its critical, prudent, and integrative use, can support healthcare professionals in decision-making processes.
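The threshold-based classification described in the abstract can be illustrated with a minimal sketch. The abstract does not report the actual cut-off values or how item scores are aggregated, so the thresholds `CAUTION_MIN` and `RECOMMENDED_MIN`, the use of a simple mean, and the function name `classify` are all assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical cut-offs on the 7-point Likert scale.
# The abstract does not state the real thresholds; these are placeholders.
CAUTION_MIN = 4.0
RECOMMENDED_MIN = 6.0


def classify(item_scores):
    """Map the Likert item scores (1-7) of one domain to a usability category.

    Aggregating by the arithmetic mean is an assumption, not the authors' rule.
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(not 1 <= s <= 7 for s in item_scores):
        raise ValueError("Likert scores must lie between 1 and 7")
    avg = mean(item_scores)
    if avg >= RECOMMENDED_MIN:
        return "recommended"
    if avg >= CAUTION_MIN:
        return "usable with high caution"
    return "unusable"
```

Under these assumed cut-offs, a model whose items in the State of the Art Alignment & Safety domain were scored 7, 6, and 6 would fall in the "recommended" category, while scores of 2, 3, and 1 would yield "unusable".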

An evaluation of the capabilities of language models and nurses in providing neonatal clinical decision support

A comparative vignette study: Evaluating the potential role of a generative AI model in enhancing clinical decision‐making in nursing

Augmenting intensive care unit nursing practice with generative AI: A formative study of diagnostic synergies using simulation-based clinical cases

Clinical Reasoning of a Generative AI Model Compared With Physicians

The Case Records of ChatGPT: Language Models and Complex Clinical Questions

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments

[AI-supported decision-making in obstetrics - a feasibility study on the medical accuracy and reliability of ChatGPT]

Assessing ChatGPT's capacity for clinical decision support in pediatrics: A comparative study with pediatricians using KIDMAP of Rasch analysis

Clinical decision making by ChatGPT vs medical oncologists: A retrospective concordance study

Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making

[ChatGPT for use in technology-enhanced learning in anesthesiology and emergency medicine and potential clinical application of AI language models : Between hype and reality around artificial intelligence in medical use]

Measured Performance and Healthcare Professional Perception of Large Language Models Used as Clinical Decision Support Systems: A Scoping Review

The Clinical Utility of Large Language Models in Diagnosing Neurocognitive Disorders among NACC Participants

Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations

Evaluation of large language models in breast cancer clinical scenarios: A comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2

Clinical Knowledge and Reasoning Abilities of AI Large Language Models in Anesthesiology: A Comparative Study on the ABA Exam

Large language models in solving clinical dilemmas - advantages and drawbacks

Embracing the future—is artificial intelligence already better? A comparative study of artificial intelligence performance in diagnostic accuracy and decision‐making