Abstract:
Background: Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of these studies makes the evidence difficult to interpret and synthesize.
Objective: This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice.
Methods: A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. The methodologies of the included records were examined closely to identify common pertinent themes and possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. Two independent raters then used the finalized checklist to evaluate the included records. Interrater reliability was assessed with Cohen κ.
Results: The final data set that formed the basis for theme identification and analysis comprised 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing/Transparency, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). Interrater reliability of the METRICS scoring was acceptable, with Cohen κ ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). Classified by item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and the "Individual factors" item (classified as satisfactory).
Conclusions: The METRICS checklist can facilitate study design by guiding researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, given the variability observed in methodologies and reporting. The proposed METRICS checklist could serve as a preliminary basis for establishing a universally accepted approach to standardizing the design and reporting of generative AI-based studies in health care, a swiftly evolving research topic.
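
The interrater reliability statistic reported above can be illustrated with a minimal Python sketch. It assumes unweighted Cohen κ (the abstract does not say whether a weighted variant was used), and the function name `cohen_kappa` and both score lists are hypothetical illustrations, not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of records")
    n = len(rater_a)
    # Observed agreement: share of records where both raters gave the same score.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ordinal scores for one checklist item across 10 records
# (illustrative only; these are not values from the study).
rater_1 = [5, 4, 4, 3, 5, 2, 4, 3, 5, 1]
rater_2 = [5, 4, 3, 3, 5, 2, 4, 4, 5, 1]

kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

In this sketch the statistic would be computed once per checklist item, giving one κ value for each of the 9 items. The unweighted form treats every disagreement as equally severe, so a weighted variant may be more suitable if the item scores are treated as ordinal.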

Assessing Methods and Tools to Improve Reporting, Increase Transparency, and Reduce Failures in Machine Learning Applications in Health Care

Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness

A clinician's guide to understanding and critically appraising machine learning studies: a checklist for Ruling Out Bias Using Standard Tools in Machine Learning (ROBUST-ML)

Designing an ML Auditing Criteria Catalog as Starting Point for the Development of a Framework

Deep neural models for automated multi-task diagnostic scan management—quality enhancement, view classification and report generation

Key concepts, common pitfalls, and best practices in artificial intelligence and machine learning: focus on radiomics

Key Technology Considerations in Developing and Deploying Machine Learning Models in Clinical Radiology Practice

Machine Learning for Benchmarking Critical Care Outcomes

MINIMAR (MINimum Information for Medical AI Reporting): Developing reporting standards for artificial intelligence in health care

Proceedings From the 2022 ACR-RSNA Workshop on Safety, Effectiveness, Reliability, and Transparency in AI

Reproducible Reporting of the Collection and Evaluation of Annotations for Artificial Intelligence Models

Clinician checklist for assessing suitability of machine learning applications in healthcare

Strategies for Implementing Machine Learning Algorithms in the Clinical Practice of Radiology

Recommended Requirements and Essential Elements for Proper Reporting of the Use of Artificial Intelligence Machine Learning Tools in Biomedical Research and Scientific Publications

Assessing the Reporting Quality of Machine Learning Algorithms in Head and Neck Oncology

A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence–Based Models in Health Care Education and Practice: Development Study Involving a Literature Review

Requirements and reliability of AI in the medical context

Towards Quality Management of Machine Learning Systems for Medical Applications

Towards quality management of artificial intelligence systems for medical applications

Towards a safe and efficient clinical implementation of machine learning in radiation oncology by exploring model interpretability, explainability and data-model dependency