Abstract: Background: Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. Objective: This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. Methods: A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. The methodologies of the included records were examined carefully to identify common pertinent themes and possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. Two independent raters then used the finalized checklist to evaluate the included records. Interrater reliability was evaluated using Cohen κ. Results: The final data set used for theme identification and analysis comprised 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58).
Interrater reliability of the METRICS scoring was acceptable, with Cohen κ ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). By item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and the "Individual factors" item (classified as satisfactory). Conclusions: The METRICS checklist can facilitate study design by guiding researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could serve as a preliminary basis for establishing a universally accepted approach to standardizing the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic.
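
As an illustration of the interrater reliability statistic named in the Methods, the following is a minimal sketch of unweighted Cohen κ for 2 raters scoring the same records on one checklist item. The function name `cohen_kappa`, the example scores, and the choice of the unweighted form are assumptions made for illustration; the abstract does not state the rating scale or whether a weighted variant was used.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions,
    # summed over the score categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical category for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores for one checklist item across 8 records.
rater_1 = [5, 4, 3, 3, 2, 5, 4, 1]
rater_2 = [5, 4, 3, 2, 2, 5, 3, 1]
kappa = cohen_kappa(rater_1, rater_2)
print(f"Cohen kappa: {kappa:.3f}")
```

In this sketch the raters agree on 6 of 8 records (observed agreement 0.75), and agreement expected by chance is 13/64, so κ = (0.75 − 0.203) / (1 − 0.203) ≈ 0.686. For ordinal scores, a weighted κ may be more appropriate than the unweighted form shown here.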

Data Checklist: On Unit-Testing Datasets with Usable Information

Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Taxonomy-based CheckList for Large Language Model Evaluation

Learning Optimal Predictive Checklists

DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems

Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models

TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation

Checklist and guidance on creating codelists for routinely collected health data research

Checkworthiness in Automatic Claim Detection Models: Definitions and Analysis of Datasets

Reproducibility in NLP: What Have We Learned from the Checklist?

Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL

SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists

Check-Eval: A Checklist-based Approach for Evaluating Text Quality

Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs

LMUnit: Fine-grained Evaluation with Natural Language Unit Tests

A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence–Based Models in Health Care Education and Practice: Development Study Involving a Literature Review

ACL Ready: RAG Based Assistant for the ACL Checklist

Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset

Analyzing Dataset Annotation Quality Management in the Wild

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

Learning predictive checklists from continuous medical data