Evaluation of Risk of Bias in Neuroimaging-Based Artificial Intelligence Models for Psychiatric Diagnosis: A Systematic Review

Zhiyi Chen,Xuerong Liu,Qingwu Yang,Yan-Jiang Wang,Kuan Miao,Zheng Gong,Yang Yu,Artemiy Leonov,Chunlei Liu,Zhengzhi Feng,Hu Chuan-Peng
DOI: https://doi.org/10.1001/jamanetworkopen.2023.1671
2023-01-01
JAMA Network Open
Abstract:IMPORTANCE Neuroimaging-based artificial intelligence (AI) diagnostic models have proliferated in psychiatry. However, their clinical applicability and reporting quality (ie, feasibility) for clinical practice have not been systematically evaluated. OBJECTIVE To systematically assess the risk of bias (ROB) and reporting quality of neuroimaging-based AI models for psychiatric diagnosis. EVIDENCE REVIEW PubMed was searched for peer-reviewed, full-length articles published between January 1, 1990, and March 16, 2022. Studies aimed at developing or validating neuroimaging-based AI models for clinical diagnosis of psychiatric disorders were included. Reference lists were further searched for suitable original studies. Data extraction followed the CHARMS (Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies) and PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-analyses) guidelines. A closed-loop cross-sequential design was used for quality control. The PROBAST (Prediction Model Risk of Bias Assessment Tool) and modified CLEAR (Checklist for Evaluation of Image-Based Artificial Intelligence Reports) benchmarks were used to systematically evaluate ROB and reporting quality. FINDINGS A total of 517 studies presenting 555 AI models were included and evaluated. Of these models, 461 (83.1%; 95% CI, 80.0%-86.2%) were rated as having a high overall ROB based on the PROBAST. The ROB was particular high in the analysis domain, including inadequate sample size (398 of 555 models [71.7%; 95% CI, 68.0%-75.6%]), poor model performance examination (with 100% of models lacking calibration examination), and lack of handling data complexity (550 of 555 models [99.1%; 95% CI, 98.3%-99.9%]). None of the AI models was perceived to be applicable to clinical practices. Overall reporting completeness (ie, number of reported items/number of total items) for the AI models was 61.2% (95% CI, 60.6%-61.8%), and the completeness was poorest for the technical assessment domain with 39.9% (95% CI, 38.8%-41.1%). CONCLUSIONS AND RELEVANCE This systematic review found that the clinical applicability and feasibility of neuroimaging-based AI models for psychiatric diagnosis were challenged by a high ROB and poor reporting quality. Particularly in the analysis domain, ROB in AI diagnostic models should be addressed before clinical application.
What problem does this paper attempt to address?