Deep Learning with Weak Annotation from Diagnosis Reports for Detection of Multiple Head Disorders: a Prospective, Multicentre Study

Yuchen Guo, Yuwei He, Jinhao Lyu, Zhanping Zhou, Dong Yang, Liangdi Ma, Hao-tian Tan, Changjian Chen, Wei Zhang, Jianxing Hu, Dongshan Han, Guiguang Ding, Shixia Liu, Hui Qiao, Feng Xu, Xin Lou, Qionghai Dai
DOI: https://doi.org/10.1016/s2589-7500(22)00090-5
2022-01-01
Abstract:

Background: A large training dataset with high-quality annotations is necessary for building an accurate and generalisable deep learning system, but such a dataset can be difficult and expensive to prepare in medical applications. We present a novel deep-learning-based system that requires no human annotator and relies only on weak annotation derived from diagnosis reports, with accurate and generalisable performance in detecting multiple head disorders from CT scans, including ischaemia, haemorrhage, tumours, and skull fractures.

Methods: Our system was developed on 104 597 head CT scans from the Chinese PLA General Hospital, with associated textual diagnosis reports. Without expert annotation, we used keyword matching on the reports to automatically generate disorder labels for each scan. The labels were inaccurate because of the unreliable annotator-free strategy and inexact because of scan-level annotation. We proposed RoLo, a novel weakly supervised learning algorithm, with a noise-tolerant mechanism and a multi-instance learning strategy to address these issues. RoLo was tested on retrospective (2357 scans from the Chinese PLA General Hospital), prospective (650 scans from the Chinese PLA General Hospital), cross-centre (1525 scans from the Brain Hospital of Hunan Province), cross-equipment (1484 scans from the Chinese PLA General Hospital), and cross-nation (CQ500 public dataset from India) test datasets. Four radiologists were tested on the prospective test dataset before and after viewing system recommendations to assess whether the system could improve diagnostic performance.

Findings: The area under the receiver operating characteristic curve for detecting the four disorder types was 0.976 (95% CI 0.976-0.976) for retrospective, 0.975 (0.974-0.976) for prospective, 0.965 (0.964-0.966) for cross-centre, and 0.971 (0.971-0.972) for cross-equipment test datasets, and 0.964 (0.964-0.966) for CQ500 (with only haemorrhage and fracture). The system achieved similar performance to the four radiologists and helped to improve their sensitivity and specificity by 0.109 (95% CI 0.086-0.131) and 0.022 (0.017-0.026), respectively.

Interpretation: Without expert-annotated data, our system achieved accurate and generalisable performance for head disorder detection. The system improved the diagnostic performance of radiologists. Because of its accuracy and generalisability, our computer-aided diagnostic system could be used in clinical practice to improve the accuracy and efficiency of radiologists in different hospitals.

© 2022 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY-NC-ND 4.0 license.
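
The annotator-free labelling and the scan-level (inexact) supervision described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation. The keyword lists, the function names `weak_labels` and `scan_score`, and the use of max pooling over slice scores are all assumptions made here for illustration; the abstract does not specify the actual keywords or how RoLo aggregates instances or tolerates label noise.

```python
import re

# Hypothetical keyword lists; the abstract does not give the real ones.
KEYWORDS = {
    "ischaemia": ["ischaemia", "ischemia", "infarct"],
    "haemorrhage": ["haemorrhage", "hemorrhage", "haematoma", "hematoma"],
    "tumour": ["tumour", "tumor", "mass lesion"],
    "fracture": ["fracture"],
}


def weak_labels(report: str) -> dict[str, int]:
    """Return a 0/1 label per disorder based on keyword presence in the report."""
    text = report.lower()
    return {
        disorder: int(any(re.search(r"\b" + re.escape(k), text) for k in kws))
        for disorder, kws in KEYWORDS.items()
    }


def scan_score(slice_scores: list[float]) -> float:
    """Pool per-slice scores into one scan-level score.

    Max pooling is a common multi-instance learning choice (a scan is
    positive if any slice is positive); it is assumed here, not taken
    from the paper.
    """
    return max(slice_scores)


report = "Acute infarct in the left MCA territory. No skull fracture."
print(weak_labels(report))
print(scan_score([0.1, 0.8, 0.3]))
```

In this example the phrase "No skull fracture" still triggers a positive fracture label, because plain keyword matching ignores negation. This is one way such labels can end up inaccurate, which is the kind of noise the abstract says the noise-tolerant mechanism is meant to address.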