Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model

Zhizheng Wang, Xiao Fan Liu, Zhanwei Du, Lin Wang, Ye Wu, Petter Holme, Michael Lachmann, Hongfei Lin, Zoie S.Y. Wong, Xiao-Ke Xu, Yuanyuan Sun
DOI: https://doi.org/10.1016/j.isci.2022.105079
IF: 5.8
2022-10-21
iScience
Abstract: Although open-access data are increasingly common and useful to epidemiological research, the curation of such datasets is resource-intensive and time-consuming. Regularly disclosed case reports are a major source of COVID-19 data, but they were often written in natural language with an unstructured format. Here, we propose a computational framework that can automatically extract epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model developed using deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to the COVID-19 case reports collected from mainland China, our framework outperforms all other state-of-the-art deep learning models. The information extracted by our approach is highly consistent with that obtained from the gold-standard manual coding, with a matching rate of 80%. To disseminate our algorithm, we provide an open-access online platform that is able to estimate key epidemiological statistics in real time, with much less effort for data curation.
multidisciplinary sciences