A scalable approach for critical care data extraction and analysis in an academic medical center

Sebastian Daniel Boie, Falk Meyer-Eschenbach, Fabian Schreiber, Niklas Giesa, Jon Barrenetxea, Camille Guinemer, Stefan Haufe, Michael Krämer, Peter Brunecker, Fabian Prasser, Felix Balzer
DOI: https://doi.org/10.1016/j.ijmedinf.2024.105611
Abstract: Background: Electronic health records are a valuable asset for research, but their use is challenging because records are inconsistent, formats are heterogeneous, and data are distributed over multiple, non-integrated information systems. Hence, specialized health data engineering and data science expertise is required to enable research. To facilitate secondary use of clinical routine data collected in our intensive care wards, we developed a scalable approach consisting of cohort generation, variable filtering and data extraction steps. Objective: With this report we share our workflow of data request, cohort identification and data extraction. We present an algorithm for automatic data extraction from our critical care information system (CCIS) that can be adapted to other object-oriented databases. Methods: We introduced a data request process with functionalities for automated identification of patient cohorts and a specialized hierarchical data structure that supports filtering relevant variables from the CCIS and further systems for the specified cohorts. The data extraction algorithm takes patient pseudonyms and variable lists as inputs. Algorithms are implemented in Python, leveraging the PySpark framework running on our data lake infrastructure. Results: Our data request process has been in operational use since June 2022. Since then we have served 121 projects with 148 service requests in total. We discuss the hierarchical structure and the frequently used data items of our CCIS in detail and present an application example, including cohort selection, data extraction and data transformation into an analysis-ready format. Conclusions: Using clinical routine data for secondary research is challenging and requires an interdisciplinary team. We developed a scalable approach that automates cohort identification, data extraction and common data pre-processing steps. Additionally, we facilitate data harmonization and integration, and we consult on typical data analysis scenarios, machine learning algorithms and visualizations in dashboards.
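The extraction step described in the Methods (patient pseudonyms and a variable list as inputs, followed by transformation into an analysis-ready format) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper states that the algorithms run on PySpark, whereas the sketch below uses plain Python so that it is self-contained. The record layout and the names `extract`, `to_wide`, `pseudonym`, `variable`, `timestamp` and `value` are assumptions made for illustration only, and the sample values are invented.

```python
from collections import defaultdict


def extract(records, pseudonyms, variables):
    """Keep only rows that belong to the cohort and to the requested variables."""
    cohort = set(pseudonyms)
    wanted = set(variables)
    return [
        r for r in records
        if r["pseudonym"] in cohort and r["variable"] in wanted
    ]


def to_wide(rows):
    """Pivot long-format rows into one row per (pseudonym, timestamp)."""
    wide = defaultdict(dict)
    for r in rows:
        wide[(r["pseudonym"], r["timestamp"])][r["variable"]] = r["value"]
    return [
        {"pseudonym": p, "timestamp": t, **values}
        for (p, t), values in sorted(wide.items(), key=lambda kv: kv[0])
    ]


# Hypothetical long-format records, one measurement per row.
records = [
    {"pseudonym": "P001", "timestamp": "2022-06-01T08:00", "variable": "heart_rate", "value": 82},
    {"pseudonym": "P001", "timestamp": "2022-06-01T08:00", "variable": "spo2", "value": 97},
    {"pseudonym": "P001", "timestamp": "2022-06-01T08:00", "variable": "temperature", "value": 37.1},
    {"pseudonym": "P002", "timestamp": "2022-06-01T08:00", "variable": "heart_rate", "value": 75},
    {"pseudonym": "P003", "timestamp": "2022-06-01T09:00", "variable": "heart_rate", "value": 90},
]

selected = extract(records, pseudonyms=["P001", "P003"], variables=["heart_rate", "spo2"])
table = to_wide(selected)
```

Here `table` holds one row per patient and time point, with the requested variables as columns. In a distributed setting the same two steps would typically map to a filter (or join against the cohort) followed by a pivot.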