TELII: Temporal Event Level Inverted Indexing for Cohort Discovery on a Large Covid-19 EHR Dataset

Yan Huang
2024-10-23
Abstract: Cohort discovery is a crucial step in clinical research on Electronic Health Record (EHR) data. Temporal queries, which are common in cohort discovery, can be time-consuming and prone to errors when processed on large EHR datasets. In this work, we introduce TELII, a temporal event level inverted indexing method designed for cohort discovery on large EHR datasets. TELII is engineered to pre-compute and store the relations along with the time difference between events, thereby providing fast and accurate temporal query capabilities. We implemented TELII for the OPTUM de-identified COVID-19 EHR dataset, which contains data from 8.87 million patients. We demonstrate four common temporal query tasks and their implementation using TELII with a MongoDB backend. Our results show that the temporal query speed for TELII is up to 2000 times faster than that of existing non-temporal inverted indexes. TELII achieves millisecond-level response times, enabling users to quickly explore event relations and find preliminary evidence for their research questions. Not only is TELII practical and straightforward to implement, but it also offers easy adaptability to other EHR datasets. These advantages underscore TELII's potential to serve as the query engine for EHR-based applications, ensuring fast, accurate, and user-friendly query responses.
Databases, Information Retrieval
What problem does this paper attempt to address?
### What problems does this paper attempt to solve?

This paper addresses the inefficiency and error-proneness of temporal queries during cohort discovery on large-scale electronic health record (EHR) datasets. Specifically:

1. **Challenges of temporal queries**: Temporal queries on EHR data are common in clinical research, but they can be time-consuming and error-prone on large-scale datasets. Traditional query tools usually cannot handle queries with time constraints efficiently.
2. **Deficiencies of existing methods**: Existing non-temporal inverted-index methods perform poorly on temporal queries. On large-scale EHR datasets in particular, they need substantial computing resources and time to process these queries.
3. **Proposal of TELII**: To address these challenges, the authors propose **TELII (Temporal Event Level Inverted Indexing)**. TELII provides fast and accurate temporal query capabilities by pre-computing and storing the temporal relations and time differences between events. Users can then run complex event-relation queries within milliseconds, which speeds up the search for preliminary research evidence.

### Specific improvements of TELII

- **Pre-computed temporal relations**: TELII pre-computes and stores the temporal relations between events (such as "before", "after", and "occurring simultaneously") together with their time differences, so these relations do not have to be computed at query time.
- **Optimized query speed**: TELII is implemented with a MongoDB back end. Experimental results show that its temporal queries are up to 2,000 times faster than those of existing non-temporal inverted indexes, with millisecond-level response times.
- **High adaptability**: TELII was implemented for the OPTUM® de-identified COVID-19 EHR dataset (8.87 million patients), but it can be easily adapted to other EHR datasets, giving it broad application potential.

### Summary

This paper addresses the low efficiency of temporal queries on large-scale EHR datasets by introducing TELII, a faster and more accurate query method for clinical research. The method improves query speed and simplifies the handling of complex temporal relations, so researchers can explore relations between events more efficiently and find preliminary evidence for their research questions.
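
The pre-computation idea can be illustrated with a small in-memory sketch. It is not the authors' implementation: the paper uses a MongoDB back end, whereas this example uses plain Python dictionaries, and the record layout, the event codes, the "first occurrence per patient" simplification, and the function names `build_index` and `patients_with_b_after_a` are all assumptions made for illustration. The point is only that once the time difference for each event pair is stored, a temporal query becomes a lookup plus a filter rather than a scan over raw records.

```python
from collections import defaultdict
from datetime import date
from itertools import permutations

# Hypothetical toy records: patient id -> list of (event code, event date).
records = {
    "p1": [("COVID", date(2020, 3, 1)), ("LOSS_OF_SMELL", date(2020, 3, 11))],
    "p2": [("LOSS_OF_SMELL", date(2020, 1, 5)), ("COVID", date(2020, 4, 1))],
    "p3": [("COVID", date(2020, 5, 1)), ("LOSS_OF_SMELL", date(2020, 7, 15))],
}


def build_index(records):
    """Map (event_a, event_b) -> {patient_id: days from a to b}.

    Simplifying assumption: only the first occurrence of each event
    per patient is indexed.
    """
    index = defaultdict(dict)
    for pid, events in records.items():
        first = {}
        for code, day in events:
            if code not in first or day < first[code]:
                first[code] = day
        # Store the signed day difference for every ordered event pair.
        for a, b in permutations(first, 2):
            index[(a, b)][pid] = (first[b] - first[a]).days
    return index


def patients_with_b_after_a(index, a, b, min_days=0, max_days=None):
    """Patients whose event b falls min_days..max_days after event a."""
    return sorted(
        pid
        for pid, diff in index.get((a, b), {}).items()
        if diff >= min_days and (max_days is None or diff <= max_days)
    )


index = build_index(records)
within_30 = patients_with_b_after_a(index, "COVID", "LOSS_OF_SMELL", 0, 30)
any_time_after = patients_with_b_after_a(index, "COVID", "LOSS_OF_SMELL")
smell_first = patients_with_b_after_a(index, "LOSS_OF_SMELL", "COVID", min_days=1)
```

In a document store, each `(event_a, event_b)` entry could plausibly become one document keyed by the event pair, but that mapping is a guess and is not taken from the paper.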