SynSUM -- Synthetic Benchmark with Structured and Unstructured Medical Records

Paloma Rabaey, Henri Arno, Stefan Heytens, Thomas Demeester
2024-09-13
Abstract: We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records containing tabular variables (like symptoms, diagnoses and underlying conditions) and related notes describing the fictional patient encounter in the domain of respiratory diseases. The tabular portion of the data is generated through a Bayesian network, where both the causal structure between the variables and the conditional probabilities are proposed by an expert based on domain knowledge. We then prompt a large language model (GPT-4o) to generate a clinical note related to this patient encounter, describing the patient symptoms and additional context. The SynSUM dataset is primarily designed to facilitate research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text (the symptoms, in the case of SynSUM). Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. The dataset can be downloaded from https://github.com/prabaey/SynSUM.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper aims to improve the accuracy and efficiency of Clinical Information Extraction (CIE) when both structured and unstructured medical records are available. To that end, the authors construct a synthetic dataset, SynSUM, intended to support extracting concepts (such as symptoms) from text with the help of domain knowledge (for example, encoded as a Bayesian network).

### Main problems:

1. **Complexity and Inconsistency**: Existing Electronic Health Records (EHRs) contain both structured tabular data and unstructured free text, and in practice these data are complex and inconsistent. For example, although the MIMIC-III and MIMIC-IV datasets contain mixed data, they are too complex for preliminary research, and their coded tabular features are recorded mainly for billing rather than for completeness and accuracy.
2. **Lack of Application of Domain Knowledge**: Existing CIE systems do not make full use of available medical domain knowledge to supply the background information that automated systems need when extracting concepts.
3. **Need for a Suitable Benchmark Dataset**: To study how structured and unstructured data can be combined for better clinical information extraction, researchers need a dataset that:
   - contains structured tabular data and unstructured text;
   - links tabular variables to concepts in the text through domain knowledge;
   - has no temporal complexity;
   - includes text with additional context that helps to understand the tabular variables.

### Solutions:

The SynSUM dataset is designed to address the above challenges. It includes:

- **10,000 synthetic patient records**, each consisting of structured tabular variables and free text describing the patient's visit.
- **Bayesian network modeling**: An expert-defined Bayesian network generates the structured tabular data; its causal structure and conditional probabilities are set based on domain knowledge.
- **Large-language-model-generated text**: Clinical notes are generated from the tabular data, so that the text is consistent with the tabular data and contains the necessary context.

In this way, the SynSUM dataset provides a resource for studying how to combine structured and unstructured data for more effective clinical information extraction. It can also be applied to other research areas, such as automated clinical reasoning, causal effect estimation, and multi-modal synthetic data generation.

### Formula Summary:

1. **Bayesian Network Joint Probability Distribution**:
   \[ P(\text{asthma}, \text{smoking}, \ldots, \text{antibio}, \#\text{days}) = P(\text{asthma})\,P(\text{smoking})\,P(\text{COPD} \mid \text{smoking}) \cdots \]
   where each factor is the conditional distribution of one variable given its parents in the network.
2. **Noisy-OR Distribution**:
   \[ P(Y = 1 \mid X_1, \ldots, X_k) = 1 - (1 - p_0)(1 - p_1)^{x_1} \cdots (1 - p_k)^{x_k} \]
   For example, for the symptom dyspnea:
   \[ P(\text{dysp} \mid \text{asthma}, \text{smoking}, \text{COPD}, \text{hayf}, \text{pneu}) = \text{Noisy-OR}(p_0 = 0.05, p_{\text{asthma}} = 0.9, \ldots) \]
3. **Antibiotic Prescription Probability**:
   \[ P(\text{antibio}=\text{yes}|\text{policy}=x_{\text{po}}, \text{dysp}=x_d, \text{cough}=x_c, \text{
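
The Noisy-OR distribution from the formula summary can be illustrated with a minimal Python sketch. It uses only the two dyspnea parameters quoted above (p_0 = 0.05 and p_asthma = 0.9); the remaining parent parameters are elided in the summary and are deliberately left out rather than guessed. The function name `noisy_or` and the dictionary layout are assumptions made for illustration, not taken from the SynSUM code.

```python
from typing import Mapping


def noisy_or(leak: float,
             strengths: Mapping[str, float],
             active: Mapping[str, int]) -> float:
    """Return P(Y=1 | parents) = 1 - (1 - p_0) * prod_i (1 - p_i)^{x_i}.

    leak      -- p_0, the probability that Y is on with every parent off
    strengths -- p_i for each parent (hypothetical dict layout)
    active    -- x_i in {0, 1}; parents missing from the dict count as 0
    """
    prob_off = 1.0 - leak
    for name, p_i in strengths.items():
        if active.get(name, 0):
            # An active parent independently fails to trigger Y
            # with probability (1 - p_i).
            prob_off *= 1.0 - p_i
    return 1.0 - prob_off


# Only the parameters quoted in the summary; the others are elided there.
LEAK = 0.05
STRENGTHS = {"asthma": 0.9}

p_none = noisy_or(LEAK, STRENGTHS, {})               # no active parent
p_asthma = noisy_or(LEAK, STRENGTHS, {"asthma": 1})  # asthma only
print(round(p_none, 3), round(p_asthma, 3))
```

With no active parent the probability of dyspnea reduces to the leak term, about 0.05; with asthma alone it is 1 - 0.95 × 0.1, about 0.905. Inactive parents contribute a factor of 1, which is why these two cases can be computed without knowing the elided parameters.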