Abstract: We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records containing tabular variables (like symptoms, diagnoses and underlying conditions) and related notes describing the fictional patient encounter in the domain of respiratory diseases. The tabular portion of the data is generated through a Bayesian network, where both the causal structure between the variables and the conditional probabilities are proposed by an expert based on domain knowledge. We then prompt a large language model (GPT-4o) to generate a clinical note related to this patient encounter, describing the patient symptoms and additional context. The SynSUM dataset is primarily designed to facilitate research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text (the symptoms, in the case of SynSUM). Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. The dataset can be downloaded from https://github.com/prabaey/SynSUM.

What problem does this paper attempt to address?

This paper aims to improve the accuracy and efficiency of clinical information extraction (CIE) when patient records contain both structured and unstructured data. Specifically, the authors construct a synthetic dataset, SynSUM, intended to improve the extraction of concepts (such as symptoms) from text by incorporating domain knowledge (such as Bayesian networks).

### Main problems:

1. **Complexity and Inconsistency**: Existing electronic health records (EHRs) combine structured tabular data with unstructured free text, and in practice these data are complex and inconsistent. For example, although the MIMIC-III and MIMIC-IV datasets contain mixed data, they are too complex for preliminary research, and their encoded tabular features serve mainly billing purposes rather than completeness and accuracy.
2. **Lack of Application of Domain Knowledge**: Existing CIE systems do not fully use the available medical domain knowledge to supply the background information that automated systems need when extracting concepts.
3. **Need for a Suitable Benchmark Dataset**: To study how structured and unstructured data can be combined for better clinical information extraction, researchers need a dataset with the following characteristics:
   - It contains both structured tabular data and unstructured text.
   - Its tabular variables can be linked to concepts in the text through domain knowledge.
   - It has no temporal complexity.
   - Its text contains additional context that helps in understanding the tabular variables.

### Solutions:

The SynSUM dataset is designed to address the above challenges. It includes:

- **10,000 synthetic patient records**, each consisting of structured tabular variables and free text describing the patient's visit.
- **Bayesian network modeling**: An expert-defined Bayesian network generates the structured tabular data, with the causal structure and conditional probabilities set based on domain knowledge.
- **Large-language-model-generated text**: Clinical notes are generated from the tabular data, so that the text is consistent with the tabular data and contains the necessary context.

In this way, the SynSUM dataset provides a resource for studying how to combine structured and unstructured data for more effective clinical information extraction. It can also be applied to other research areas, such as automated clinical reasoning, causal effect estimation, and multi-modal synthetic data generation.

### Formula Summary:

1. **Bayesian Network Joint Probability Distribution**:
   \[ P(\text{asthma}, \text{smoking}, \ldots, \text{antibio}, \#\text{days}) = P(\text{asthma}) \, P(\text{smoking}) \, P(\text{COPD} \mid \text{smoking}) \cdots \]
   where each conditional probability distribution is defined according to the variable in question.
2. **Noisy-OR Distribution**:
   \[ P(Y = 1 \mid X_1, \ldots, X_k) = 1 - (1 - p_0)(1 - p_1)^{x_1} \cdots (1 - p_k)^{x_k} \]
   For example, for the symptom dyspnea:
   \[ P(\text{dysp} \mid \text{asthma}, \text{smoking}, \text{COPD}, \text{hayf}, \text{pneu}) = \text{Noisy-OR}(p_0 = 0.05, p_{\text{asthma}} = 0.9, \ldots) \]
3. **Antibiotic Prescription Probability**:
   \[ P(\text{antibio} = \text{yes} \mid \text{policy} = x_{\text{po}}, \text{dysp} = x_d, \text{cough} = x_c, \ldots) \]
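
The Noisy-OR distribution from the formula summary can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `noisy_or` and the dictionary layout are assumptions made for illustration. Only the two parameters stated above for dyspnea are used (`p_0 = 0.05`, `p_asthma = 0.9`); the remaining parents (smoking, COPD, hay fever, pneumonia) are left out because their activation probabilities are not given here.

```python
from typing import Mapping


def noisy_or(leak: float,
             strengths: Mapping[str, float],
             active: Mapping[str, int]) -> float:
    """Return P(Y = 1 | parents) = 1 - (1 - p_0) * prod_i (1 - p_i) ** x_i.

    leak      -- p_0, the probability that Y occurs with no active parent
    strengths -- p_i for each parent (hypothetical layout, keyed by name)
    active    -- x_i in {0, 1} for each parent; missing parents count as 0
    """
    prob_off = 1.0 - leak  # probability that the leak does not trigger Y
    for name, p in strengths.items():
        x = active.get(name, 0)
        prob_off *= (1.0 - p) ** x  # an inactive parent contributes a factor of 1
    return 1.0 - prob_off


# Parameters quoted in the summary for dyspnea; other parents omitted (unknown here).
DYSP_LEAK = 0.05
DYSP_STRENGTHS = {"asthma": 0.9}

p_with = noisy_or(DYSP_LEAK, DYSP_STRENGTHS, {"asthma": 1})
p_without = noisy_or(DYSP_LEAK, DYSP_STRENGTHS, {"asthma": 0})
print(f"P(dysp | asthma=1) = {p_with:.3f}")     # 1 - 0.95 * 0.10 = 0.905
print(f"P(dysp | asthma=0) = {p_without:.3f}")  # only the leak remains: 0.050
```

With asthma as the only active cause, the symptom is absent only if both the leak and asthma fail to trigger it, which gives 1 - 0.95 × 0.10 = 0.905. With no active cause, the probability falls back to the leak term p_0.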

SynSUM -- Synthetic Benchmark with Structured and Unstructured Medical Records
