Abstract:
Background: Clinical Data Warehouses (CDW) reuse Electronic Health Records (EHR) to make their data retrievable for research purposes or for patient recruitment for clinical trials. However, much information is hidden in unstructured data such as discharge letters. Such texts can be preprocessed and converted to structured data via information extraction (IE), which is unfortunately a laborious task and therefore usually not available for most of the text data in a CDW.
Objectives: The goal of our work is to provide an ad hoc IE service that allows users to query text data in a manner similar to querying structured data in a CDW. While search engines just return text snippets, our system also returns frequencies (e.g. how many patients exist with "heart failure", including textual synonyms, or how many patients have an LVEF < 45) based on the content of discharge letters or textual reports for special investigations such as echocardiography. Three subtasks are addressed: (1) to recognize and exclude negations and their scopes, (2) to extract concepts, i.e. Boolean values, and (3) to extract numerical values.
Methods: We implemented an extended version of the NegEx algorithm for German texts that detects negations and determines their scope. Furthermore, our document-oriented CDW PaDaWaN was extended with query functions, e.g. context-sensitive queries and regex queries, and with an extraction mode for computing the frequencies of Boolean and numerical values.
Results: Evaluations on chest X-ray reports and discharge letters showed high F1-scores for the three subtasks: detection of negated concepts reached an F1-score of 0.99 in chest X-ray reports and 0.97 in discharge letters; extraction of Boolean values in chest X-ray reports reached about 0.99; and extraction of numerical values in chest X-ray reports and discharge letters also reached around 0.99, with the exception of the concept age.
Discussion: The advantages of ad hoc IE over standard IE are the low development effort (just entering the concept with its variants), the promptness of the results, and the adaptability by users to their particular question. The disadvantages are usually lower accuracy and confidence. This ad hoc information extraction approach is novel and goes beyond existing systems: Roogle [1] extracts predefined concepts from texts at preprocessing and makes them retrievable at runtime. Dr. Warehouse [2] applies negation detection and indexes the resulting subtexts, which contain the affirmed findings. Our approach combines negation detection and concept extraction, but the extraction takes place at runtime rather than during preprocessing. This provides an ad hoc, dynamic, interactive and adjustable extraction of arbitrary concepts, and even their values, on the fly.
Conclusions: We developed an ad hoc information extraction query feature for Boolean and numerical values within a CDW, with high recall and precision, based on a pipeline that detects and removes negations and their scope in clinical texts.
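The three subtasks named in the abstract (removing negated scopes, Boolean concept extraction, numerical value extraction) can be illustrated with a minimal Python sketch. It is not the extended NegEx implementation or the PaDaWaN query engine. The trigger list, the scope rule (from a trigger to the end of the sentence or to a terminator word), the LVEF regex, the function names and the three sample texts are all hypothetical and chosen only for illustration. Backward-scoped triggers such as "nicht" are deliberately left out.

```python
import re

# Hypothetical mini-lexicons (assumptions, not the paper's German NegEx lexicon).
NEG_TRIGGERS = ("kein", "keine", "keinen", "ohne")
SCOPE_TERMINATORS = ("aber", "jedoch")


def strip_negated_scopes(text):
    """Drop each negation trigger and its forward scope, sentence by sentence."""
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        out, negated = [], False
        for tok in sentence.split():
            word = tok.lower().strip(".,;:!?")
            if word in NEG_TRIGGERS:
                negated = True          # scope opens at the trigger
                continue
            if word in SCOPE_TERMINATORS:
                negated = False         # scope closes at a terminator
            if not negated:
                out.append(tok)
        if out:
            kept.append(" ".join(out))
    return " ".join(kept)


def has_concept(text, variants):
    """Boolean value: is any variant mentioned outside a negated scope?"""
    affirmed = strip_negated_scopes(text).lower()
    return any(v.lower() in affirmed for v in variants)


def extract_value(text, pattern):
    """Numerical value: first regex match outside a negated scope, or None."""
    m = pattern.search(strip_negated_scopes(text))
    return float(m.group(1).replace(",", ".")) if m else None


def count_boolean(docs, variants):
    """Number of patients whose text affirms the concept."""
    return sum(has_concept(t, variants) for t in docs.values())


def count_numeric(docs, pattern, predicate):
    """Number of patients whose extracted value satisfies the predicate."""
    values = (extract_value(t, pattern) for t in docs.values())
    return sum(1 for v in values if v is not None and predicate(v))


# Hypothetical query definitions: concept variants and a regex with one value group.
HEART_FAILURE = ["Herzinsuffizienz", "Herzschwäche"]
LVEF_PATTERN = re.compile(r"LVEF\s*(?:von|:|=)?\s*(\d+(?:[.,]\d+)?)\s*%", re.IGNORECASE)

# Invented sample texts, one per patient.
docs = {
    "p1": "Bekannte Herzinsuffizienz. LVEF 40 %.",
    "p2": "Kein Hinweis auf Herzinsuffizienz. LVEF: 60 %.",
    "p3": "Keine Stauung, aber Herzschwäche. LVEF von 35 %.",
}

print(count_boolean(docs, HEART_FAILURE))                    # patients with affirmed heart failure
print(count_numeric(docs, LVEF_PATTERN, lambda v: v < 45))   # patients with LVEF < 45
```

Because negated scopes are removed before matching, the same concept variants and regex can be changed at query time without re-running any preprocessing, which is the property the Discussion attributes to ad hoc extraction.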

Dealing with Sparse Document and Topic Representations: Lab Report for CHiC 2012

What's in a ? Cross-Lingual Topic Detection & Information Retrieval in Archives Portal Europe

Ad Hoc Information Extraction for Clinical Data Warehouses

Semantic Publishing Challenge -- Assessing the Quality of Scientific Output

Exploratory Analysis of Highly Heterogeneous Document Collections

Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion

Extracting Event-Centric Document Collections from Large-Scale Web Archives

DWIE: An entity-centric dataset for multi-task document-level information extraction

Generating and Exploiting Semantically Enriched, Integrated, Linked and Open Museum Data

Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization

Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles

Building Custom Term Suggestion Web Services with OAI-Harvested Open Data

Understanding Archives: Towards New Research Interfaces Relying on the Semantic Annotation of Documents

Semantic Publishing Challenge - Assessing the Quality of Scientific Output by Information Extraction and Interlinking

Generating Harder Cross-document Event Coreference Resolution Datasets using Metaphoric Paraphrasing

Fine-grained information extraction from German transthoracic echocardiography reports

Textual Analysis of ICALEPCS and IPAC Conference Proceedings: Revealing Research Trends, Topics, and Collaborations for Future Insights and Advanced Search

Natural Language Processing Chains Inside a Cross-lingual Event-Centric Knowledge Pipeline for European Union Under-resourced Languages

Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment

Hacking History: Automatic Historical Event Extraction for Enriching Cultural Heritage Multimedia Collections

Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data