Abstract:
Background: Clinical Data Warehouses (CDW) reuse electronic health records (EHR) to make their data retrievable for research purposes or patient recruitment for clinical trials. However, much information is hidden in unstructured data such as discharge letters. These texts can be preprocessed and converted to structured data via information extraction (IE), which is unfortunately a laborious task and therefore usually not available for most of the text data in a CDW.
Objectives: The goal of our work is to provide an ad hoc IE service that allows users to query text data in a manner similar to querying structured data in a CDW. While search engines just return text snippets, our system also returns frequencies (e.g., how many patients have "heart failure", including textual synonyms, or how many patients have an LVEF < 45) based on the content of discharge letters or textual reports for special investigations such as echocardiography. Three subtasks are addressed: (1) recognizing and excluding negations and their scopes, (2) extracting concepts, i.e., Boolean values, and (3) extracting numerical values.
Methods: We implemented an extended version of the NegEx algorithm for German texts that detects negations and determines their scope. Furthermore, our document-oriented CDW PaDaWaN was extended with query functions, e.g., context-sensitive queries and regex queries, and with an extraction mode for computing the frequencies of Boolean and numerical values.
Results: Evaluations on chest X-ray reports and discharge letters showed high F1-scores for the three subtasks: detection of negated concepts reached an F1-score of 0.99 in chest X-ray reports and 0.97 in discharge letters; extraction of Boolean values reached about 0.99 in chest X-ray reports; and extraction of numerical values reached around 0.99 in both chest X-ray reports and discharge letters, with the exception of the concept age.
Discussion: The advantages of ad hoc IE over standard IE are the low development effort (just entering the concept with its variants), the promptness of the results, and the adaptability by the user to his or her particular question. The disadvantages are usually lower accuracy and confidence. This ad hoc information extraction approach is novel and goes beyond existing systems: Roogle [1] extracts predefined concepts from texts during preprocessing and makes them retrievable at runtime. Dr. Warehouse [2] applies negation detection and indexes the resulting subtexts, which contain affirmed findings. Our approach combines negation detection and concept extraction, but the extraction takes place at runtime rather than during preprocessing. This provides an ad hoc, dynamic, interactive, and adjustable extraction of arbitrary concepts and even their values on the fly.
Conclusions: We developed an ad hoc information extraction query feature for Boolean and numerical values within a CDW, with high recall and precision, based on a pipeline that detects and removes negations and their scope in clinical texts.
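The three subtasks named in the abstract (negation removal, Boolean concept extraction, numerical value extraction) can be illustrated with a minimal Python sketch. This is not the PaDaWaN implementation or the authors' extended NegEx: the trigger lists, the fixed scope window `MAX_SCOPE_TOKENS`, the regex `LVEF_PATTERN`, and all function names are assumptions chosen for illustration, and real German NegEx lexicons are far larger.

```python
import re

# Hypothetical, deliberately tiny trigger lists (assumptions, not the paper's lexicon).
PRE_NEGATION = {"kein", "keine", "keinen", "ohne", "nicht"}
SCOPE_TERMINATORS = {"aber", "jedoch", "sondern"}
MAX_SCOPE_TOKENS = 5  # assumed maximum number of tokens a negation covers


def remove_negated_spans(text):
    """Drop negation triggers and the tokens inside their assumed scope."""
    kept_sentences = []
    # Naive sentence split: a scope never crosses a sentence boundary.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        kept, remaining_scope = [], 0
        for token in sentence.split():
            word = token.strip(".,;:!?").lower()
            if word in PRE_NEGATION:
                remaining_scope = MAX_SCOPE_TOKENS  # open a new scope
                continue
            if word in SCOPE_TERMINATORS:
                remaining_scope = 0  # a terminator closes the scope early
            if remaining_scope > 0:
                remaining_scope -= 1  # token is negated, so skip it
                continue
            kept.append(token)
        kept_sentences.append(" ".join(kept))
    return " ".join(s for s in kept_sentences if s)


def has_concept(text, variants):
    """Boolean extraction: is any variant mentioned in the affirmed text?"""
    affirmed = remove_negated_spans(text).lower()
    return any(variant.lower() in affirmed for variant in variants)


# Assumed regex query: the first number within 15 characters after "LVEF".
LVEF_PATTERN = re.compile(r"\bLVEF\D{0,15}?(\d+(?:[.,]\d+)?)", re.IGNORECASE)


def extract_number(text, pattern):
    """Numerical extraction: first captured value in the affirmed text, or None."""
    match = pattern.search(remove_negated_spans(text))
    return float(match.group(1).replace(",", ".")) if match else None


def count_documents(documents, predicate):
    """Frequency query: number of documents that satisfy the predicate."""
    return sum(1 for doc in documents if predicate(doc))


if __name__ == "__main__":
    # Invented example texts, one per patient.
    docs = [
        "Bekannte Herzinsuffizienz. LVEF 40 %.",
        "Keine Herzinsuffizienz. LVEF: 60 %.",
        "Kein Hinweis auf Herzinsuffizienz, aber LVEF 35 %.",
    ]
    variants = ["Herzinsuffizienz", "Herzschwäche"]
    print(count_documents(docs, lambda d: has_concept(d, variants)))  # 1

    def lvef_below_45(doc):
        value = extract_number(doc, LVEF_PATTERN)
        return value is not None and value < 45

    print(count_documents(docs, lvef_below_45))  # 2
```

Because negated spans are removed before any matching, the same preprocessed text serves both the concept query and the regex query, which is what allows a user to enter a new concept with its variants at query time instead of building a dedicated extractor in advance.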

Extraction of V-N-Collocations from Text Corpora: A Feasibility Study for German

A Survey of Corpora for Germanic Low-Resource Languages and Dialects

Syntaktische Merkmale Des Substantivs. Eine Dependenzbaumbasierte Quantitative Untersuchung

Extracting terminologically relevant collocations in the translation of Chinese monograph

A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events

Collocation Extraction Using Monolingual Word Alignment Method.

Association Measures for Collocation Extraction

Automated annotation of parallel bible corpora with cross-lingual semantic concordance

Ad Hoc Information Extraction for Clinical Data Warehouses

Casting a Wide Net: Robust Extraction of Potentially Idiomatic Expressions

A corpus-based analysis of adjective-noun collocations in the academic writing of native and non-native speakers of English

Testing the Use of a Collocation Retrieval Tool Without Prior Training by Learners of Spanish

A Corpus for Automatic Readability Assessment and Text Simplification of German

German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation

Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation

Collocation Use in EFL Learners’ Writing Across Multiple Language Proficiencies: A Corpus-Driven Study

Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data

Learner's Corpus-based Study on the Learners' Verb-object-noun Collocation

Cross-Lingual Constituency Parsing for Middle High German: A Delexicalized Approach

Multilingual Event Extraction from Historical Newspaper Adverts

Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language