Abstract: Background: Information extraction techniques that derive structured representations from unstructured data make a large amount of clinically relevant patient information accessible to semantic applications. These methods typically rely on standardized terminologies to guide the process. Many languages and clinical domains, however, lack appropriate resources and tools, as well as evaluations of their applications, especially when detailed conceptualizations of the domain are required. For instance, German transthoracic echocardiography reports have not been targeted sufficiently before, despite their importance for clinical trials. This work therefore aimed to develop and evaluate an information extraction component with a fine-grained terminology that enables recognition of almost all relevant information stated in German transthoracic echocardiography reports at the University Hospital of Würzburg.
Methods: A domain expert validated and iteratively refined an automatically inferred base terminology. The terminology was used by an ontology-driven information extraction system that outputs attribute-value pairs. The final component was mapped to the central elements of a standardized terminology and evaluated on documents with different layouts.
Results: The final system achieved state-of-the-art precision (micro average .996) and recall (micro average .961) on 100 test documents that represent more than 90% of all reports. In particular, principal aspects as defined in a standardized external terminology were recognized with F1 = .989 (micro average) and F1 = .963 (macro average). As a result of keyword matching and restrained concept extraction, the system also obtained high precision on unstructured or exceptionally short documents and on documents with uncommon layouts.
Conclusions: The developed terminology and the proposed information extraction system allow the extraction of fine-grained information from German semi-structured transthoracic echocardiography reports with very high precision and high recall on the majority of documents at the University Hospital of Würzburg. The extracted results populate a clinical data warehouse that supports clinical research.
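
The results above are reported as both micro and macro averages. The following minimal sketch illustrates how the two differ; it is not the authors' evaluation code. The attribute names and counts are hypothetical, and macro F1 is assumed here to be the unweighted mean of per-attribute F1 scores (other definitions exist).

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """True positives, false positives, false negatives for one attribute."""
    tp: int
    fp: int
    fn: int


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1(c: Counts) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def micro_f1(per_attribute: dict[str, Counts]) -> float:
    """Pool all counts first, then compute F1 once (frequent attributes dominate)."""
    pooled = Counts(
        tp=sum(c.tp for c in per_attribute.values()),
        fp=sum(c.fp for c in per_attribute.values()),
        fn=sum(c.fn for c in per_attribute.values()),
    )
    return f1(pooled)


def macro_f1(per_attribute: dict[str, Counts]) -> float:
    """Compute F1 per attribute, then average (every attribute weighs the same)."""
    return sum(f1(c) for c in per_attribute.values()) / len(per_attribute)


# Hypothetical counts, for illustration only (not taken from the study).
example = {
    "frequent_attribute": Counts(tp=90, fp=0, fn=10),
    "rare_attribute": Counts(tp=1, fp=0, fn=1),
}
micro = micro_f1(example)
macro = macro_f1(example)
```

In this toy example the micro average exceeds the macro average because the rarely occurring attribute is recognized less reliably, which is one common reason for a gap in that direction.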

Semi-automatic rule-based domain terminology and software feature-relevant information extraction from natural language user manuals

Fine-grained information extraction from German transthoracic echocardiography reports

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

Automatic Knowledge Structuration of Automotive User Manual for Question Answering

Automatic Extraction of Domain-Specific Terms

Knowledge Extraction from the Language Extended Lexicon Glossary Using Natural Language Processing

Towards information extraction from ISR reports for decision support using a two-stage learning-based approach

Automating the Information Extraction from Semi-Structured Interview Transcripts

Natural Language Processing for Requirements Formalization: How to Derive New Approaches?

Semi-Automatically Extracting FAQs to Improve Accessibility of Software Development Knowledge

Treatment of Semantic Heterogeneity in Information Retrieval

Legal information retrieval for understanding statutory terms

Automated concept-level information extraction to reduce the need for custom software and rules development

Automatic extraction of requirements expressed in industrial standards: a way towards machine readable standards?

Autonomous requirements specification processing using natural language processing

Extracting Semantics from Maintenance Records

Beyond Rule-based Named Entity Recognition and Relation Extraction for Process Model Generation from Natural Language Text

Natural language processing for word sense disambiguation and information extraction

From Requirements to Architecture: An AI-Based Journey to Semi-Automatically Generate Software Architectures

Knowing-how & Knowing-that: A New Task for Machine Comprehension of User Manuals

Large-scale information retrieval in software engineering -- an experience report from industrial application