Abstract: Annually and globally, over three billion radiography examinations and computed tomography scans result in mostly unstructured radiology reports containing free text. Despite the potential benefits of structured reporting, its adoption is limited by factors such as established processes, resource constraints and potential loss of information. However, structured information would be necessary for various use cases, including automatic analysis, clinical trial matching, and prediction of health outcomes. This study introduces RadEx, an end-to-end framework comprising 15 software components and ten artifacts to develop systems that perform automated information extraction from radiology reports. It covers the complete process from annotating training data to extracting information by offering a consistent generic information model and setting boundaries for model development. Specifically, RadEx allows clinicians to define relevant information for clinical domains (e.g., mammography) and to create report templates. The framework supports both generative and encoder-only models, and the decoupling of information extraction from template filling enables independent model improvements. Developing information extraction systems according to the RadEx framework facilitates implementation and maintenance as components are easily exchangeable, while standardized artifacts ensure interoperability between components.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses the extraction of structured information from unstructured radiology reports. Over 3 billion radiological imaging examinations (such as X-rays and CT scans) are conducted globally each year, and their results are typically recorded as free text in radiology reports. Although structured reporting has many potential benefits, its adoption is limited by established workflows, resource constraints, and potential information loss. Structured information is nevertheless needed for use cases such as automated analysis, clinical trial matching, and health outcome prediction.

The paper proposes an end-to-end framework named RadEx, which includes 15 software components and 10 artifacts for developing automated information extraction systems. The RadEx framework covers the entire process from annotating training data to extracting information, providing a generic information model and setting boundaries for model development. Specifically, RadEx allows clinicians to define relevant information for specific clinical domains (such as mammography) and to create report templates. The framework supports both generative and encoder-only models, and it separates information extraction from template filling so that each model can be improved independently. This makes the system easier to implement and maintain, while standardized artifacts ensure interoperability between components.

### Main Contributions

1. **Structured Information Extraction**: The RadEx framework provides an end-to-end solution to extract structured information from unstructured radiology reports.
2. **Flexibility and Scalability**: The framework is designed to be flexible, allowing clinicians to freely define the facts to be extracted, while supporting different types of models (generative and encoder-only models).
3. **Modular Design**: The separation of information extraction and template filling tasks reduces model complexity, allowing each task to be improved independently.
4. **Standardization and Interoperability**: Standardized artifacts ensure interoperability between components, facilitating system maintenance and expansion.

### Application Case

The paper demonstrates the feasibility of the RadEx framework through a case study focused on extracting information from mammography reports. The research team first collaborated with clinicians and medical engineers to iteratively develop a fact model, including 24 facts, 24 corresponding anchor entities, and 66 modifiers. They then performed stratified sampling and annotation on 210 mammography reports. These annotated reports were used to fine-tune the medBERT.de model, which is based on the original BERT architecture and pre-trained on 4.7 million German medical documents. The final system includes an extractive question-answering pipeline and a sequence labeling pipeline, used to extract all facts from the reports and label all tokens in each fact, respectively.

### Conclusion

The RadEx framework provides a set of components and artifacts for developing clinical information systems, which can be used to implement information extraction systems through specific methods. By providing standardized, reusable infrastructure, the framework spares developers from redeveloping architectures from scratch, thereby accelerating the development process. The paper emphasizes the importance of continuous stakeholder involvement, particularly the involvement of clinicians, to ensure that the system meets the intended goals and does not introduce new challenges or obstacles in clinical workflows.
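To make the separation of information extraction from template filling more concrete, here is a minimal Python sketch. It is not the RadEx implementation: the `Fact` class, the `fill_template` function, and the fact types and modifier names (`mass`, `laterality`, `size`, `shape`) are all hypothetical, chosen only to illustrate a fact made of an anchor entity plus modifiers. The extracted facts are written by hand here to stand in for the output of an extraction model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Fact:
    """One extracted fact: an anchor entity plus its modifiers (hypothetical schema)."""
    fact_type: str                      # illustrative label, e.g. "mass"
    anchor: str                         # report text of the anchor entity
    modifiers: Dict[str, str] = field(default_factory=dict)


def fill_template(
    template: Dict[str, List[str]], facts: List[Fact]
) -> Dict[str, List[Dict[str, Optional[str]]]]:
    """Fill a report template from already-extracted facts.

    `template` maps each fact type to the modifier names it reports.
    Missing modifiers become None. This step never reads report text,
    so the extraction model can be replaced without changing it.
    """
    filled: Dict[str, List[Dict[str, Optional[str]]]] = {
        fact_type: [] for fact_type in template
    }
    for fact in facts:
        if fact.fact_type not in template:
            continue  # fact type not requested by this template
        row: Dict[str, Optional[str]] = {"anchor": fact.anchor}
        for name in template[fact.fact_type]:
            row[name] = fact.modifiers.get(name)
        filled[fact.fact_type].append(row)
    return filled


# Hand-written stand-in for the output of an extraction model.
facts = [
    Fact("mass", "mass", {"laterality": "left", "size": "12 mm"}),
    Fact("calcification", "calcifications", {"laterality": "right"}),
]

# Hypothetical template that only reports masses.
template = {"mass": ["laterality", "size", "shape"]}
report = fill_template(template, facts)
```

Because `fill_template` depends only on the `Fact` structure, either side can change independently in this sketch: a different extraction model only has to produce `Fact` objects, and a new template only has to list the fact types and modifiers it needs.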

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models
