RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke
2024-06-14
Abstract: Annually and globally, over three billion radiography examinations and computed tomography scans result in mostly unstructured radiology reports containing free text. Despite the potential benefits of structured reporting, its adoption is limited by factors such as established processes, resource constraints and potential loss of information. However, structured information is necessary for various use cases, including automatic analysis, clinical trial matching, and prediction of health outcomes. This study introduces RadEx, an end-to-end framework comprising 15 software components and ten artifacts to develop systems that perform automated information extraction from radiology reports. It covers the complete process from annotating training data to extracting information by offering a consistent generic information model and setting boundaries for model development. Specifically, RadEx allows clinicians to define relevant information for clinical domains (e.g., mammography) and to create report templates. The framework supports both generative and encoder-only models, and the decoupling of information extraction from template filling enables independent model improvements. Developing information extraction systems according to the RadEx framework facilitates implementation and maintenance, as components are easily exchangeable, while standardized artifacts ensure interoperability between components.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses the extraction of structured information from unstructured radiology reports. Over three billion radiography examinations and computed tomography scans are performed globally each year, and their results are typically recorded as free text in radiology reports. Structured information is needed for use cases such as automated analysis, clinical trial matching, and health outcome prediction, yet the adoption of structured reporting is limited by established workflows, resource constraints, and potential information loss.

The paper proposes RadEx, an end-to-end framework comprising 15 software components and 10 artifacts for developing automated information extraction systems. RadEx covers the entire process from annotating training data to extracting information, providing a generic information model and setting boundary conditions for model development. Specifically, RadEx allows clinicians to define the relevant information for a clinical domain (such as mammography) and to create report templates. The framework supports both generative and encoder-only models, and it decouples information extraction from template filling so that the model for each task can be improved independently. Because components are easily exchangeable, systems built this way are easier to implement and maintain, while standardized artifacts ensure interoperability between components.

### Main Contributions

1. **Structured Information Extraction**: The RadEx framework provides an end-to-end solution for extracting structured information from unstructured radiology reports.
2. **Flexibility and Scalability**: The framework lets clinicians define the facts to be extracted and supports different types of models (generative and encoder-only).
3. **Modular Design**: Separating information extraction from template filling reduces model complexity and allows each task to be improved independently.
4. **Standardization and Interoperability**: Standardized artifacts ensure interoperability between components, facilitating system maintenance and expansion.

### Application Case

The paper demonstrates the feasibility of the RadEx framework through a case study on extracting information from mammography reports. The research team first worked with clinicians and medical engineers to iteratively develop a fact model comprising 24 facts, 24 corresponding anchor entities, and 66 modifiers. They then drew a stratified sample of 210 mammography reports and annotated it. The annotated reports were used to fine-tune the medBERT.de model, which is based on the original BERT architecture and pre-trained on 4.7 million German medical documents. The final system consists of two pipelines: an extractive question-answering pipeline that extracts all facts from a report, and a sequence labeling pipeline that labels all tokens within each fact.

### Conclusion

The RadEx framework provides a set of components and artifacts for developing clinical information extraction systems; concrete systems are obtained by implementing these components with specific methods. By offering standardized, reusable infrastructure, the framework spares developers from designing an architecture from scratch and thereby accelerates development. The paper emphasizes the importance of continuous stakeholder involvement, particularly of clinicians, to ensure that the system meets its intended goals and does not introduce new challenges or obstacles into clinical workflows.
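
### Illustrative Sketch

The fact model (facts with anchor entities and modifiers) and the decoupling of information extraction from template filling can be illustrated with a minimal Python sketch. This is not the RadEx implementation: every class, function, label, and the example report are hypothetical, and a simple substring lookup stands in for the fine-tuned question-answering and sequence labeling models described in the paper. The sketch only shows how a standardized intermediate artifact (a list of extracted facts) lets the two stages be exchanged independently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class FactDefinition:
    """Hypothetical clinician-defined fact: one anchor entity plus modifiers."""
    name: str
    anchor: str                      # label of the anchor entity
    modifiers: tuple[str, ...] = ()  # labels of the modifiers


@dataclass
class ExtractedFact:
    """Intermediate artifact passed from extraction to template filling."""
    definition: FactDefinition
    anchor_text: str
    modifier_values: dict[str, str] = field(default_factory=dict)


def first_match(text: str, terms: list[str]) -> str | None:
    """Return the first term that occurs in the text, or None."""
    return next((term for term in terms if term in text), None)


def extract_facts(report, definitions, lexicon):
    """Stage 1 (assumption): a substring lookup replaces the trained models."""
    lowered = report.lower()
    facts = []
    for definition in definitions:
        anchor_text = first_match(lowered, lexicon.get(definition.anchor, []))
        if anchor_text is None:
            continue  # fact not mentioned in this report
        values = {}
        for modifier in definition.modifiers:
            hit = first_match(lowered, lexicon.get(modifier, []))
            if hit is not None:
                values[modifier] = hit
        facts.append(ExtractedFact(definition, anchor_text, values))
    return facts


def fill_template(template_fields, facts):
    """Stage 2: map extracted facts onto a report template, field by field."""
    by_name = {fact.definition.name: fact for fact in facts}
    filled = {}
    for field_name in template_fields:
        fact = by_name.get(field_name)
        if fact is None:
            filled[field_name] = None
        else:
            filled[field_name] = {"anchor": fact.anchor_text, **fact.modifier_values}
    return filled


# Invented example data; the case study itself used German mammography reports.
definitions = [
    FactDefinition("mass", anchor="mass_anchor", modifiers=("laterality", "size")),
    FactDefinition("calcification", anchor="calcification_anchor",
                   modifiers=("laterality",)),
]
lexicon = {
    "mass_anchor": ["mass"],
    "calcification_anchor": ["calcification"],
    "laterality": ["left", "right"],
    "size": ["12 mm"],
}
report = "Oval mass in the left breast, 12 mm. No suspicious findings otherwise."

facts = extract_facts(report, definitions, lexicon)
filled = fill_template(["mass", "calcification"], facts)
print(filled)
```

Because `fill_template` depends only on the list of `ExtractedFact` objects, the toy `extract_facts` could be replaced by a model-based extractor without changing the template-filling stage, which is the kind of independent improvement the decoupling is meant to enable.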