METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens

Zhanyu Wang,Lingqiao Liu,Lei Wang,Luping Zhou

2023-04-05

Abstract: In clinical scenarios, multi-specialist consultation could significantly benefit the diagnosis, especially for intricate cases. This inspires us to explore a "multi-expert joint diagnosis" mechanism to upgrade the existing "single expert" framework commonly seen in the current literature. To this end, we propose METransformer, a method to realize this idea with a transformer-based backbone. The key design of our method is the introduction of multiple learnable "expert" tokens into both the transformer encoder and decoder. In the encoder, each expert token interacts with both vision tokens and other expert tokens to learn to attend to different image regions for image representation. These expert tokens are encouraged to capture complementary information by an orthogonal loss that minimizes their overlap. In the decoder, each attended expert token guides the cross-attention between input words and visual tokens, thus influencing the generated report. A metrics-based expert voting strategy is further developed to generate the final report. Through the multi-expert concept, our model enjoys the merits of an ensemble-based approach, but in a manner that is computationally more efficient and supports more sophisticated interactions among experts. Experimental results demonstrate the promising performance of our proposed model on two widely used benchmarks. Last but not least, the framework-level innovation makes our work ready to incorporate advances in existing "single-expert" models to further improve its performance.
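
The orthogonal loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the formulation below is an assumption: it L2-normalizes each expert token and penalizes the squared deviation of their Gram matrix from the identity, which is one common way to discourage overlap. The names `l2_normalize` and `orthogonal_loss` are hypothetical, and plain Python lists stand in for learned embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def orthogonal_loss(expert_tokens):
    """Assumed formulation, not necessarily the paper's exact loss.

    Computes the mean squared difference between the Gram matrix of the
    unit-normalized expert tokens and the identity matrix. The value is 0
    when all tokens are mutually orthogonal and grows as they overlap.
    """
    unit = [l2_normalize(t) for t in expert_tokens]
    m = len(unit)
    if m == 0:
        return 0.0
    total = 0.0
    for i in range(m):
        for j in range(m):
            dot = sum(a * b for a, b in zip(unit[i], unit[j]))
            target = 1.0 if i == j else 0.0  # identity matrix entry
            total += (dot - target) ** 2
    return total / (m * m)


if __name__ == "__main__":
    # Two orthogonal tokens versus two identical tokens.
    print(orthogonal_loss([[1.0, 0.0], [0.0, 2.0]]))
    print(orthogonal_loss([[1.0, 0.0], [1.0, 0.0]]))
```

In a real model this term would be computed on the learned expert token embeddings with a tensor library and added to the report-generation loss; how the two are weighted is not stated in the abstract.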

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the automatic generation of medical imaging reports. Specifically, it focuses on the following aspects:

1. **Multi-Expert Joint Diagnosis Mechanism**:
   - Existing automated medical imaging report generation methods are mostly based on a "single expert" framework, where a single model generates the report. However, in clinical scenarios, multi-expert consultations are particularly beneficial for complex cases. Therefore, the authors propose a new framework, METransformer, which introduces multiple learnable "expert tokens" to mimic the process of multi-expert joint diagnosis.
2. **Improving Report Quality**:
   - Through the multi-expert mechanism, each expert token focuses on different regions of the image, and these tokens interact with each other to capture complementary information. This design helps to improve the quality and accuracy of the generated reports.
3. **Efficiency and Interactivity**:
   - While retaining the advantages of ensemble methods, METransformer is more computationally efficient and supports more complex interactions between experts. This allows the model to improve performance without significantly increasing the number of parameters.
4. **Experimental Validation**:
   - The paper conducts experimental validation on two widely used benchmark datasets (IU-Xray and MIMIC-CXR), demonstrating the superior performance of METransformer in both natural language generation (NLG) metrics and clinical efficacy metrics.

In summary, this paper aims to improve the performance of existing medical imaging report generation models through a multi-expert joint diagnosis mechanism, thereby enhancing the quality and accuracy of the generated reports.
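
Because each expert token yields its own candidate report, a final report has to be selected; the abstract calls this a metrics-based expert voting strategy. The sketch below is one plausible reading, not the authors' implementation: every candidate is scored by its summed similarity to all other candidates, and the highest-scoring candidate wins. The abstract does not say which metric is used, so a simple token-overlap F1 (`token_f1`) is a stand-in, and the function names and sample reports are hypothetical.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Stand-in similarity metric: F1 over whitespace-separated tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def vote(candidates, metric=token_f1):
    """Return the candidate that agrees most with the others, plus all scores.

    Each candidate's score is the sum of `metric` against every other
    candidate. Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate report is required")
    scores = [
        sum(metric(cand, other) for j, other in enumerate(candidates) if j != i)
        for i, cand in enumerate(candidates)
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


if __name__ == "__main__":
    reports = [
        "the heart is normal",
        "the heart is normal and lungs are clear",
        "lungs are clear",
        "fracture",
    ]
    winner, scores = vote(reports)
    print(winner)
    print(scores)
```

Passing a different `metric` callable swaps in any other pairwise report-similarity measure without changing the voting logic.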

Mmformer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation

Multi-modal transformer architecture for medical image analysis and automated report generation

Automated Radiographic Report Generation Purely on Transformer: A Multicriteria Supervised Approach

MATNet: Exploiting Multi-Modal Features for Radiology Report Generation

Generating Radiology Reports Via Multi-feature Optimization Transformer

MeFormer: Generating Radiology Reports via Memory Enhanced Pretraining Transformer

A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics

MMTN: Multi-Modal Memory Transformer Network for Image-Report Consistent Medical Report Generation

Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation

MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report

Toward an enhanced automatic medical report generator based on large transformer models

TransMed: Transformers Advance Multi-Modal Medical Image Classification

Vision Transformer and Language Model Based Radiology Report Generation

MECFormer: Multi-task Whole Slide Image Classification with Expert Consultation Network

Understanding transfer learning for chest radiograph clinical report generation with modified transformer architectures

Model long-range dependencies for multi-modality and multi-view retinopathy diagnosis through transformers

ClassFormer: Exploring Class-Aware Dependency with Transformer for Medical Image Segmentation