METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens

Zhanyu Wang, Lingqiao Liu, Lei Wang, Luping Zhou
2023-04-05
Abstract: In clinical scenarios, multi-specialist consultation could significantly benefit the diagnosis, especially for intricate cases. This inspires us to explore a "multi-expert joint diagnosis" mechanism to upgrade the existing "single expert" framework commonly seen in the current literature. To this end, we propose METransformer, a method to realize this idea with a transformer-based backbone. The key design of our method is the introduction of multiple learnable "expert" tokens into both the transformer encoder and decoder. In the encoder, each expert token interacts with both vision tokens and other expert tokens to learn to attend different image regions for image representation. These expert tokens are encouraged to capture complementary information by an orthogonal loss that minimizes their overlap. In the decoder, each attended expert token guides the cross-attention between input words and visual tokens, thus influencing the generated report. A metrics-based expert voting strategy is further developed to generate the final report. By the multi-experts concept, our model enjoys the merits of an ensemble-based approach but through a manner that is computationally more efficient and supports more sophisticated interactions among experts. Experimental results demonstrate the promising performance of our proposed model on two widely used benchmarks. Last but not least, the framework-level innovation makes our work ready to incorporate advances on existing "single-expert" models to further improve its performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the issue of automatic generation of medical imaging reports. Specifically, it focuses on the following aspects:

1. **Multi-Expert Joint Diagnosis Mechanism**:
   - Existing automated medical imaging report generation methods are mostly based on a "single expert" framework, where a single model generates the report. However, in clinical scenarios, multi-expert consultations are particularly beneficial for complex cases. Therefore, the authors propose a new framework, METransformer, which introduces multiple learnable "expert tokens" to mimic the process of multi-expert joint diagnosis.
2. **Improving Report Quality**:
   - Through the multi-expert mechanism, each expert token focuses on different regions of the image, and these tokens interact with each other to capture complementary information. This design helps to improve the quality and accuracy of the generated reports.
3. **Efficiency and Interactivity**:
   - Despite leveraging the advantages of ensemble methods, the design of METransformer is more computationally efficient and supports more complex interactions between experts. This allows the model to achieve better performance improvements without significantly increasing the number of parameters.
4. **Experimental Validation**:
   - The paper conducts experimental validation on two widely used benchmark datasets (IU-Xray and MIMIC-CXR), demonstrating the superior performance of METransformer in both natural language generation (NLG) metrics and clinical efficacy metrics.

In summary, this paper aims to improve the performance of existing medical imaging report generation models through a multi-expert joint diagnosis mechanism, thereby enhancing the quality and accuracy of the generated reports.
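
To make the two mechanisms named in the abstract more concrete, here is a minimal Python sketch of an orthogonality penalty over expert token embeddings and of a voting step over the experts' candidate reports. It is an illustration under stated assumptions, not the authors' implementation: the function names (`orthogonality_penalty`, `unigram_f1`, `vote`) are hypothetical, the penalty is one common way to discourage overlap (mean squared pairwise cosine similarity), and the unigram F1 score is only a stand-in for whatever text metric the paper's voting strategy actually uses.

```python
import math
from collections import Counter


def orthogonality_penalty(tokens):
    """Mean squared cosine similarity over all pairs of distinct expert vectors.

    0.0 means the experts are mutually orthogonal, 1.0 means they are parallel.
    Assumed formulation for illustration; the paper's exact loss may differ.
    """
    m = len(tokens)
    if m < 2:
        return 0.0
    unit = []
    for vec in tokens:
        # Fall back to 1.0 so an all-zero vector does not divide by zero.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        unit.append([x / norm for x in vec])
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                cos = sum(a * b for a, b in zip(unit[i], unit[j]))
                total += cos * cos
    return total / (m * (m - 1))


def unigram_f1(candidate, reference):
    """Word-overlap F1 between two texts (a stand-in metric, not the paper's)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def vote(reports, metric=unigram_f1):
    """Return the index of the report that agrees most with the other reports.

    Each candidate is scored against every other candidate, treating the
    others as references, and the highest average score wins.
    """
    if len(reports) < 2:
        return 0
    scores = []
    for i, cand in enumerate(reports):
        others = [r for j, r in enumerate(reports) if j != i]
        scores.append(sum(metric(cand, r) for r in others) / len(others))
    return max(range(len(reports)), key=scores.__getitem__)
```

In this sketch, the penalty would be added to the training objective so that expert tokens drift toward distinct directions, while `vote` runs only at inference time: given one candidate report per expert, it returns the report most consistent with the rest, which is the intuition behind reaching a consensus among experts without training separate models.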