Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation

Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang

2024-03-04

Abstract: The emergence of multi-modal deep learning models has had a significant impact on clinical applications over the last decade. However, the majority of models are limited to a single task, without considering that disease diagnosis is in fact a multi-task procedure. Here, we demonstrate a unified transformer model specifically designed for multi-modal clinical tasks by incorporating customized instruction tuning. We first compose a multi-task training dataset comprising 13.4 million instruction and ground-truth pairs (with approximately one million radiographs) for the customized tuning, involving both image- and pixel-level tasks. We can thus unify the various vision-intensive tasks in a single training framework with homogeneous model inputs and outputs to increase clinical interpretability in one reading. Finally, we demonstrate the overall superior performance of our model compared to prior art on various chest X-ray benchmarks across multiple tasks in both direct inference and finetuning settings. Three radiologists further evaluate the generated reports against the recorded ones, and their assessment also indicates the enhanced explainability of our multi-task model.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses multi-task processing in chest X-ray interpretation, with the goal of improving the accuracy and interpretability of computer-aided diagnosis. Specifically, the research team proposes a multimodal model named OmniFM-DR, which handles disease classification, localization, segmentation, and report generation through a unified Transformer architecture. The main objectives of the paper are:

1. **Unified Input-Output Format**: A customized instruction tuning framework converts the inputs and output labels of the different sub-tasks into one consistent format, so that all tasks can be trained jointly within a single framework.
2. **Improved Multi-Task Performance**: The model outperforms existing methods on multiple benchmark datasets in both direct inference and fine-tuning settings.
3. **Enhanced Interpretability**: Detailed disease attributes (such as size, location, and severity) make the generated X-ray reports more interpretable and therefore more practical for clinical use.
4. **Comprehensive Evaluation**: Radiologists compare the automatically generated reports with the recorded ones and find their quality comparable to, or even better than, that of the doctor-written reports, particularly in descriptive detail.
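The idea of a unified input-output format can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the `<loc_N>` token scheme, the 1000-bin coordinate quantization, and every function and class name below are assumptions chosen only to show how image-level and pixel-level labels could all be serialized as instruction and target text pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 1000  # assumption: coordinates are quantized into 1000 integer bins


@dataclass
class Sample:
    instruction: str  # text prompt paired with the radiograph
    target: str       # text the model would be trained to generate


def quantize(value: float, size: int, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate to an integer bin in [0, num_bins - 1]."""
    return min(num_bins - 1, max(0, int(value * num_bins / size)))


def loc_tokens(points: List[Tuple[float, float]], width: int, height: int) -> str:
    """Serialize (x, y) points as a space-separated string of location tokens."""
    bins: List[int] = []
    for x, y in points:
        bins += [quantize(x, width), quantize(y, height)]
    return " ".join(f"<loc_{b}>" for b in bins)


def classification_sample(diseases: List[str]) -> Sample:
    # Image-level task: the label set becomes a comma-separated string.
    target = ", ".join(sorted(diseases)) if diseases else "no finding"
    return Sample("Which diseases are present in the image?", target)


def localization_sample(disease: str, box: Tuple[float, float, float, float],
                        width: int, height: int) -> Sample:
    # Box given as (x0, y0, x1, y1) in pixels; emitted as four location tokens.
    x0, y0, x1, y1 = box
    return Sample(f"Where is {disease} located?",
                  loc_tokens([(x0, y0), (x1, y1)], width, height))


def segmentation_sample(structure: str, polygon: List[Tuple[float, float]],
                        width: int, height: int) -> Sample:
    # Pixel-level task: the mask is approximated here by its polygon outline.
    return Sample(f"Outline the {structure}.", loc_tokens(polygon, width, height))


def report_sample(report: str) -> Sample:
    # Report generation: the target is the report text with whitespace normalized.
    return Sample("Describe the findings in the image.", " ".join(report.split()))
```

Because every task yields the same `Sample` structure, the four kinds of sample can be mixed in one training batch and handled by a single sequence-to-sequence loss, which is the property the first objective above describes.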

A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics

A Multi-Task Transformer with Local-Global Feature Interaction and Multiple Tumoral Region Guidance for Breast Cancer Diagnosis

Research and implementation of multi-disease diagnosis on chest X-ray based on vision transformer

A Multi-Stage Framework for Joint Chest X-Ray Diagnosis and Visual Attention Prediction Using Deep Learning

Understanding transfer learning for chest radiograph clinical report generation with modified transformer architectures

MVC: A Multi-Task Vision Transformer Network for COVID-19 Diagnosis from Chest X-ray Images

UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification

METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens

Enhancing CT Image synthesis from multi-modal MRI data based on a multi-task neural network framework

Analyzing Transfer Learning of Vision Transformers for Interpreting Chest Radiography

Using Multi-Task Learning to Improve Diagnostic Performance of Convolutional Neural Networks

SynthEnsemble: A Fusion of CNN, Vision Transformer, and Hybrid Models for Multi-Label Chest X-Ray Classification

Label Correlation Transformer for Automated Chest X-ray Diagnosis with Reliable Interpretability

Chest L-Transformer: Local Features with Position Attention for Weakly Supervised Chest Radiograph Segmentation and Classification

CST: A Multitask Learning Framework for Colorectal Cancer Region Mining Based on Transformer

SwinCheX: Multi-label classification on chest X-ray images with transformers

Dual-Input Transformer: An End-to-End Model for Preoperative Assessment of Pathological Complete Response to Neoadjuvant Chemotherapy in Breast Cancer Ultrasonography

Interpretable CNN-Multilevel Attention Transformer for Rapid Recognition of Pneumonia from Chest X-Ray Images

CancerUniT: Towards a Single Unified Model for Effective Detection, Segmentation, and Diagnosis of Eight Major Cancers Using a Large Collection of CT Scans