ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation

Siyou Li, Beining Xu, Yihao Luo, Dong Nie, Le Zhang
2024-10-11
Abstract: Automatic medical report generation (MRG), which aims to produce detailed text reports from medical images, has emerged as a critical task in medical imaging. MRG systems can enhance radiological workflows by reducing the time and effort required for report writing, thereby improving diagnostic efficiency. In this work, we present a novel approach for automatic MRG utilizing a multimodal large language model. Specifically, we employ the 3D Vision Transformer (ViT3D) image encoder introduced in M3D-CLIP to process 3D scans and use Asclepius-Llama3-8B as the language model to generate the text reports by auto-regressive decoding. Experiments show that our model achieves an average Green score of 0.3 on the MRG task validation set and an average accuracy of 0.61 on the visual question answering (VQA) task validation set, outperforming the baseline model. Our approach demonstrates that aligning ViT3D with LLaMA3 is effective for automatic MRG and VQA tasks, even when the model is tuned on a small dataset.
Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses automatic medical report generation (MRG): producing detailed text reports directly from medical images. The authors propose a multimodal large language model that pairs a 3D Vision Transformer (ViT3D) image encoder, which processes three-dimensional scans, with the Asclepius-Llama3-8B language model, which generates the report text by autoregressive decoding. The model is tuned on a small dataset, and the same model is also evaluated on visual question answering (VQA). On the validation sets, it achieves an average Green score of 0.3 on the MRG task and an average accuracy of 0.61 on the VQA task, outperforming the baseline model on both. The authors take these results as evidence that aligning ViT3D with LLaMA3 is effective for automatic MRG and VQA.
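The two mechanisms named in the summary, splitting a 3D scan into patches for a ViT-style encoder and generating text by autoregressive decoding, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `count_patch_tokens`, `greedy_decode`, and `toy_next_token` are hypothetical, the volume and patch shapes are example values, and the visual prefix is represented as integer token ids, whereas a real multimodal model would prepend projected image embeddings and call an actual language model.

```python
from typing import Callable, List, Sequence, Tuple


def count_patch_tokens(volume_shape: Tuple[int, int, int],
                       patch_shape: Tuple[int, int, int]) -> int:
    """Count the non-overlapping 3D patches (tokens) in a volume."""
    tokens = 1
    for size, patch in zip(volume_shape, patch_shape):
        if size % patch != 0:
            raise ValueError("each dimension must be divisible by its patch size")
        tokens *= size // patch
    return tokens


def greedy_decode(visual_prefix: Sequence[int],
                  prompt: Sequence[int],
                  next_token: Callable[[Sequence[int]], int],
                  eos_id: int,
                  max_new_tokens: int = 32) -> List[int]:
    """Autoregressive decoding: each new token is fed back as context."""
    # The image-derived prefix comes first, then the text prompt.
    context = list(visual_prefix) + list(prompt)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == eos_id:
            break  # stop at end-of-sequence without emitting it
        generated.append(token)
        context.append(token)
    return generated


def toy_next_token(context: Sequence[int]) -> int:
    """Hypothetical stand-in for a language model: counts up, capped at 5."""
    return min(context[-1] + 1, 5)


if __name__ == "__main__":
    # Example shapes only (depth, height, width); not values from the paper.
    print(count_patch_tokens((32, 256, 256), (4, 16, 16)))
    print(greedy_decode([0, 0], [1], toy_next_token, eos_id=5))
```

The sketch uses greedy selection only because it is the simplest form of autoregressive decoding; the abstract does not say which decoding strategy the authors used.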