LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation

Suhyeon Lee, Won Jun Kim, Jinho Chang, Jong Chul Ye
2024-03-18
Abstract: Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual I/O. This direction of research is particularly relevant to medical imaging because medical image analysis and generation consist of reasoning based on a combination of visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing networks and LLMs, but presumably, in order to achieve the maximum reasoning potential of LLMs on visual information as well, visual and language features should be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) require not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work on the transformer and VQ-GAN combination for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM's existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text-generation and text-based image-generation. We show that our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks while being smaller in size compared to previously developed models that perform a narrower range of tasks.
The code is at https://github.com/hyn2028/llm-cxr.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to improve the multimodal capabilities of large language models (LLMs) for understanding and generating chest X-ray (CXR) images. Specifically, the researchers propose LLM-CXR, a method that uses instruction-finetuning to enable an LLM pre-trained only on text to understand and generate visual information, in particular medical images.

The main objectives of LLM-CXR include:

1. **Enhancing the fusion of visual and language features**: Allowing image and text features to interact more freely, in order to improve the understanding and generation of medical images such as chest X-rays.
2. **Avoiding catastrophic forgetting**: Preserving the model's original language understanding and reasoning abilities while adding visual processing capabilities.
3. **Achieving tighter modality mapping**: Ensuring a closer mapping between text and images, which matters especially in the medical field, where precise description and diagnosis are required.

To achieve these goals, the authors took the following key steps:

- Using VQ-GAN to encode images, converting them into tokens similar to text tokens so that the LLM can process them more easily.
- Extending the token embedding space of the LLM to accommodate image tokens without losing its language processing capabilities.
- Adopting an instruction-finetuning approach that uses diverse tasks to teach the LLM to handle image inputs and generate corresponding outputs.
- Augmenting the training data with synthetic visual question answering (VQA) to further improve the model's multimodal understanding.

Experimental results show that LLM-CXR performs well on CXR-to-report generation, CXR-based visual question answering, and report-to-CXR generation, achieving better image-text alignment than previously developed models while being smaller in size.
This indicates that the proposed LLM-CXR method effectively enhances the multimodal capabilities of LLMs in handling medical images.
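
The first two key steps described above (turning images into discrete tokens and extending the LLM's token embedding space) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary size, codebook size, initialisation scale, and all function names below are hypothetical, and the VQ-GAN encoder itself is assumed to exist elsewhere and to output a list of integer codebook indices.

```python
import random

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
TEXT_VOCAB_SIZE = 50_000     # tokens in the text-only vocabulary
IMAGE_CODEBOOK_SIZE = 1024   # entries in the VQ-GAN codebook


def image_codes_to_token_ids(codes, text_vocab_size=TEXT_VOCAB_SIZE):
    """Shift VQ codebook indices past the text vocabulary so each gets a unique ID."""
    return [text_vocab_size + c for c in codes]


def token_ids_to_image_codes(token_ids, text_vocab_size=TEXT_VOCAB_SIZE):
    """Recover codebook indices from the image-token IDs in a sequence."""
    return [t - text_vocab_size for t in token_ids if t >= text_vocab_size]


def extend_embedding_table(table, num_new_tokens, seed=0):
    """Append randomly initialised rows for image tokens.

    Existing text-token rows are kept unchanged, so the text embeddings
    the model already learned are not overwritten.
    """
    rng = random.Random(seed)
    dim = len(table[0])
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(num_new_tokens)]
    return table + new_rows


def build_example(instruction_ids, image_codes, response_ids):
    """Concatenate instruction tokens, image tokens, and response tokens."""
    return instruction_ids + image_codes_to_token_ids(image_codes) + response_ids
```

Under these assumptions, an image becomes just another run of token IDs inside the instruction sequence, and the same offset mapping can be inverted to hand generated image tokens back to a VQ-GAN decoder. In a real model, the embedding extension would be applied to the model's embedding and output layers rather than to a plain Python list.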