Abstract: Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO. This direction of research is particularly relevant to medical imaging because medical image analysis and generation consist of reasoning based on a combination of visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing networks and LLMs; but presumably, in order to achieve maximum reasoning potential of LLMs on visual information as well, visual and language features should be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) require not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work on the transformer and VQ-GAN combination for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM's existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text-generation and text-based image-generation. We show that our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks while being smaller in size compared to previously developed models that perform a narrower range of tasks.
The code is available at https://github.com/hyn2028/llm-cxr.

What problem does this paper attempt to address?

The paper addresses how to improve the multimodal capabilities of large language models (LLMs) for understanding and generating chest X-ray (CXR) images. Specifically, the researchers propose LLM-CXR, a method that uses instruction-finetuning to enable an LLM pre-trained only on text to understand and generate visual information, in particular medical imaging data.

The main objectives of LLM-CXR include:

1. **Enhancing the fusion of visual and language features**: Letting image and text features interact more freely, to improve the understanding and generation of medical images such as chest X-rays.
2. **Avoiding catastrophic forgetting**: Retaining the original language understanding and reasoning abilities while adding visual processing capabilities.
3. **Achieving tighter modality mapping**: Ensuring a closer mapping between text and images, especially in the medical field, where precise description and diagnosis are required.

To achieve these goals, the authors took the following key steps:

- Using VQ-GAN to encode images into discrete tokens similar to text tokens, making them easier for LLMs to process.
- Extending the token embedding space of the LLM to accommodate image tokens without losing its language processing capabilities.
- Adopting an instruction-finetuning approach that uses diverse tasks to teach the LLM to handle image inputs and generate corresponding outputs.
- Augmenting the training data with synthetic visual question answering (VQA) to further improve the model's multimodal understanding.

Experimental results show that LLM-CXR performs well on CXR-to-report generation, CXR-based visual question answering, and report-to-CXR generation compared with existing models. This indicates that the proposed LLM-CXR method effectively enhances the multimodal capabilities of LLMs in handling medical images.
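The tokenization and vocabulary-extension steps listed above can be illustrated with a minimal sketch. It is not taken from the LLM-CXR code: the vocabulary size, codebook size, embedding width, and all function names are hypothetical, and the VQ-GAN encoder itself is assumed to have already produced a list of integer codebook indices. The sketch only shows the general idea of appending new embedding rows for image tokens while leaving the text rows untouched, and of shifting image codes so they never collide with text token ids.

```python
import random

# Toy sizes, chosen only for illustration (assumptions, not values from the paper).
TEXT_VOCAB_SIZE = 100      # hypothetical size of the original text vocabulary
IMAGE_CODEBOOK_SIZE = 16   # hypothetical number of VQ-GAN codebook entries
EMBED_DIM = 4              # hypothetical embedding width


def extend_embeddings(table, num_new, dim, seed=0):
    """Return a new table with num_new randomly initialised rows appended.

    The existing rows are reused unchanged, so the text embeddings the
    model already learned are preserved.
    """
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_new)]
    return table + new_rows


def image_codes_to_token_ids(codes, text_vocab_size, codebook_size):
    """Shift VQ codebook indices past the text vocabulary.

    Code c becomes token id text_vocab_size + c, so image tokens occupy
    the newly added id range and cannot collide with text token ids.
    """
    for c in codes:
        if not 0 <= c < codebook_size:
            raise ValueError(f"code {c} is outside the codebook")
    return [text_vocab_size + c for c in codes]


def build_sequence(instruction_ids, image_ids, response_ids):
    """Concatenate instruction, image, and response tokens into one sequence."""
    return instruction_ids + image_ids + response_ids


# Original (text-only) embedding table, filled with zeros for the toy example.
text_table = [[0.0] * EMBED_DIM for _ in range(TEXT_VOCAB_SIZE)]
full_table = extend_embeddings(text_table, IMAGE_CODEBOOK_SIZE, EMBED_DIM)

# Pretend a VQ-GAN encoder returned these codebook indices for one image.
image_ids = image_codes_to_token_ids([0, 5, 15], TEXT_VOCAB_SIZE, IMAGE_CODEBOOK_SIZE)
sequence = build_sequence([1, 2], image_ids, [3])
```

With image and text tokens sharing one id space, the same next-token objective can in principle cover both image-to-text and text-to-image tasks by changing which part of the sequence is the prompt and which is the target.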

LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation

CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images

M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

WoLF: Wide-scope Large Language Model Framework for CXR Understanding

Interpretable medical image Visual Question Answering via multi-modal relationship graph learning

Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation

CROME: Cross-Modal Adapters for Efficient Multimodal LLM

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex

CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue