Abstract: Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when encountering the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions. MID-M achieves performance comparable or superior to task-specific fine-tuned LMMs and other general-domain ones, without extensive domain-specific training or pre-training on multimodal data, and with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM development. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.


### What problems does this paper attempt to solve?

This paper aims to solve the challenges of multimodal data processing in radiology, especially the problem of maintaining model performance when dealing with low-quality or incomplete real-world medical data. Specifically:

1. **Complexity of multimodal data processing**: Traditional large multimodal models (LMMs) usually require a large amount of domain-specific data for pre-training and fine-tuning, which is both time-consuming and costly. In addition, these models depend heavily on high-quality data and perform poorly when faced with the variations and errors found in real-world data.
2. **Impact of low-quality data**: The quality and sensitivity of medical data make data collection and annotation challenging, especially in radiology image analysis, where there is a relatively high error rate (such as 3%-5%). Electronic health records (EHR) also have an error rate of about 9%-10%. These problems lower data quality, which in turn degrades model performance.
3. **Reducing dependence on large-scale pre-training and fine-tuning**: The paper proposes a new framework, MID-M, which uses the in-context learning capabilities of a general-domain large language model (LLM) to process multimodal data, without extensive pre-training or task-specific fine-tuning. This reduces resource requirements and also improves the robustness of the model on low-quality data.

### Main features of the MID-M framework

- **Using unimodal methods to handle multimodal tasks**: MID-M converts images into text descriptions, thereby reducing multimodal tasks to text-to-text tasks. This lowers computational requirements and makes image representations more intuitive and interpretable.
- **Robustness**: Experiments show that MID-M performs well on low-quality or incomplete medical data, and even outperforms models that have more parameters and were trained on domain-specific data.
- **Efficiency and economy**: By using a smaller language model, MID-M significantly reduces the demand for computational resources while maintaining performance, making it better suited to resource-limited environments.

### Conclusion

This research shows how a general-domain large language model can address multimodal challenges in the medical field, especially when data quality is low. The MID-M framework provides a sustainable and cost-effective solution with broad application prospects, especially in medical image analysis.
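The image-to-text idea described under the main features can be illustrated with a minimal sketch: each image is replaced by a text description, and the descriptions are assembled into a text-only few-shot prompt for a general-domain LLM. This is not the authors' implementation. The `Example` class, the `build_prompt` function, the prompt layout, and the `fake_describer` placeholder are all hypothetical names introduced here for illustration, and the sample findings are made up.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One in-context demonstration: an image path, its accompanying text, and the target."""
    image_path: str
    text: str
    target: str


def build_prompt(
    examples: Sequence[Example],
    query_image_path: str,
    query_text: str,
    describe_image: Callable[[str], str],
) -> str:
    """Replace each image with a text description and assemble a text-only few-shot prompt."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Image description: {describe_image(ex.image_path)}\n"
            f"Input: {ex.text}\n"
            f"Output: {ex.target}"
        )
    # The query follows the same layout, with the output left empty for the LLM to complete.
    blocks.append(
        f"Image description: {describe_image(query_image_path)}\n"
        f"Input: {query_text}\n"
        "Output:"
    )
    return "\n\n".join(blocks)


def fake_describer(path: str) -> str:
    """Hypothetical placeholder standing in for a real image-to-text model."""
    return f"[description of {path}]"


# Made-up demonstration data, for illustration only.
demo = [Example("cxr_001.png", "Findings: clear lungs.", "No acute disease.")]
prompt = build_prompt(demo, "cxr_002.png", "Findings: small left effusion.", fake_describer)
print(prompt)
```

In a real pipeline, `fake_describer` would be swapped for an actual image-description model and the resulting prompt would be sent to the LLM. Because everything downstream of the description step is plain text, the sketch needs no multimodal training.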

Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model
