CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare

Akash Ghosh, Arkadeep Acharya, Raghav Jain, Sriparna Saha, Aman Chadha, Setu Sinha
2023-12-16
Abstract: In modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution of our work, pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework that harnesses Contrastive Language-Image Pretraining (CLIP), a multimodal foundation model, and various general-purpose Large Language Models (LLMs). It comprises four main modules: a medical disorder identification module, a relevant context generation module, a context filtration module that distills relevant medical concepts and knowledge, and a general-purpose LLM that generates visually aware medical question summaries. Using the MMQS dataset, we show how visual cues from images enhance the generation of medically nuanced summaries. This multimodal approach not only enhances decision-making in healthcare but also fosters a more nuanced understanding of patient queries, laying the groundwork for future research in personalized and responsive medical care.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper aims to address the issue of patient question summarization in the medical field, especially when it is crucial to quickly and accurately understand patient needs in the face of numerous patient inquiries in modern healthcare systems. Current research mainly focuses on text-based summarization, neglecting the integration of visual information. The paper proposes a Multimodal Medical Question Summarization (MMQS) dataset and introduces a new framework called CLIPSyntel.

### Main Contributions Include:

1. **New Task**: Proposes a new task of generating medical question summaries, enhancing the accuracy of summaries using both image and text information.
2. **New Dataset**: Creates a multimodal medical question summarization dataset (MMQS Dataset) that includes both text and images.
3. **New Metric**: Proposes a new metric, MMFCM, to quantify the model's ability to capture multimodal information when generating summaries.
4. **New Framework**: Designs the CLIPSyntel framework, which combines Contrastive Language-Image Pre-training (CLIP) and Large Language Models (LLMs) to generate the final medical summary through four modules:
   - Medical Disease Identification Module
   - Relevant Context Generation Module
   - Context Filtering Module
   - Summary Generation Module

### Experimental Results:

The paper validates the effectiveness of CLIPSyntel through various automatic evaluation metrics (such as ROUGE, BLEU, BERTScore) and human evaluation metrics (such as clinical evaluation score, factual recall rate, omission rate, and MMFCM score). Experimental results show that CLIPSyntel outperforms baseline models under various settings.
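
The four-module flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that disorder identification works like CLIP-style zero-shot matching (picking the disorder label whose text embedding is closest to the image embedding), and it stands in plain lists of floats for real CLIP embeddings. Every function name, the prompt wording, the keyword-overlap filter, and `stub_llm` are hypothetical placeholders chosen for illustration.

```python
import math
import re
from typing import Callable, Dict, List, Sequence

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_disorder(image_emb: Sequence[float],
                      label_embs: Dict[str, Sequence[float]]) -> str:
    """Module 1 (assumed zero-shot matching): pick the closest disorder label."""
    return max(label_embs, key=lambda name: cosine(image_emb, label_embs[name]))


def generate_context(disorder: str, llm: LLM) -> List[str]:
    """Module 2: ask an LLM for background facts, one per line."""
    reply = llm(f"List key medical facts about {disorder}, one per line.")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def filter_context(lines: List[str], disorder: str) -> List[str]:
    """Module 3 (hypothetical filter): keep lines that mention the disorder."""
    keywords = set(re.findall(r"[a-z]+", disorder.lower()))
    return [ln for ln in lines
            if keywords & set(re.findall(r"[a-z]+", ln.lower()))]


def summarize(question: str, disorder: str, context: List[str], llm: LLM) -> str:
    """Module 4: prompt an LLM with the question plus the filtered context."""
    prompt = (f"Question: {question}\nDisorder: {disorder}\n"
              f"Context: {'; '.join(context)}\n"
              "Summarize the question in one sentence.")
    return llm(prompt)


def clip_syntel_sketch(question: str, image_emb: Sequence[float],
                       label_embs: Dict[str, Sequence[float]], llm: LLM) -> dict:
    """Chain the four modules and return each intermediate result."""
    disorder = identify_disorder(image_emb, label_embs)
    context = filter_context(generate_context(disorder, llm), disorder)
    return {"disorder": disorder, "context": context,
            "summary": summarize(question, disorder, context, llm)}


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    if prompt.startswith("List"):
        return "Skin rash may itch\nUnrelated trivia"
    return "summary for " + prompt.splitlines()[1]
```

In practice, the toy embeddings would be replaced by outputs of a CLIP image and text encoder, and `stub_llm` by a call to whichever general-purpose LLM is in use. The keyword filter is only a placeholder for the paper's context filtration module, whose actual mechanism is not described in this summary.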