Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

Juseong Jin, Chang Wook Jeong
2024-10-13
Abstract: Conversation agents powered by large language models are revolutionizing the way we interact with visual data. Recently, large vision-language models (LVLMs) have been extensively studied for both images and videos. However, these studies typically focus on common scenarios. In this work, we introduce an LVLM specifically designed for surgical scenarios. We integrate visual representations of surgical images and videos into the language feature space. Consequently, we establish an LVLM model, Surgical-LLaVA, fine-tuned on instruction-following data of surgical scenarios. Our experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions. We conduct a quantitative evaluation on visual question-answering datasets for surgical scenarios. The results show superior performance compared to previous works, indicating the potential of our model to tackle more complex surgery scenarios.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to solve: existing large vision-language models (LVLMs) perform poorly on surgical scenes because they are usually trained on common scenes and are limited in their ability to understand and process surgical image and video data. Specifically:

1. **Limitations of existing models**:
   - General-domain vision-language models may act like laymen when faced with surgery-related questions, either avoiding them or giving incorrect or entirely fabricated answers.
   - Previous methods usually treated surgical visual question answering (VQA) as a classification task, which limited the flexibility of conversational generative AI in surgical applications.
2. **Unique requirements of surgical scenes**:
   - The large amount of visual data generated during surgery (static images and dynamic videos) requires a multimodal model that can understand visual and textual information simultaneously.
   - Surgical vision-text pairs differ significantly from ordinary web content, so a specially designed model is required to process them.

To solve these problems, the authors propose **Surgical-LLaVA**, a large vision-language model specifically designed for surgical scenes. By integrating the visual representations of surgical images and videos into the language feature space, Surgical-LLaVA demonstrates excellent multimodal conversation capabilities in surgical scenes and achieves better performance on visual question-answering datasets than existing models.

### Main contributions

1. **Proposing Surgical-LLaVA**: Combining the language understanding capabilities of large language models (LLMs) with pre-trained visual encoders to handle spatio-temporal representations of the surgical process.
2. **Constructing a high-quality surgical visual instruction dataset**: Generating high-quality surgical visual instruction pairs through a scalable and diverse annotation framework.
3. **Achieving performance superior to existing models**: Performing well on video reasoning and visual question-answering tasks in surgical scenes, indicating that the model has the potential to handle more complex surgical scenes.

### Core technologies of the solution

- **Multimodal instruction fine-tuning**: Using GPT-3.5 to generate diverse surgical multimodal instruction-following data and fine-tuning the model on it.
- **Joint contrastive learning**: Aligning data from different modalities through contrastive learning to improve the model's understanding of visual and language information.
- **Multi-round dialogue mechanism**: Allowing the model to maintain context across multiple dialogue rounds, so that it can better understand and respond to complex surgical scenes.

Through these innovations, Surgical-LLaVA aims to enhance multimodal interaction in surgical scenes and to support applications such as surgical training, decision support, and patient care.
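
The joint contrastive learning idea listed under the core technologies can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not specify the loss, so the function names, the toy embeddings, and the temperature value are all hypothetical. It only shows the general mechanism, in which matched visual-text pairs in a batch are pulled together and mismatched pairs are pushed apart.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """Generic InfoNCE-style loss over a batch of paired embeddings.

    visual[i] and text[i] are assumed to describe the same sample.
    The temperature is an illustrative value, not taken from the paper.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    # Similarity of every visual embedding with every text embedding.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-visual direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy 2-D embeddings (hypothetical): pairs line up in the first case only.
visual_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(visual_batch, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_contrastive_loss(visual_batch, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

With the toy inputs, the batch whose text embeddings match their visual partners yields a much lower loss than the batch with swapped pairs, which is the signal a training loop would minimize.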