Abstract: Surgery requires comprehensive medical knowledge, visual assessment skills, and procedural expertise. While recent surgical AI models have focused on solving task-specific problems, there is a need for general-purpose systems that can understand surgical scenes and interact through natural language. This paper introduces GP-VLS, a general-purpose vision language model for surgery that integrates medical and surgical knowledge with visual scene understanding. To comprehensively evaluate general-purpose surgical models, we propose SurgiQual, which evaluates across medical and surgical knowledge benchmarks as well as surgical vision-language questions. To train GP-VLS, we develop six new datasets spanning medical knowledge, surgical textbooks, and vision-language pairs for tasks like phase recognition and tool identification. We show that GP-VLS significantly outperforms existing open- and closed-source models on surgical vision-language tasks, with 8-21% improvements in accuracy across SurgiQual benchmarks. GP-VLS also demonstrates strong performance on medical and surgical knowledge tests compared to open-source alternatives. Overall, GP-VLS provides an open-source foundation for developing AI assistants to support surgeons across a wide range of tasks and scenarios. The code and data for this work are publicly available at http://gpvls-surgery-vlm.github.io.

What problem does this paper attempt to address?

The paper aims to develop GP-VLS, a general-purpose vision-language model that can understand surgical scenes and interact with clinicians through natural language. Specifically, it sets out to:

1. **Create a general-purpose surgical vision-language model**: Most existing surgical AI models target specific tasks, so no single system understands surgical scenes across a broad range of tasks and scenarios. GP-VLS aims to fill this gap with a general-purpose model that combines medical and surgical knowledge with visual scene understanding.
2. **Evaluate general-purpose surgical models**: To evaluate such models comprehensively, the authors propose a new benchmark, SurgiQual. It covers medical and surgical knowledge benchmarks as well as surgical vision-language questions.
3. **Develop new training datasets**: To train GP-VLS, the authors develop six new datasets. These span medical knowledge, surgical textbooks, and vision-language pairs for tasks such as phase recognition and tool identification, giving the model the training material it needs to understand complex surgical scenes.

### Main contributions

1. **Open-source general-purpose surgical vision-language model (GP-VLS)**: The model understands basic medical and surgical concepts and also handles complex vision-language questions.
2. **Comprehensive evaluation benchmark (SurgiQual)**: Used to evaluate surgical vision-language models on medical and surgical knowledge and on visual scene understanding.
3. **Six new surgical training datasets**: Including five vision-language datasets and one dataset from surgical textbooks, covering a wide range of surgical tasks.

### Solutions

By integrating medical and surgical knowledge with visual scene understanding, GP-VLS can support surgeons in several ways, from preoperative planning to intraoperative guidance to postoperative care. The model can also explain its reasoning process, which is crucial for ensuring that the technology enhances rather than replaces human expertise.

### Summary

GP-VLS is an important step in the development of general-purpose surgical AI assistants. By combining medical knowledge with specialized surgical and visual understanding, it lays the foundation for language-based surgical AI systems. Although challenges remain, the potential benefits of this model in surgical practice are substantial.

GP-VLS: A general-purpose vision language model for surgery

