Abstract: Advances in GPT-based large language models (LLMs) are revolutionizing natural language processing and rapidly increasing its use across various domains. Incorporating uni-directional attention, these autoregressive LLMs can generate long and coherent paragraphs. However, for visual question answering (VQA) tasks that require both vision and language processing, models with bi-directional attention or fusion techniques are often used to capture the context of multiple modalities all at once. As GPT does not natively process vision tokens, to exploit the advancements in GPT models for VQA in robotic surgery, we design an end-to-end trainable Language-Vision GPT (LV-GPT) model that expands the GPT2 model to include vision input (image). The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and vision token embedding (token type and pose). Given the limitations of uni-directional attention in GPT models and their ability to generate coherent long paragraphs, we carefully sequence the word tokens before vision tokens, mimicking the human thought process of understanding the question to infer an answer from an image. Quantitatively, we show that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets (based on endoscopic vision challenge robotic scene segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset (based on the holistic surgical scene dataset). We further annotate all three datasets to include question-type annotations to allow sub-type analysis. Furthermore, we extensively study and present the effects of token sequencing, token type and pose embedding for vision tokens in the LV-GPT model.

What problem does this paper attempt to address?

This paper addresses Visual Question Answering (VQA) in surgical scenarios. Specifically, the authors aim to develop an end-to-end trainable multi-modal Language-Vision GPT (LV-GPT) that improves VQA performance on surgical scenes. Although existing unimodal language models (such as GPT) perform excellently in natural language processing, they cannot directly infer answers from medical images. This research therefore enables the GPT model to process text and image data together by introducing vision tokens and choosing the token sequencing carefully, which yields better results on surgical VQA tasks.

### Main Contributions

1. **Model Design**:
   - Designed an end-to-end trainable multi-modal LV-GPT model, which extends the GPT2 model to handle visual inputs (images).
   - Introduced a feature extractor (vision tokenizer) and vision token embeddings (token type and pose).
2. **Token Sequencing**:
   - Placing word tokens before vision tokens, imitating how humans first understand the question and then infer the answer from the image, improves the performance of the model.
3. **Experimental Verification**:
   - Conducted extensive experiments on three surgical VQA datasets (the publicly available EndoVis18-VQA and Cholec80-VQA, and the newly annotated PSI-AVA-VQA), showing that the LV-GPT model outperforms other state-of-the-art VQA models on these datasets.
4. **New Dataset**:
   - Introduced a new surgical VQA dataset, PSI-AVA-VQA, by adding VQA annotations to the existing holistic surgical scene dataset, which further verifies the effectiveness of the model.
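
The token sequencing idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `build_layout` and `count_vision_tokens_seeing_question`, the type ids 0 and 1, and the plain sequential position ids are all assumptions made for illustration (the paper itself studies several pose embedding choices for vision tokens). The sketch only shows one consequence of uni-directional (causal) attention, where each token attends to itself and earlier tokens: when word tokens come first, every vision token has the whole question in its context.

```python
from dataclasses import dataclass
from typing import List

WORD_TYPE = 0    # assumed token type id for question (word) tokens
VISION_TYPE = 1  # assumed token type id for vision tokens


@dataclass
class SequenceLayout:
    token_type_ids: List[int]  # one type id per token in the joint sequence
    position_ids: List[int]    # one position id per token (sequential here, an assumption)


def build_layout(num_word_tokens: int, num_vision_tokens: int,
                 words_first: bool = True) -> SequenceLayout:
    """Lay out word and vision tokens in a single sequence (hypothetical helper)."""
    word = [WORD_TYPE] * num_word_tokens
    vision = [VISION_TYPE] * num_vision_tokens
    token_type_ids = word + vision if words_first else vision + word
    position_ids = list(range(len(token_type_ids)))
    return SequenceLayout(token_type_ids, position_ids)


def count_vision_tokens_seeing_question(layout: SequenceLayout) -> int:
    """Count vision tokens whose causal context (positions 0..i) contains every word token."""
    word_positions = [i for i, t in enumerate(layout.token_type_ids) if t == WORD_TYPE]
    last_word = max(word_positions, default=-1)
    return sum(
        1
        for i, t in enumerate(layout.token_type_ids)
        if t == VISION_TYPE and i > last_word
    )


if __name__ == "__main__":
    words_first = build_layout(num_word_tokens=4, num_vision_tokens=3, words_first=True)
    vision_first = build_layout(num_word_tokens=4, num_vision_tokens=3, words_first=False)
    print(words_first.token_type_ids)
    print(count_vision_tokens_seeing_question(words_first))
    print(count_vision_tokens_seeing_question(vision_first))
```

In an actual model, the type and position ids would index learned embedding tables whose outputs are added to the word and vision token features before the GPT2 blocks. That step is omitted here to keep the sketch dependency-free.
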

### Experimental Results

- **Quantitative Analysis**: The accuracy (Acc), recall (Recall), and F-score (FScore) of the LV-GPT model on the EndoVis18-VQA and Cholec80-VQA datasets each improve by approximately 3-5%.
- **Qualitative Analysis**: Comparisons with existing models (such as VisualBERT, VisualBERT RM, and Block) demonstrate the superior performance of the LV-GPT model across different types of surgical scenarios.

### Conclusion

By designing the end-to-end trainable LV-GPT model and choosing the token sequencing carefully, this research improves the performance of visual question answering in surgical scenarios. The new dataset and the detailed experimental analysis further support the effectiveness and robustness of the model. Future work can explore applying this model to a wider range of medical image and video processing tasks.
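
For readers unfamiliar with the three reported metrics, the sketch below shows one common way to compute accuracy, recall, and F-score when VQA is treated as classification over answer labels. Macro-averaging over classes is an assumption here, as the summary does not state how the paper averages. The function name `classification_metrics` and the toy labels are hypothetical and are not results from the paper.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged recall and F-score over all observed labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    recalls, f_scores = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        recalls.append(recall)
        f_scores.append(f_score)

    return {
        "accuracy": accuracy,
        "recall": sum(recalls) / len(labels),
        "f_score": sum(f_scores) / len(labels),
    }


if __name__ == "__main__":
    # Toy answer labels, invented purely for illustration.
    truth = ["a", "a", "b", "b"]
    preds = ["a", "b", "b", "b"]
    print(classification_metrics(truth, preds))
```
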

SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery

Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery

Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer

CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery

Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery

GP-VLS: A general-purpose vision language model for surgery

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery

Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective

Stratified Evaluation of GPT's Question Answering in Surgery Reveals Artificial Intelligence (AI) Knowledge Gaps

Dual modality prompt learning for visual question-grounded answering in robotic surgery

Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models

Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day