SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery

Lalithkumar Seenivasan, Mobarakol Islam, Gokul Kannan, Hongliang Ren
2023-07-22
Abstract: Advances in GPT-based large language models (LLMs) are revolutionizing natural language processing, exponentially increasing its use across various domains. Incorporating uni-directional attention, these autoregressive LLMs can generate long and coherent paragraphs. However, for visual question answering (VQA) tasks that require both vision and language processing, models with bi-directional attention or models employing fusion techniques are often employed to capture the context of multiple modalities all at once. As GPT does not natively process vision tokens, to exploit the advancements in GPT models for VQA in robotic surgery, we design an end-to-end trainable Language-Vision GPT (LV-GPT) model that expands the GPT2 model to include vision input (image). The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and vision token embedding (token type and pose). Given the limitations of unidirectional attention in GPT models and their ability to generate coherent long paragraphs, we carefully sequence the word tokens before vision tokens, mimicking the human thought process of understanding the question to infer an answer from an image. Quantitatively, we prove that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets (based on endoscopic vision challenge robotic scene segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset (based on the holistic surgical scene dataset). We further annotate all three datasets to include question-type annotations to allow sub-type analysis. Furthermore, we extensively study and present the effects of token sequencing, token type and pose embedding for vision tokens in the LV-GPT model.
Computer Vision and Pattern Recognition, Artificial Intelligence, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses visual question answering (VQA) in surgical scenarios. Specifically, the authors develop an end-to-end trainable multimodal Language-Vision GPT (LV-GPT) to improve VQA performance on surgical scenes. Although existing unimodal language models such as GPT perform well in natural language processing, they cannot directly infer answers from medical images. The work therefore enables the GPT model to process text and image data together by introducing vision tokens and carefully choosing the token sequencing, which yields better results on surgical VQA tasks.

### Main Contributions

1. **Model Design**:
   - Designed an end-to-end trainable multimodal LV-GPT model that extends the GPT2 model to handle visual inputs (images).
   - Introduced a feature extractor (vision tokenizer) and vision token embeddings (token type and pose).
2. **Token Sequencing**:
   - Placed word tokens before vision tokens, imitating how a human first understands the question and then infers the answer from the image, which improves model performance.
3. **Experimental Verification**:
   - Conducted extensive experiments on two publicly available surgical VQA datasets (EndoVis18-VQA and Cholec80-VQA) and on the newly annotated PSI-AVA-VQA, showing that LV-GPT outperforms other state-of-the-art VQA models on these datasets.
4. **New Dataset**:
   - Introduced a new surgical VQA dataset, PSI-AVA-VQA, by adding VQA annotations to the existing holistic surgical scene dataset, which further verifies the effectiveness of the model.
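
The token sequencing idea can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names, the type ids `0`/`1`, and the `zero_vision_pos` option are hypothetical, and in the real model the vision tokens are feature vectors produced by the feature extractor, not strings. The sketch only shows why order matters under unidirectional (causal) attention: a token can attend to itself and to earlier tokens, so vision tokens see the whole question only when the words come first.

```python
from typing import List, Tuple

WORD_TYPE = 0    # hypothetical type id for question (word) tokens
VISION_TYPE = 1  # hypothetical type id for vision tokens


def sequence_tokens(
    word_tokens: List[str],
    vision_tokens: List[str],
    words_first: bool = True,
    zero_vision_pos: bool = False,
) -> Tuple[List[str], List[int], List[int]]:
    """Return (tokens, token_type_ids, position_ids) for one sample."""
    if words_first:
        tokens = word_tokens + vision_tokens
        type_ids = [WORD_TYPE] * len(word_tokens) + [VISION_TYPE] * len(vision_tokens)
    else:
        tokens = vision_tokens + word_tokens
        type_ids = [VISION_TYPE] * len(vision_tokens) + [WORD_TYPE] * len(word_tokens)

    # Running position index; optionally use position 0 for every vision token
    # (an illustrative variant, not a claim about the paper's final setting).
    position_ids = list(range(len(tokens)))
    if zero_vision_pos:
        position_ids = [0 if t == VISION_TYPE else p
                        for p, t in zip(position_ids, type_ids)]
    return tokens, type_ids, position_ids


def causal_mask(n: int) -> List[List[int]]:
    """mask[i][j] == 1 when token i may attend to token j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def words_visible_to_vision(type_ids: List[int]) -> List[int]:
    """For each vision token, count the word tokens the causal mask exposes."""
    mask = causal_mask(len(type_ids))
    counts = []
    for i, t in enumerate(type_ids):
        if t == VISION_TYPE:
            counts.append(sum(mask[i][j]
                              for j, tj in enumerate(type_ids)
                              if tj == WORD_TYPE))
    return counts


question = ["what", "is", "the", "tool", "doing"]
patches = ["v0", "v1", "v2"]  # stand-ins for extracted vision features

_, types_wf, _ = sequence_tokens(question, patches, words_first=True)
_, types_vf, _ = sequence_tokens(question, patches, words_first=False)

print(words_visible_to_vision(types_wf))  # every vision token sees all 5 words
print(words_visible_to_vision(types_vf))  # vision tokens see no question words
```

In a real model, the token type ids and position ids would index learned embedding tables whose outputs are added to the word and vision embeddings before the sequence enters the GPT2 blocks.
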
### Experimental Results

- **Quantitative Analysis**: The accuracy (Acc), recall (Recall), and F-score (FScore) of the LV-GPT model on the EndoVis18-VQA and Cholec80-VQA datasets each improve by approximately 3-5%.
- **Qualitative Analysis**: Comparison with existing models (such as VisualBert, VisualBert RM, and Block) demonstrates the superior performance of the LV-GPT model across different types of surgical scenarios.

### Conclusion

By designing the end-to-end trainable LV-GPT model and choosing the token sequencing carefully, this research improves the performance of visual question answering in surgical scenarios. The new dataset and the detailed experimental analysis further verify the effectiveness and robustness of the model. Future work can explore applying the model to a wider range of medical image and video processing tasks.