Enhancing machine vision: the impact of a novel innovative technology on video question-answering

Songjian Dan, Wei Feng
DOI: https://doi.org/10.1007/s00500-023-09536-4
IF: 3.732
2024-01-19
Soft Computing
Abstract: The robot video question-answering system is an artificial intelligence application that integrates computer vision and natural language processing technologies. Recently, it has received widespread attention, especially with the rapid development of large language models (LLMs). The core technical challenge lies in the application of visual question answering (VQA). However, visual question answering currently faces several challenges. Firstly, the acquisition of human annotations is costly, and secondly, existing models require expensive retraining when replacing a particular module. We propose the VLM2LLM model, which significantly improves the performance of multimodal question-answering tasks by integrating visual-language matching and large-scale language models. Specifically, it overcomes the limitations of requiring massive computational resources for training and inference in previous models. Furthermore, it allows for the upgrading of our LLM version according to the latest research advancements and needs. The results demonstrate that the VLM2LLM model achieves the highest accuracy compared to other state-of-the-art models on three datasets: VQAv2, A-OKVQA, and OK-VQA. We hope that the VLM2LLM model can drive advancements in the field of robot video question-answering and provide innovative solutions for a wider range of application domains.
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?