SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, Yong Man Ro
2024-10-11
Abstract: Large-scale Vision-Language Models (LVLMs) have significantly advanced with text-aligned vision inputs. They have made remarkable progress in computer vision tasks by aligning text modality with vision inputs. There are also endeavors to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs view images taken from multi-vision sensors as if they were in the same RGB domain without considering the physical characteristics of multi-vision sensors. They fail to convey the fundamental multi-vision sensor information from the dataset and the corresponding contextual knowledge properly. Consequently, alignment between the information from the actual physical environment and the text is not achieved correctly, making it difficult to answer complex sensor-related questions that consider the physical environment. In this paper, we aim to establish a multi-vision Sensor Perception And Reasoning benchmarK called SPARK that can reduce the fundamental multi-vision sensor information gap between images and multi-vision sensors. We generated 6,248 vision-language test samples to investigate multi-vision sensory perception and multi-vision sensory reasoning on physical sensor knowledge proficiency across different formats, covering different types of sensor-related questions. We utilized these samples to assess ten leading LVLMs. The results showed that most models displayed deficiencies in multi-vision sensory reasoning to varying extents. Codes and data are available at https://github.com/top-yun/SPARK
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a fundamental flaw in how current large-scale vision-language models (LVLMs) handle multi-vision sensor data. These models often treat images from different sensors (such as thermal, depth, and medical X-ray images) as if they belonged to the same domain as RGB images, without considering the physical characteristics of each sensor. As a result, the models fail to convey the fundamental sensor information and the corresponding contextual knowledge, which makes it difficult for them to answer complex sensor-related questions that involve the physical environment.

To tackle this challenge, the authors propose a new benchmark, SPARK, for evaluating LVLMs on multi-vision sensory perception and multi-vision sensory reasoning. SPARK consists of 6,248 vision-language test samples covering different types of sensor-related questions in different formats, and it is used to evaluate 10 leading LVLMs. The experimental results show that most models are deficient in multi-vision sensory reasoning to varying degrees.

In summary, the main contributions of this paper are:
1. Revealing the limitations of current LVLMs in handling multi-vision sensor data, particularly their lack of understanding of the sensors' physical characteristics.
2. Proposing a new benchmark, SPARK, for rigorously testing and evaluating the ability of LVLMs to understand and reason about multi-vision sensor data.
3. Using the SPARK benchmark to evaluate 10 state-of-the-art LVLMs on their fundamental knowledge of multi-vision sensors.