BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
2024-07-03
Abstract: We introduce Blink, a new benchmark for multimodal large language models (multimodal LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find that these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans achieve 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17 and 7.63 percentage points higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses a deficiency in the visual perception abilities of current multimodal large language models (multimodal LLMs). Humans can solve basic visual perception tasks, such as relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning, "within a blink". Existing multimodal LLMs perform poorly on these tasks because the tasks are hard to mediate through natural language. To evaluate these core perception abilities, which other evaluations often overlook, the paper introduces a new benchmark, Blink. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, each paired with a single image or multiple images and with visual prompting. The questions are designed so that they cannot be reduced to a text problem by densely captioning the image; answering them requires the model to actually perceive the image content. The study finds that even the best-performing models, GPT-4V and Gemini, reach only 51.26% and 45.72% accuracy on Blink, far below the human average of 95.70%, which indicates that these perception abilities have not yet emerged in multimodal LLMs. With this benchmark, the authors aim to expose the gap between multimodal LLMs and humans in visual perception and to point to directions for future research, in particular how the strengths of specialist computer vision models might be used to improve multimodal LLMs.
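The accuracy figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration, not code from the paper: it recovers the random-guessing baseline implied by the reported margins and computes each model's gap to human accuracy. The function names are hypothetical. The `random_guess_accuracy` helper and its toy option counts are assumptions, included only to show why the chance level of a benchmark that mixes questions with different numbers of options need not be 25%.

```python
# Figures quoted in the abstract, in percent accuracy.
HUMAN_ACCURACY = 95.70
REPORTED = {
    "GPT-4V": {"accuracy": 51.26, "margin_over_random": 13.17},
    "Gemini": {"accuracy": 45.72, "margin_over_random": 7.63},
}


def implied_random_baseline(accuracy, margin_over_random):
    """Random-guess accuracy implied by a reported accuracy and margin."""
    return round(accuracy - margin_over_random, 2)


def gap_to_human(accuracy, human=HUMAN_ACCURACY):
    """Percentage points separating a model from average human accuracy."""
    return round(human - accuracy, 2)


def random_guess_accuracy(option_counts):
    """Expected accuracy (percent) of uniform guessing.

    option_counts holds the number of answer options for each question.
    This helper is an assumption for illustration, not the paper's code.
    """
    return 100.0 * sum(1.0 / k for k in option_counts) / len(option_counts)


baselines = {
    name: implied_random_baseline(v["accuracy"], v["margin_over_random"])
    for name, v in REPORTED.items()
}
gaps = {name: gap_to_human(v["accuracy"]) for name, v in REPORTED.items()}

# Toy option counts (NOT taken from the paper): two 2-option questions
# and two 4-option questions.
toy_chance = random_guess_accuracy([2, 2, 4, 4])

if __name__ == "__main__":
    for name in REPORTED:
        print(f"{name}: implied random baseline {baselines[name]:.2f}%, "
              f"gap to human {gaps[name]:.2f} points")
    print(f"Toy mixed-option chance level: {toy_chance:.2f}%")
```

Both reported margins imply the same random-guessing baseline of 38.09%, so the two figures in the abstract are mutually consistent. A baseline above 25% is what one would expect if some questions offer fewer than four options, as the toy example suggests.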