BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna

2024-07-03

Abstract: We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
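The figures in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the variable and function names are illustrative assumptions, not anything from the paper. Subtracting each reported margin from the corresponding accuracy gives the random-guessing baseline that the abstract implies but does not state, and subtracting each accuracy from the human score gives the remaining gap.

```python
# Percentages quoted in the abstract; names here are illustrative only.
REPORTED = {
    "GPT-4V": {"accuracy": 51.26, "margin_over_random": 13.17},
    "Gemini": {"accuracy": 45.72, "margin_over_random": 7.63},
}
HUMAN_ACCURACY = 95.70


def implied_random_baseline(accuracy: float, margin: float) -> float:
    """Random-guess accuracy implied by an accuracy and its margin over chance."""
    return round(accuracy - margin, 2)


# Baseline implied by each model's reported numbers (should agree across models).
baselines = {
    name: implied_random_baseline(v["accuracy"], v["margin_over_random"])
    for name, v in REPORTED.items()
}

# Remaining distance, in percentage points, between each model and humans.
human_gap = {
    name: round(HUMAN_ACCURACY - v["accuracy"], 2) for name, v in REPORTED.items()
}

for name in REPORTED:
    print(f"{name}: implied chance = {baselines[name]}%, gap to humans = {human_gap[name]} points")
```

Both models imply the same chance level of about 38.09%, so the reported margins are mutually consistent. The gap to human accuracy (roughly 44 and 50 points) is several times larger than either model's margin over chance.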

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses a deficiency in the visual perception abilities of current multimodal large language models (multimodal LLMs). Humans can solve basic visual perception tasks such as relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning "within a blink", yet existing multimodal LLMs perform poorly on them because the tasks resist mediation through natural language. To evaluate these core perception abilities, which other evaluations often overlook, the paper introduces a new benchmark, Blink. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, each paired with a single image or multiple images and with visual prompts. The questions are designed so that they cannot be reduced to a text problem through dense captioning; the model must actually understand the image content to answer. The study finds that even the most advanced models, such as GPT-4V and Gemini, perform far below the human level on Blink, indicating that these perception abilities have not yet fully developed in multimodal LLMs. With this benchmark, the authors aim to expose the gap between multimodal LLMs and humans in visual perception and to point to directions for future research, in particular combining the strengths of specialist vision models with multimodal LLMs to improve performance.
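As a rough illustration of how a multiple-choice benchmark of this kind could be scored, the sketch below computes accuracy and the expected accuracy of uniform random guessing. It is a minimal sketch under stated assumptions: the `Question` class, its fields, and the toy data are hypothetical and do not reflect Blink's actual data format or evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical multiple-choice item: option labels plus the correct label."""
    choices: list[str]
    answer: str


def accuracy(questions: list[Question], predictions: list[str]) -> float:
    """Percentage of questions whose predicted label matches the answer."""
    correct = sum(pred == q.answer for q, pred in zip(questions, predictions))
    return 100.0 * correct / len(questions)


def chance_accuracy(questions: list[Question]) -> float:
    """Expected accuracy (%) of guessing uniformly among each question's options."""
    return 100.0 * sum(1.0 / len(q.choices) for q in questions) / len(questions)


# Toy data (made up): two 2-option questions and two 4-option questions.
toy_questions = [
    Question(choices=["A", "B"], answer="A"),
    Question(choices=["A", "B"], answer="B"),
    Question(choices=["A", "B", "C", "D"], answer="C"),
    Question(choices=["A", "B", "C", "D"], answer="D"),
]
toy_predictions = ["A", "B", "C", "A"]  # three of four correct

model_acc = accuracy(toy_questions, toy_predictions)
chance_acc = chance_accuracy(toy_questions)
print(f"accuracy = {model_acc}%, chance = {chance_acc}%, margin = {model_acc - chance_acc} points")
```

Because the chance level depends on how many options each question offers, a benchmark that mixes option counts has a baseline that is neither 25% nor 50%, which is why a margin over random guessing is more informative than raw accuracy alone.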
