Abstract: Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite their great success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of $8$ representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former evaluates $6$ categories of multimodal capabilities of LVLMs, such as visual question answering and embodied artificial intelligence, on $47$ standard text-related visual benchmarks, while the latter provides user-level evaluation of LVLMs in an open-world question-answering scenario. The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in the open-world scenario. Second, instruction-tuned LVLMs with moderate instruction-following data may suffer from object hallucination (i.e., they generate objects in their descriptions that are inconsistent with the target images). This either renders current evaluation metrics, such as CIDEr for image captioning, ineffective or produces wrong answers. Third, employing a multi-turn reasoning evaluation framework can mitigate the issue of object hallucination, shedding light on developing an effective pipeline for LVLM evaluation. The findings provide a foundational framework for the conception and assessment of innovative strategies aimed at enhancing zero-shot multimodal techniques. Our LVLM-eHub will be available at https://github.com/OpenGVLab/Multi-Modality-Arena

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

DeepSeek-VL: Towards Real-World Vision-Language Understanding

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Qwen Technical Report

CogVLM: Visual Expert for Large Language Models

Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

HumanVLM: Foundation for Human-Scene Vision-Language Model

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

Small Language Model Meets with Reinforced Vision Vocabulary

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

Unified Vision-Language Pre-Training for Image Captioning and VQA

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences