Abstract: Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite this success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of $8$ representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former evaluates $6$ categories of multimodal capabilities of LVLMs, such as visual question answering and embodied artificial intelligence, on $47$ standard text-related visual benchmarks, while the latter provides a user-level evaluation of LVLMs in an open-world question-answering scenario. The study reveals several findings. First, an instruction-tuned LVLM trained with massive in-domain data, such as InstructBLIP, heavily overfits many existing tasks and generalizes poorly in the open-world scenario. Second, an instruction-tuned LVLM trained with moderate instruction-following data may suffer from object hallucination (i.e., its descriptions mention objects that are inconsistent with the target images). Such hallucination either renders current evaluation metrics, such as CIDEr for image captioning, ineffective or leads to wrong answers. Third, employing a multi-turn reasoning evaluation framework can mitigate the issue of object hallucination, shedding light on developing an effective pipeline for LVLM evaluation. The findings provide a foundational framework for the conception and assessment of innovative strategies aimed at enhancing zero-shot multimodal techniques. Our LVLM-eHub will be available at https://github.com/OpenGVLab/Multi-Modality-Arena

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

A Survey on Multimodal Large Language Models for Autonomous Driving

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Vision Language Models in Autonomous Driving: A Survey and Outlook

LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs

Vision Language Models in Autonomous Driving and Intelligent Transportation Systems

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Semantic Understanding of Traffic Scenes with Large Vision Language Models

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models