Abstract: Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite their great success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of $8$ representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The former evaluates $6$ categories of multimodal capabilities of LVLMs, such as visual question answering and embodied artificial intelligence, on $47$ standard text-related visual benchmarks, while the latter provides a user-level evaluation of LVLMs in an open-world question-answering scenario. The study reveals several innovative findings. First, instruction-tuned LVLMs trained on massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in the open-world scenario. Second, instruction-tuned LVLMs trained on a moderate amount of instruction-following data may suffer from object hallucination (i.e., their descriptions mention objects that are inconsistent with the target images). This either renders current evaluation metrics, such as CIDEr for image captioning, ineffective or leads to wrong answers. Third, employing a multi-turn reasoning evaluation framework can mitigate the issue of object hallucination, shedding light on developing an effective pipeline for LVLM evaluation. The findings provide a foundational framework for the conception and assessment of innovative strategies aimed at enhancing zero-shot multimodal techniques. Our LVLM-eHub will be available at https://github.com/OpenGVLab/Multi-Modality-Arena

Evaluating the Representational Hub of Language and Vision Models

Visualizing and Understanding Neural Models in NLP

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models

Revealing Vision-Language Integration in the Brain with Multimodal Networks

The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities

HVLM: Exploring Human-Like Visual Cognition and Language-Memory Network for Visual Dialog

Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

The Narrow Gate: Localized Image-Text Communication in Vision-Language Models

Visual cognition in multimodal large language models

A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Towards Interpreting Visual Information Processing in Vision-Language Models

Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models

Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

Reading visually embodied meaning from the brain: Visually grounded computational models decode visual-object mental imagery induced by written text

EVLM: An Efficient Vision-Language Model for Visual Understanding

Modality-Agnostic fMRI Decoding of Vision and Language