Abstract: Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.

What problem does this paper attempt to address?

The problem this paper attempts to solve is that existing Optical Character Recognition (OCR) benchmarks do not fully evaluate the capabilities of Large Multimodal Models (LMMs) on complex text tasks, especially challenging ones such as text localization, handwritten content extraction, and logical reasoning. Specifically:

1. **Limitations of Existing Benchmarks**:
   - Although existing benchmarks have demonstrated the strong performance of LMMs in text recognition, their capabilities on tasks such as text localization, handwritten content extraction, and logical reasoning have not been fully explored.
   - Many existing text-centric datasets were originally designed for classical OCR models and lack the diversity and depth needed to evaluate LMMs.
2. **Need for a New Benchmark**:
   - A more comprehensive benchmark is required to evaluate the OCR capabilities of LMMs across multiple settings, including scientific documents, natural scenes, and other scenarios.
   - The new benchmark should cover more diverse task types, provide more complex contexts, and contain more data to ensure a comprehensive evaluation of LMMs.

To solve these problems, the paper proposes **OCRBench v2**, an improved benchmark for evaluating the performance of LMMs on visual text localization and reasoning. OCRBench v2 has the following characteristics:

- **Diverse Tasks**: It contains four times as many tasks as the previous multi-scene benchmark OCRBench and covers 23 specific subtasks.
- **Wide-ranging Scenarios**: It covers 31 different scenarios, including street scenes, receipts, formulas, and diagrams.
- **High-Quality Annotations**: It provides 10,000 human-verified question-answer pairs.
- **Comprehensive Evaluation Metrics**: It introduces six types of evaluation metrics to ensure a rigorous evaluation of LMM performance.

Through these improvements, OCRBench v2 aims to fill the gaps in existing benchmarks, reveal the possible limitations of LMMs in practical applications, and provide a more comprehensive evaluation framework for future research.

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

OmniBench: Towards The Future of Universal Omni-Language Models

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

A Survey on Benchmarks of Multimodal Large Language Models

MMR: Evaluating Reading Ability of Large Multimodal Models

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

MMBench: Is Your Multi-modal Model an All-around Player?

On the Hidden Mystery of OCR in Large Multimodal Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models