OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxin Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, Xiang Bai
2024-12-31
Abstract: Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a gap in existing Optical Character Recognition (OCR) benchmarks: they do not fully evaluate how Large Multimodal Models (LMMs) handle complex text tasks, especially challenging ones such as text localization, handwritten content extraction, and logical reasoning. Specifically:

1. **Limitations of Existing Benchmarks**:
   - Although existing benchmarks have demonstrated strong LMM performance in text recognition, the models' capabilities on harder tasks such as text localization, handwritten content extraction, and logical reasoning have not been fully explored.
   - Many existing text-centric datasets were originally designed for classical OCR models and lack the diversity and depth needed to evaluate LMMs.
2. **Need for a New Benchmark**:
   - A more comprehensive benchmark is required to evaluate the OCR capabilities of LMMs across varied settings, including scientific documents and natural scenes.
   - The new benchmark should cover more diverse task types, provide more complex contexts, and contain more data to support a comprehensive evaluation of LMMs.

To solve these problems, the paper proposes **OCRBench v2**, an improved benchmark for evaluating LMMs on visual text localization and reasoning. OCRBench v2 has the following characteristics:

- **Diverse Tasks**: It contains four times as many tasks as the previous multi-scene benchmark, OCRBench, and covers 23 specific subtasks.
- **Wide-ranging Scenarios**: It covers 31 different scenarios, including street scenes, receipts, formulas, and diagrams.
- **High-Quality Annotations**: It provides 10,000 human-verified question-answer pairs.
- **Comprehensive Evaluation Metrics**: It introduces six types of evaluation metrics to ensure a rigorous evaluation of LMM performance.

Through these improvements, OCRBench v2 aims to fill the gaps in existing benchmarks, reveal the possible limitations of LMMs in practical applications, and provide a more comprehensive evaluation framework for future research.