Abstract: Pre-training on large-scale, high-quality datasets is crucial for enhancing the reasoning capabilities of Large Language Models (LLMs), especially in specialized domains such as mathematics. Despite the recognized importance, the Multimodal LLMs (MLLMs) field currently lacks a comprehensive open-source pre-training dataset specifically designed for mathematical reasoning. To address this gap, we introduce InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. We provide a detailed overview of our data collection and processing pipeline. To demonstrate the robustness of InfiMM-WebMath-40B, we conducted evaluations in both text-only and multimodal settings. Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model, delivering results comparable to DeepSeekMath-1.3B, which uses 120 billion tokens for the same model size. Nevertheless, with the introduction of our multi-modal math pre-training dataset, our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math. We release our data at https://huggingface.co/datasets/Infi-MM/InfiMM-WebMath-40B.

What problem does this paper attempt to address?

This paper addresses the lack of high-quality, large-scale open-source pre-training datasets for mathematical reasoning in multimodal large language models (MLLMs). Large enterprises may possess proprietary large-scale pre-training datasets, but these are not public, so the open-source community lacks sufficient data to develop MLLMs with strong mathematical reasoning capabilities. Specifically, the paper makes three points:

1. **Enhancement of Mathematical Reasoning Ability**: Large-scale, high-quality datasets are crucial for enhancing the reasoning ability of large language models (LLMs), and of MLLMs in particular, especially in the field of mathematics.
2. **Importance of Multimodal Data**: Mathematical knowledge exists not only in text form but also in visual elements such as charts and geometric diagrams. Most existing open-source datasets contain only text and therefore cannot use these visual elements to enhance a model's mathematical reasoning ability.
3. **Limitations of Existing Datasets**: Some proprietary large-scale mathematical pre-training datasets exist, but they are not public, which restricts research and development in the open-source community. In addition, existing open-source datasets fall short in both scale and quality.

To solve these problems, the authors introduce InfiMM-WebMath-40B, a publicly available large-scale multimodal mathematical pre-training dataset. It contains 24 million web documents, 85 million image URLs, and approximately 40 billion text tokens, and aims to fill the gap in multimodal mathematical data in the open-source community and to advance multimodal large language models in mathematical reasoning.

By constructing and releasing InfiMM-WebMath-40B, the authors hope to provide a strong foundational resource for the open-source community, encourage further research and development, and improve the mathematical reasoning ability of multimodal large language models.
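
To put the quoted dataset figures in proportion, the short Python sketch below derives a few averages from them. The constants are the rounded totals stated in the abstract; the function name `scale_summary` and the derived per-page averages are illustrative assumptions and are not statistics reported by the paper.

```python
# Rounded totals as quoted in the abstract (not exact counts).
NUM_PAGES = 24_000_000                  # web pages / documents
NUM_IMAGE_URLS = 85_000_000             # associated image URLs
NUM_TEXT_TOKENS = 40_000_000_000        # text tokens in InfiMM-WebMath-40B
DEEPSEEKMATH_TOKENS = 120_000_000_000   # tokens the abstract attributes to DeepSeekMath-1.3B


def scale_summary() -> dict[str, float]:
    """Derive rough averages from the rounded totals (hypothetical helper)."""
    return {
        "tokens_per_page": NUM_TEXT_TOKENS / NUM_PAGES,
        "image_urls_per_page": NUM_IMAGE_URLS / NUM_PAGES,
        "token_ratio_vs_deepseekmath": NUM_TEXT_TOKENS / DEEPSEEKMATH_TOKENS,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:.2f}")
```

Because the inputs are rounded, the results are only order-of-magnitude guides: roughly 1,700 text tokens and 3.5 image URLs per page on average, with a token budget about one third of the 120 billion tokens cited for DeepSeekMath-1.3B.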

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark

Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning