InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You
2024-09-19
Abstract: Pre-training on large-scale, high-quality datasets is crucial for enhancing the reasoning capabilities of Large Language Models (LLMs), especially in specialized domains such as mathematics. Despite the recognized importance, the Multimodal LLMs (MLLMs) field currently lacks a comprehensive open-source pre-training dataset specifically designed for mathematical reasoning. To address this gap, we introduce InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. We provide a detailed overview of our data collection and processing pipeline. To demonstrate the robustness of InfiMM-WebMath-40B, we conducted evaluations in both text-only and multimodal settings. Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model, delivering results comparable to DeepSeekMath-1.3B, which uses 120 billion tokens for the same model size. Nevertheless, with the introduction of our multi-modal math pre-training dataset, our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math. We release our data at https://huggingface.co/datasets/Infi-MM/InfiMM-WebMath-40B.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses the lack of a high-quality, large-scale, open-source pre-training dataset for mathematical reasoning in multimodal large language models (MLLMs). Large companies may hold proprietary datasets of this kind, but they are not released, which leaves the open-source community short of data for building MLLMs with strong mathematical reasoning capabilities. Specifically, the paper makes three points:

1. **Enhancement of Mathematical Reasoning Ability**: Large-scale, high-quality datasets are crucial for improving the reasoning ability of large language models (LLMs) and MLLMs, particularly in mathematics.
2. **Importance of Multimodal Data**: Mathematical knowledge is expressed not only in text but also in visual elements such as charts and geometric diagrams. Most existing open-source datasets contain only text and therefore cannot use these visual elements to improve a model's mathematical reasoning.
3. **Limitations of Existing Datasets**: Proprietary large-scale mathematical pre-training datasets exist but are not public, which restricts research and development in the open-source community. Existing open-source datasets, in turn, fall short in scale and quality.

To address these problems, the authors introduce InfiMM-WebMath-40B, a publicly available, large-scale multimodal mathematical pre-training dataset. It contains 24 million web documents, 85 million image URLs, and approximately 40 billion text tokens. Its aim is to fill the gap in open-source multimodal mathematical data and to advance MLLMs in mathematical reasoning.
By constructing and releasing InfiMM-WebMath-40B, the authors hope to give the open-source community a strong foundational resource, encourage further research and development, and improve the mathematical reasoning ability of multimodal large language models.
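
To put the headline figures in per-document terms, the short Python sketch below derives a few ratios from the numbers quoted in the abstract. It is only arithmetic on the published, rounded totals. The constant names and the helper `dataset_ratios` are hypothetical, introduced here for illustration, and the results are averages that say nothing about how tokens or images are distributed across pages.

```python
# Totals as quoted in the abstract (rounded figures, not exact counts).
WEB_PAGES = 24_000_000
IMAGE_URLS = 85_000_000
TEXT_TOKENS = 40_000_000_000
# Token count the abstract attributes to DeepSeekMath-1.3B pre-training.
DEEPSEEKMATH_TOKENS = 120_000_000_000


def dataset_ratios(pages, images, tokens, baseline_tokens):
    """Return simple averages derived from the dataset totals (hypothetical helper)."""
    return {
        "tokens_per_page": tokens / pages,
        "images_per_page": images / pages,
        "token_fraction_of_baseline": tokens / baseline_tokens,
    }


ratios = dataset_ratios(WEB_PAGES, IMAGE_URLS, TEXT_TOKENS, DEEPSEEKMATH_TOKENS)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

On these rounded totals, a page averages roughly 1,667 text tokens and about 3.5 image URLs, and the text budget is one third of the 120 billion tokens cited for DeepSeekMath-1.3B.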