AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning

Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Runhui Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang
2024-11-22
Abstract: In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Contrary to existing methods that rely on direct or fast thinking, our key idea is to construct long chains of thought (CoT) consisting of atomic actions in a step-by-step manner, guiding MLLMs to perform complex reasoning. To this end, we design a novel AtomThink framework composed of three key modules: (i) a CoT annotation engine that automatically generates high-quality CoT annotations to address the lack of high-quality visual mathematical data; (ii) an atomic step fine-tuning strategy that jointly optimizes an MLLM and a policy reward model (PRM) for step-wise reasoning; and (iii) four different search strategies that can be applied with the PRM to complete reasoning. Additionally, we propose AtomMATH, a large-scale multimodal dataset of long CoTs, and an atomic capability evaluation metric for mathematical tasks. Extensive experimental results show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving approximately 50% relative accuracy gains on MathVista and 120% on MathVerse. To support the advancement of multimodal slow-thinking models, we will make our code and dataset publicly available at https://github.com/Quinn777/AtomThink.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses multimodal mathematical reasoning, and specifically how to give multimodal large language models (MLLMs) a "slow-thinking" ability. Existing methods usually rely on direct or fast reasoning, which is limiting on complex problems. The paper therefore proposes a new framework, AtomThink, which guides MLLMs through complex reasoning by constructing long chains of thought (CoT) made of atomic steps performed one at a time.

### Specific problem description

1. **Limitations of existing methods**:
   - On complex reasoning tasks, existing MLLMs usually rely on direct prediction or greedy decoding, which cannot fully capture the details of the reasoning process.
   - Chain-of-thought (CoT) prompting has been introduced to strengthen model reasoning, but these methods still pay little attention to the quality of each intermediate step, so the resulting reasoning can be inaccurate and unreliable.
2. **Advantages of slow thinking**:
   - Slow thinking significantly improves performance on complex problems by extending the reasoning chain and applying test-time scaling.
   - It lets the model reason step by step and check the quality of each step, which improves overall problem-solving ability.
3. **Challenges**:
   - In the multimodal setting, modeling visual information requires more data and compute, which makes slow-thinking techniques harder to apply.
   - Existing methods try to elicit the CoT ability of MLLMs through means such as prompt engineering, but they do not fully consider the quality of each reasoning step.

### Paper solutions

To solve the above problems, the paper proposes the AtomThink framework, which mainly includes the following aspects:

1. **Multimodal CoT annotation engine**:
   - Automatically generates high-quality long-CoT data, addressing the scarcity of data for multimodal mathematical reasoning.
   - Introduces dynamic prompting and semantic-level augmentation strategies to generate multi-step reasoning paths.
2. **Atomic-step fine-tuning strategy**:
   - Fine-tunes MLLMs at the level of atomic steps to ensure the quality of each reasoning step.
   - Uses the policy reward model (PRM) to guide decisions during reasoning.
3. **Four different search strategies**:
   - Includes path-search and step-search strategies for finding the best reasoning path.
   - For example, greedy search and beam search are used to improve the accuracy and consistency of the reasoning process.
4. **Atomic-ability evaluation metric**:
   - Proposes an atomic-ability evaluation method based on result supervision, which analyzes model performance on different reasoning steps in a fine-grained manner.

### Experimental results

In experiments on multiple public datasets, AtomThink significantly improves the performance of the baseline MLLMs, achieving relative accuracy improvements of approximately 50% on MathVista and 120% on MathVerse.

### Summary

The main contributions of the paper are:

- A comprehensive framework, AtomThink, which significantly improves the performance of MLLMs through atomic-step reasoning.
- An atomic-ability evaluation method based on result supervision, which reveals how the ability of MLLMs is distributed across different types of atomic steps.
- A large-scale multimodal long-CoT dataset, AtomMATH, which focuses on multimodal mathematical tasks.

Through these improvements, AtomThink not only enhances the reasoning ability of the model but also provides data and methods that future research can build on.
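
The step-search idea mentioned under the search strategies can be illustrated with a minimal sketch of step-level beam search guided by a scoring function. This is not the authors' implementation: `step_beam_search`, `propose_steps`, `score_chain`, and `is_final` are hypothetical names, and the toy proposer and scorer stand in for an MLLM and a trained PRM. Setting `beam_width=1` reduces the procedure to greedy step selection.

```python
from typing import Callable, List

Chain = List[str]  # a reasoning chain is a list of atomic steps


def step_beam_search(
    question: str,
    propose_steps: Callable[[str, Chain], List[str]],  # stand-in for the MLLM
    score_chain: Callable[[str, Chain], float],        # stand-in for the PRM
    is_final: Callable[[Chain], bool],
    beam_width: int = 2,
    max_steps: int = 8,
) -> Chain:
    """Grow reasoning chains one atomic step at a time, keeping the top-scoring ones."""
    beams: List[Chain] = [[]]
    for _ in range(max_steps):
        candidates: List[Chain] = []
        for chain in beams:
            if chain and is_final(chain):
                candidates.append(chain)  # finished chains are carried over unchanged
                continue
            for step in propose_steps(question, chain):
                candidates.append(chain + [step])
        if not candidates:
            break  # nothing could be extended; return the best chain so far
        # Rank partial chains by score and keep only the best beam_width of them.
        candidates.sort(key=lambda c: score_chain(question, c), reverse=True)
        beams = candidates[:beam_width]
        if all(is_final(c) for c in beams):
            break
    return beams[0]


# Toy stand-ins (assumptions for illustration only).
def toy_propose(question: str, chain: Chain) -> List[str]:
    if len(chain) == 0:
        return ["Compute 3 * 4 = 12", "Compute 2 + 3 = 5"]
    if len(chain) == 1:
        if "12" in chain[0]:
            return ["Answer: 2 + 12 = 14"]
        return ["Answer: 5 * 4 = 20"]
    return []


def toy_score(question: str, chain: Chain) -> float:
    # Fraction of steps that a hand-written "oracle" considers correct.
    good = {"Compute 3 * 4 = 12", "Answer: 2 + 12 = 14"}
    return sum(1.0 for s in chain if s in good) / len(chain)


def toy_is_final(chain: Chain) -> bool:
    return chain[-1].startswith("Answer:")


best = step_beam_search("2 + 3 * 4", toy_propose, toy_score, toy_is_final, beam_width=2)
greedy = step_beam_search("2 + 3 * 4", toy_propose, toy_score, toy_is_final, beam_width=1)
print(best)
```

In this toy setup the chain that applies multiplication before addition scores highest at every step, so both the beam and greedy variants select it. With a real PRM the scores are learned estimates, and a wider beam trades extra compute for a better chance of recovering from a weak early step.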