Autonomous Data Selection with Language Models for Mathematical Texts

Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao
2024-10-29
Abstract: To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional approaches that rely on supervised fine-tuning or on classifiers trained with human-annotated data, our approach, Autonomous Data Selection (AutoDS), uses meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter language model on our curated dataset, achieving substantial improvements in downstream performance on the MATH, GSM8K, and BIG-Bench Hard (BBH) tasks while using orders of magnitude fewer tokens than previous continual pretraining works. Our method shows a twofold increase in pretraining token efficiency compared to state-of-the-art baselines, underscoring the potential of our approach for enhancing models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
Computation and Language, Artificial Intelligence, Machine Learning
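
The abstract describes using a meta-prompted language model as a zero-shot verifier but does not spell out how a verdict becomes a selection decision. The sketch below shows one plausible reading, under stated assumptions: the model is asked a yes/no question about a document, the logits of the two answer tokens are turned into a score with a two-way softmax, and documents above a cutoff are kept. The names `yes_no_score`, `select_documents`, `fake_get_logits`, and the `threshold` value are hypothetical and chosen for illustration only; the stand-in logit function replaces a real model call.

```python
import math


def yes_no_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax: share of probability mass on 'YES' versus 'NO'."""
    # Subtract the max logit first for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_documents(docs, get_logits, threshold=0.6):
    """Keep (doc, score) pairs whose score meets the threshold.

    `get_logits` is an assumed callable that returns the pair
    (logit_yes, logit_no) for a document; in practice it would wrap
    a prompted language model. `threshold` is an illustrative value.
    """
    kept = []
    for doc in docs:
        logit_yes, logit_no = get_logits(doc)
        score = yes_no_score(logit_yes, logit_no)
        if score >= threshold:
            kept.append((doc, score))
    return kept


def fake_get_logits(doc):
    """Toy stand-in for a model call (assumption, not a real verifier)."""
    return (2.0, 0.0) if "theorem" in doc.lower() else (0.0, 2.0)


corpus = [
    "Theorem: every bounded monotone sequence converges.",
    "Buy cheap tickets now!",
]
selected = select_documents(corpus, fake_get_logits)
```

Because the score is continuous rather than a hard label, the same pass over the corpus can support several cutoffs, which is one reason a softmax-style score is a natural fit for this kind of filtering.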