AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts

Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao
DOI: https://doi.org/10.48550/arxiv.2402.07625
2024-01-01
Abstract: To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or trained classifiers with human-annotated data, our approach, Autonomous Data Selection (AutoDS), utilizes meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously. To demonstrate the efficacy of our method, we continuously pretrained a 7B-parameter language model on our curated dataset, achieving substantial improvements in downstream performance on the MATH, GSM8K, and BIG-Bench Hard (BBH) tasks with a token amount reduced by orders of magnitude compared to previous continual pretraining works. Our method showcases a 2 times increase in pretraining token efficiency compared to state-of-the-art baselines, underscoring the potential of our approach in enhancing models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
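The abstract does not spell out how a zero-shot verifier's judgment becomes a selection decision. One plausible reading, sketched here purely as an illustration, is to ask a base model a yes/no question about each document, turn the logits of the two answer tokens into a score with a two-way softmax, and keep documents above a cutoff. The function names, the "YES"/"NO" answer tokens, the example logits, and the 0.6 threshold are all assumptions made for this sketch and are not taken from the abstract.

```python
import math


def lm_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over hypothetical 'YES' and 'NO' answer-token logits.

    Assumption: the verifier model was prompted with a yes/no question about
    a document, and these are the logits it assigned to the two answers.
    """
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_documents(scored_logits, threshold=0.6):
    """Return ids of documents whose score meets a hypothetical threshold."""
    return [
        doc_id
        for doc_id, (logit_yes, logit_no) in scored_logits.items()
        if lm_score(logit_yes, logit_no) >= threshold
    ]


# Made-up logits for three documents, for illustration only.
example = {
    "doc_a": (2.0, 0.5),   # verifier leans strongly toward YES
    "doc_b": (-1.0, 1.0),  # verifier leans toward NO
    "doc_c": (0.3, 0.3),   # verifier is undecided (score 0.5)
}
selected = select_documents(example, threshold=0.6)
```

A continuous score of this kind, rather than a hard yes/no label, would let the cutoff be tuned to trade dataset size against quality; whether the authors make that choice is not stated in the abstract.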