Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang
2024-12-12
Abstract: We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how far a small language model's reasoning and problem-solving performance can be pushed by improving data quality and training methods rather than by adding parameters. It introduces phi-4, a 14-billion-parameter language model whose development centers on data quality. Unlike most language models, which rely on organic data sources (such as web content or code) for pre-training, phi-4 strategically incorporates a large amount of synthetic data. This synthetic data is used not only in pre-training but also in the mid-training and post-training stages. By optimizing the training curriculum and data mixture and introducing new post-training techniques, the authors improve the model's STEM-focused Q&A capabilities to the point where it surpasses its teacher model, GPT-4.

### Main problems to be solved

1. **Improve reasoning and problem-solving abilities**: By training on high-quality synthetic data, phi-4 performs strongly on reasoning and problem-solving tasks, especially STEM-focused Q&A.
2. **Reduce dependence on large-scale models**: Although relatively small (14 billion parameters), phi-4 outperforms larger models on multiple benchmarks, which demonstrates the importance of data quality.
3. **Prevent overfitting and data contamination**: The paper describes how it keeps the model from overfitting to specific benchmarks through data decontamination and evaluation on fresh datasets.

### Specific measures

1. **Generation and use of synthetic data**:
   - **Multi-agent prompting**: Generate diverse synthetic data through multi-agent conversations.
   - **Self-revision workflow**: The model critiques and then improves its own generated content.
   - **Instruction reversal**: Convert existing code snippets into the instructions that would produce them, to strengthen the model's instruction following.
2. **Screening and filtering of organic data**:
   - Carefully screen and filter organic data from the web, books, and code repositories to extract content with high complexity, reasoning depth, and educational value.
   - Use a multi-stage filtering process to ensure data quality.
3. **Post-training techniques**:
   - **Supervised fine-tuning (SFT)**: Use carefully curated user prompts to generate multiple model responses and select the best response.
   - **Direct preference optimization (DPO)**: Generate DPO pairs through rejection sampling and LLM evaluation, and in part through the pivotal token search method.

### Experimental results

- **Benchmark performance**: phi-4 performs strongly on multiple standard benchmarks, especially on reasoning and problem-solving tasks. For example, on GPQA (graduate-level STEM Q&A) and MATH (mathematics competition problems), phi-4 surpasses its teacher model GPT-4.
- **Fresh dataset test**: On the November 2024 AMC-10 and AMC-12 mathematics competitions, phi-4 outperforms many larger models, which supports its robustness and generalization on data released after training.

Through these techniques, phi-4 not only outperforms its teacher model on these benchmarks but also remains competitive in cost and latency, demonstrating the importance of data quality in language model development.
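
The DPO pair generation listed under the post-training techniques (rejection sampling plus LLM evaluation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `build_dpo_pair`, the `sample` and `score` callables, and the `num_samples` and `min_gap` parameters are all hypothetical names introduced here. The sketch assumes that several responses are drawn per prompt, a judge assigns each a numeric score, and the highest-scored and lowest-scored responses become the chosen and rejected sides of a pair.

```python
from typing import Callable, Optional, Tuple


def build_dpo_pair(
    prompt: str,
    sample: Callable[[str], str],        # hypothetical: draws one model response
    score: Callable[[str, str], float],  # hypothetical: judge score, higher is better
    num_samples: int = 4,                # assumed sampling budget per prompt
    min_gap: float = 1.0,                # assumed minimum score gap for a usable pair
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) for one prompt, or None if no clear preference."""
    # Rejection sampling: draw several candidate responses for the same prompt.
    responses = [sample(prompt) for _ in range(num_samples)]

    # Evaluation: score every candidate and order them from worst to best.
    scored = sorted(((score(prompt, r), r) for r in responses), key=lambda p: p[0])
    (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]

    # Skip prompts where the judge cannot separate the best from the worst.
    if high_score - low_score < min_gap:
        return None
    return chosen, rejected
```

In practice `sample` would wrap a call to the model being trained and `score` would wrap an LLM judge or an answer checker; both are left as plain callables so the sketch runs with any stand-in.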