Abstract: The rise of large language models (LLMs) has created a significant disparity: industrial research labs, with their computational resources, expert teams, and advanced infrastructure, can effectively fine-tune LLMs, while individual developers and small organizations face barriers due to limited resources. In this paper, we aim to bridge this gap by presenting a comprehensive study of supervised fine-tuning of LLMs using instruction-tuning datasets that span diverse knowledge domains and skills. We focus on small LLMs (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models and document these configurations in detail, revealing findings that challenge several common training practices, including the hyperparameter recommendations from TULU and the phased training recommended by Orca. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance, enabling early termination of sub-optimal runs and significant computational savings; (iii) through a thorough exploration of hyperparameters such as warmup steps and learning rate schedules, we provide guidance for practitioners and find that certain simplifications do not compromise performance; and (iv) we observe no significant difference in performance between phased and stacked training strategies, but stacked training is simpler and more sample efficient. With these findings holding robustly across datasets and models, we hope this study serves as a guide for practitioners fine-tuning small LLMs and promotes a more inclusive environment for LLM research.
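As an illustration of insight (ii), the following is a minimal sketch of how early-stage gradient norms might be used to decide which runs in a sweep to stop early. It is not the authors' procedure: the function names, the `window` and `keep` parameters, the run names, and the logged values are all hypothetical, and the abstract does not specify any thresholds or ranking rule.

```python
from statistics import mean


def rank_runs_by_early_grad_norm(early_grad_norms, window=100):
    """Rank runs by mean gradient norm over the first `window` logged steps.

    Runs with the lowest early gradient norm come first. This ranking rule
    is an assumption made for illustration, not a rule from the paper.
    """
    scores = {
        name: mean(norms[:window])
        for name, norms in early_grad_norms.items()
        if norms  # skip runs that have not logged anything yet
    }
    return sorted(scores, key=scores.get)


def runs_to_stop(early_grad_norms, keep=1, window=100):
    """Return the runs that fall outside the `keep` best-ranked runs."""
    ranked = rank_runs_by_early_grad_norm(early_grad_norms, window)
    return ranked[keep:]


# Hypothetical early-training logs: run name -> gradient norms per logged step.
logs = {
    "large_batch_low_lr": [0.8, 0.7, 0.6],
    "small_batch_high_lr": [2.5, 2.1, 1.9],
    "medium": [1.2, 1.1, 1.0],
}

ranking = rank_runs_by_early_grad_norm(logs)
candidates_to_stop = runs_to_stop(logs, keep=1)
```

In practice such a signal would likely be combined with the early loss values the abstract also mentions, and checked against held-out benchmarks before any run is discarded.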

Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching

On the (In)Effectiveness of Large Language Models for Chinese Text Correction

Empirical Study of LLM Fine-Tuning for Text Classification in Legal Document Review

Fine-tuning Large Language Models for Entity Matching

Fine-Tuning Large Language Models in Education

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Fine-grained LLM Agent: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback

Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners

I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses

Fine-tuning Large Language Models for Domain-specific Machine Translation

Evaluating LLMs' grammatical error correction performance in learner Chinese

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs

Unveiling the Generalization Power of Fine-Tuned Large Language Models

LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers

Fine-Tuning Medical Language Models for Enhanced Long-Contextual Understanding and Domain Expertise

The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

On Learning to Summarize with Large Language Models as References

Fine-Tuning LLaMA for Multi-Stage Text Retrieval