Abstract: The rise of large language models (LLMs) has created a significant disparity: industrial research labs, with their computational resources, expert teams, and advanced infrastructure, can effectively fine-tune LLMs, while individual developers and small organizations face barriers due to limited resources. In this paper, we aim to bridge this gap by presenting a comprehensive study on supervised fine-tuning of LLMs using instruction-tuning datasets spanning diverse knowledge domains and skills. We focus on small LLMs (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. We provide detailed documentation of these configurations, revealing findings that challenge several common training practices, including the hyperparameter recommendations from TULU and the phased training recommended by Orca. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance, enabling early termination of sub-optimal runs and significant computational savings; (iii) through a thorough exploration of hyperparameters like warmup steps and learning rate schedules, we provide guidance for practitioners and find that certain simplifications do not compromise performance; and (iv) we observe no significant difference in performance between phased and stacked training strategies, but stacked training is simpler and more sample-efficient. With these findings holding robustly across datasets and models, we hope this study serves as a guide for practitioners fine-tuning small LLMs and promotes a more inclusive environment for LLM research.
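Insight (ii) suggests that runs can be compared early and the less promising ones stopped. The following is a minimal sketch of that idea, not the procedure used in the paper: it assumes each run has logged per-step gradient norms, and it ranks runs by the mean gradient norm over an early window, treating lower as more promising. The function names, the `window` and `keep` parameters, and the ranking rule itself are hypothetical choices made for illustration.

```python
from statistics import mean


def rank_runs_by_early_grad_norm(grad_norms_by_run, window=100):
    """Sort run names from lowest to highest mean early gradient norm.

    grad_norms_by_run maps a run name to its list of per-step gradient
    norms (hypothetical logging format). Only the first `window` steps
    are considered. Runs with no logged steps are skipped.
    """
    scores = {
        name: mean(norms[:window])
        for name, norms in grad_norms_by_run.items()
        if norms
    }
    return sorted(scores, key=scores.get)


def runs_to_stop(grad_norms_by_run, keep=1, window=100):
    """Return the runs that fall outside the `keep` best-ranked ones.

    Assumption: a lower early gradient norm signals a better final
    model, following insight (ii) in the abstract.
    """
    ranked = rank_runs_by_early_grad_norm(grad_norms_by_run, window)
    return ranked[keep:]
```

In practice the window length and the number of runs to keep would need to be tuned, and the abstract also names higher early loss as an indicator, which this sketch does not use.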

SelectIT: Selective Instruction Tuning for Large Language Models Via Uncertainty-Aware Self-Reflection

CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning

Instruction Tuning for Large Language Models: A Survey

Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Boosting LLM via Learning from Data Iteratively and Selectively

Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection

Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks

AlpaGasus: Training A Better Alpaca with Fewer Data

R-Tuning: Instructing Large Language Models to Say `I Don't Know'

SelectLLM: Can LLMs Select Important Instructions to Annotate?

LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms

INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs

Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

R-tuning: Teaching large language models to refuse unknown questions

TAIA: Large Language Models are Out-of-Distribution Data Learners

Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?