Abstract: The rise of large language models (LLMs) has created a significant disparity: industrial research labs, with their computational resources, expert teams, and advanced infrastructure, can effectively fine-tune LLMs, while individual developers and small organizations face barriers due to limited resources. In this paper, we aim to bridge this gap by presenting a comprehensive study on supervised fine-tuning of LLMs using instruction-tuning datasets spanning diverse knowledge domains and skills. We focus on small-sized LLMs (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. We provide detailed documentation of these configurations, revealing findings that challenge several common training practices, including the hyperparameter recommendations from TULU and the phased training recommended by Orca. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and the Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance, enabling early termination of sub-optimal runs and significant computational savings; (iii) through a thorough exploration of hyperparameters such as warmup steps and learning rate schedules, we provide guidance for practitioners and find that certain simplifications do not compromise performance; and (iv) we observe no significant difference in performance between phased and stacked training strategies, but stacked training is simpler and more sample-efficient. With these findings holding robustly across datasets and models, we hope this study serves as a guide for practitioners fine-tuning small LLMs and promotes a more inclusive environment for LLM research.
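
As an illustration of insight (ii), the sketch below ranks several fine-tuning runs by their early-stage training signals, preferring lower early gradient norms and, as a tie-breaker, higher early loss. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names, the `window` size, the ranking rule, and the example numbers are all hypothetical.

```python
from statistics import mean


def early_signal(grad_norms, losses, window=100):
    """Average the first `window` logged gradient norms and losses of one run.

    `window` is an assumed cutoff for "early-stage"; the abstract does not
    specify how many steps count as early.
    """
    return {
        "mean_grad_norm": mean(grad_norms[:window]),
        "mean_loss": mean(losses[:window]),
    }


def rank_runs(runs, window=100):
    """Return run names ordered from most to least promising.

    Assumed heuristic: lower early gradient norm ranks first; among runs with
    equal gradient norm, higher early loss ranks first.
    """
    signals = {
        name: early_signal(log["grad_norms"], log["losses"], window)
        for name, log in runs.items()
    }
    return sorted(
        signals,
        key=lambda name: (
            signals[name]["mean_grad_norm"],
            -signals[name]["mean_loss"],
        ),
    )


if __name__ == "__main__":
    # Made-up logs for two hypothetical runs, for illustration only.
    example_runs = {
        "run_a": {"grad_norms": [2.0, 1.8, 1.7], "losses": [1.0, 0.9, 0.8]},
        "run_b": {"grad_norms": [0.5, 0.4, 0.4], "losses": [1.4, 1.3, 1.2]},
    }
    print(rank_runs(example_runs, window=3))
```

Runs near the bottom of such a ranking would be candidates for early termination. Any real threshold or window would need to be validated against final benchmark scores for the model and dataset at hand.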

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Learning Global Controller in Latent Space for Parameter-Efficient Fine-Tuning

Unveiling the Generalization Power of Fine-Tuned Large Language Models

TAIA: Large Language Models are Out-of-Distribution Data Learners

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Parameter-efficient fine-tuning of large-scale pre-trained language models

Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning

Crafting Efficient Fine-Tuning Strategies for Large Language Models

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

HiFi: High-Information Attention Heads Hold for Parameter-Efficient Model Adaptation

Attending Via Both Fine-tuning and Compressing

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models

I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses

Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance

Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach

Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts

Parameter-efficient Tuning for Large Language Model Without Calculating Its Gradients

Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model