Abstract: The ever-growing ecosystem of LLMs has posed a challenge in selecting the most appropriate pre-trained model to fine-tune amidst a sea of options. Given constrained resources, fine-tuning all models and making selections afterward is unrealistic. In this work, we formulate this resource-constrained selection task into predicting fine-tuning performance and illustrate its natural connection with Scaling Law. Unlike pre-training, we find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase". We also explain why existing Scaling Law fails to capture this phase transition phenomenon both theoretically and empirically. To address this, we introduce the concept of "pre-learned data size" into our Rectified Scaling Law, which overcomes theoretical limitations and fits experimental results much better. By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption, while other methods may provide negatively correlated selection. The project page is available at http://rectified-scaling-law.github.io.
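
The abstract does not state the equations themselves. As a hedged illustration only, suppose the two laws take the following forms, where $D$ is the fine-tuning data size, $D_l$ is the "pre-learned data size", $\beta$ is the power, and $B$ and $E$ are fitted constants. The symbols and the exact functional forms are assumptions made here, not quoted from the text above:

$$
\hat{L}_{\text{vanilla}}(D) = \frac{B}{D^{\beta}} + E,
\qquad
\hat{L}_{\text{rectified}}(D) = \frac{B}{D_l + D^{\beta}} + E
$$

Under this assumed form, the two phases fall out of comparing $D^{\beta}$ with $D_l$. When $D^{\beta} \ll D_l$, the loss is close to $B/D_l + E$ and falls only slowly as data is added, which matches the description of a "pre-power phase". When $D^{\beta} \gg D_l$, the $D_l$ term becomes negligible and the expression reduces to the vanilla power law, the "power phase". A law without the $D_l$ term has no way to produce the first regime.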

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses how to select the most suitable large language model (LLM) to fine-tune under resource constraints. As the LLM ecosystem grows, choosing the most appropriate pre-trained model from the many available candidates becomes highly challenging. Because of limits on time, compute, and storage, it is not feasible to fine-tune every candidate model before making a selection. Selection by intuition, such as choosing the model with the most parameters or the best zero-shot performance, is also unreliable.

Specifically, the paper focuses on the following aspects:

1. **Model Selection under Resource Constraints**: How to efficiently and accurately predict a model's performance after fine-tuning and select the best model under limited resources.
2. **Limitations of Existing Methods**: Most existing model selection methods are designed for classification and regression tasks and do not transfer to fine-tuning generative language models.
3. **Fine-Tuning Scaling Laws**: How performance changes with data volume during fine-tuning, in particular the discovery of a previously unobserved "pre-power phase."

### Solutions

To address the above problems, the paper proposes the following solutions:

1. **Rectified Scaling Law**: Introducing the concept of "pre-learned data size" into existing scaling laws so that they fit experimental results better.
2. **Novel LLM Selection Algorithm**: Building on the rectified scaling law, an algorithm called "Accept then Stop" (AtS) that selects a near-optimal model while using hundreds of times fewer resources.

### Main Contributions

1. **Theoretical Analysis**: Explaining why existing scaling laws fail to capture the phase transition during fine-tuning, and proposing a rectified scaling law.
2. **Experimental Validation**: Validating the rectified scaling law and the AtS algorithm through extensive experiments, demonstrating their robustness and accuracy across different datasets and resource constraints.
3. **Practical Application**: Providing methods and tools for selecting LLMs to fine-tune in resource-constrained environments, improving both the efficiency and the accuracy of the selection process.

In summary, the paper combines theoretical analysis with experimental validation to show that the proposed methods can efficiently and accurately select the most suitable large language model for fine-tuning under resource constraints.
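
To make the idea of selecting a model by predicted fine-tuning performance concrete, here is a minimal sketch in Python. It is not the paper's AtS algorithm, whose details are not given in this summary. It assumes a rectified law of the form `loss = B / (D_l + D**beta) + E`, fits it to losses measured on small data subsets, and ranks candidates by the loss extrapolated to the full dataset. The function names, the grid-search fitting procedure, the model names, and all numbers are hypothetical and chosen only for illustration.

```python
def rectified_loss(d, B, D_l, beta, E):
    """Assumed rectified form: D_l plays the role of the pre-learned data size."""
    return B / (D_l + d ** beta) + E


def fit_rectified(sizes, losses, D_l_grid, beta_grid):
    """Grid-search D_l and beta; solve B and E by ordinary least squares.

    For fixed (D_l, beta) the model is linear in x = 1 / (D_l + d**beta),
    so B (slope) and E (intercept) have a closed-form solution.
    """
    n = len(sizes)
    mean_y = sum(losses) / n
    best_err, best_params = None, None
    for D_l in D_l_grid:
        for beta in beta_grid:
            xs = [1.0 / (D_l + d ** beta) for d in sizes]
            mean_x = sum(xs) / n
            sxx = sum((x - mean_x) ** 2 for x in xs)
            if sxx == 0.0:
                continue  # degenerate design, no slope can be estimated
            sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
            B = sxy / sxx
            E = mean_y - B * mean_x
            err = sum((B * x + E - y) ** 2 for x, y in zip(xs, losses))
            if best_err is None or err < best_err:
                best_err = err
                best_params = {"B": B, "D_l": D_l, "beta": beta, "E": E}
    if best_params is None:
        raise ValueError("need at least two distinct data sizes to fit")
    return best_params


def select_model(observations, full_size, D_l_grid, beta_grid):
    """Return the candidate with the lowest extrapolated loss at full_size."""
    predicted = {}
    for name, (sizes, losses) in observations.items():
        params = fit_rectified(sizes, losses, D_l_grid, beta_grid)
        predicted[name] = rectified_loss(full_size, **params)
    return min(predicted, key=predicted.get), predicted


# Synthetic demo with made-up parameters for two hypothetical candidates.
D_L_GRID = [10.0 * 2 ** k for k in range(12)]   # 10, 20, ..., 20480
BETA_GRID = [k / 10 for k in range(1, 11)]      # 0.1, 0.2, ..., 1.0
SMALL_SIZES = [200, 400, 800, 1600, 3200]       # cheap fine-tuning runs
FULL_SIZE = 1_000_000                           # size we cannot afford to try

TRUE_PARAMS = {
    "model_a": {"B": 300.0, "D_l": 160.0, "beta": 0.5, "E": 1.0},
    "model_b": {"B": 5000.0, "D_l": 1280.0, "beta": 0.7, "E": 0.8},
}

# Noise-free "measurements" generated from the assumed law itself.
observations = {
    name: (SMALL_SIZES, [rectified_loss(d, **p) for d in SMALL_SIZES])
    for name, p in TRUE_PARAMS.items()
}

best_name, predicted = select_model(observations, FULL_SIZE, D_L_GRID, BETA_GRID)
print("selected:", best_name)
for name, loss in sorted(predicted.items()):
    print(f"{name}: predicted loss at full size = {loss:.4f}")
```

In this synthetic setup, `model_a` has the lower loss on every small subset, yet `model_b` has the lower extrapolated loss at the full data size, so ranking by small-subset performance alone would pick the wrong model. With real, noisy measurements a proper nonlinear least-squares fit would be a more robust choice than this coarse grid search.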

Selecting Large Language Model to Fine-tune via Rectified Scaling Law

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

Temporal Scaling Law for Large Language Models

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

LoRA$^2$ : Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

Scaling Law for Language Models Training Considering Batch Size

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Scaling Laws for Downstream Task Performance of Large Language Models

Optimization Hyper-parameter Laws for Large Language Models

The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis

Labeling supervised fine-tuning data with the scaling law

Does RLHF Scale? Exploring the Impacts From Data, Model, and Method

Scaling Laws for Multilingual Language Models

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

An Emulator for Fine-Tuning Large Language Models using Small Language Models

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

Scaling Laws for Discriminative Classification in Large Language Models

AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

A Hitchhiker's Guide to Scaling Law Estimation