Need a Small Specialized Language Model? Plan Early!

David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

2024-10-31

Abstract: Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get good specialized small language models using a large, generic, pretraining set and a limited amount of specialized data. We consider two scenarios, depending on whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a single pretrained model for each task. In the first scenario, we propose an effective solution based on importance sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

This paper addresses how to train small, efficient language models (SLMs) that still perform well when specialized data is limited and the inference budget is small. Specifically, it explores how to obtain a good specialized small language model when a large, generic pretraining set and only a small amount of specialized data are available.

The paper considers two scenarios:

1. **Models for a single specialized domain**: one can afford to pretrain a separate model for each specialization task.
2. **Models for multiple specialized domains**: a single pretrained model must be adapted cheaply to each specialization task in order to reduce cost.

To address these scenarios, the paper proposes two main methods:

- **Importance sampling**: resample the generic pretraining set so that it imitates the specialized data, then train a small model on the resampled set.
- **Projected networks (PN)**: a large network whose parameters can be linearly projected into small networks, one for each specialized domain.

The paper verifies the effectiveness of both methods experimentally across domains, training set sizes, and training budgets, and compares them with baseline methods. The experimental results show that importance sampling performs better when a large amount of specialized data is available, while projected networks perform better when only a small amount of specialized data is available.
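The resampling idea behind importance sampling can be illustrated with a minimal sketch. It assumes, purely for illustration, that every generic and specialized example has already been assigned to one of a fixed number of clusters (for instance by clustering text embeddings); the summary above does not say how the paper estimates its weights. The function names `importance_weights` and `resample_generic` are hypothetical. Each generic example is weighted by the ratio of the specialized cluster frequency to the generic cluster frequency, and the generic set is then resampled with those weights.

```python
import numpy as np


def importance_weights(generic_clusters, specialized_clusters, num_clusters, eps=1e-8):
    """Weight each generic example by p_spec(c) / p_generic(c) for its cluster c.

    Assumption: cluster ids are integers in [0, num_clusters).
    """
    generic_counts = np.bincount(generic_clusters, minlength=num_clusters).astype(float)
    spec_counts = np.bincount(specialized_clusters, minlength=num_clusters).astype(float)
    p_generic = generic_counts / generic_counts.sum()
    p_spec = spec_counts / spec_counts.sum()
    # eps only guards against division by zero for clusters that are empty
    # in the generic set; no generic example carries such a weight anyway.
    ratio = p_spec / np.maximum(p_generic, eps)
    return ratio[generic_clusters]


def resample_generic(generic_clusters, specialized_clusters, num_clusters,
                     num_samples, seed=0):
    """Draw indices into the generic set so its cluster mix imitates the specialized set."""
    weights = importance_weights(generic_clusters, specialized_clusters, num_clusters)
    total = weights.sum()
    if total <= 0:
        raise ValueError("generic and specialized sets share no cluster")
    rng = np.random.default_rng(seed)
    return rng.choice(len(generic_clusters), size=num_samples, replace=True,
                      p=weights / total)


if __name__ == "__main__":
    # Toy data: the generic set is mostly cluster 0, the specialized set mostly cluster 1.
    generic = np.array([0] * 80 + [1] * 20)
    specialized = np.array([1] * 9 + [0] * 1)
    idx = resample_generic(generic, specialized, num_clusters=2, num_samples=10_000)
    print("fraction of resampled examples in cluster 1:", (generic[idx] == 1).mean())
```

In this toy setup the resampled set is dominated by cluster 1 even though cluster 1 makes up only 20% of the generic data. A small model would then be pretrained on the resampled indices rather than on the original generic distribution.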

