Need a Small Specialized Language Model? Plan Early!

David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun
2024-10-31
Abstract: Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get good specialized small language models using a large, generic, pretraining set and a limited amount of specialized data. We consider two scenarios, depending on whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a single pretrained model for each task. In the first scenario, we propose an effective solution based on importance sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to train small, efficient language models (SLMs) that perform well when both the specialized data and the inference budget are limited. Specifically, it explores how to obtain a good specialized small language model from a large, generic pretraining set and a small amount of specialized data.

The paper considers two scenarios:

1. **One pretrained model per specialization**: one can afford to pretrain a separate model for each specialization task.
2. **One pretrained model shared across specializations**: a single pretrained model must be cheaply adapted to each task in order to reduce costs.

To address these scenarios, the paper proposes two main methods:

- **Importance Sampling**: resample the pretraining set to imitate the specialized data, then train a small model on the resampled set.
- **Projected Networks (PN)**: a large network whose parameters can be linearly projected into small networks, one for each specialized domain.

The paper experimentally evaluates both methods across domains, training set sizes, and training budgets, and compares them with baseline methods. The experimental results show that Importance Sampling performs better when a large amount of specialized data is available, while Projected Networks perform better when only a small amount of specialized data is available.
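The two methods can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical. For importance sampling, it assumes each document has already been assigned a discrete cluster id (the summary does not say how importance weights are estimated) and weights each generic document by the ratio of its cluster's frequency in the specialized set to its frequency in the generic set. For projected networks, it assumes the large network stores several same-shaped candidate matrices and that a domain-specific vector of mixing coefficients combines them linearly into one small matrix. That is only one possible form of linear projection.

```python
import random
from collections import Counter


def cluster_importance_weights(generic_clusters, specialized_clusters, smoothing=1e-6):
    """Estimate w(c) = p_specialized(c) / p_generic(c) for each generic cluster c."""
    generic_counts = Counter(generic_clusters)
    specialized_counts = Counter(specialized_clusters)
    weights = {}
    for cluster, count in generic_counts.items():
        p_generic = count / len(generic_clusters)
        p_specialized = specialized_counts[cluster] / len(specialized_clusters)
        # Smoothing (an assumption) keeps unseen clusters at a tiny, nonzero weight.
        weights[cluster] = (p_specialized + smoothing) / p_generic
    return weights


def resample_generic(generic_docs, generic_clusters, specialized_clusters,
                     num_samples, seed=0):
    """Draw generic documents with probability proportional to their cluster weight."""
    weights = cluster_importance_weights(generic_clusters, specialized_clusters)
    doc_weights = [weights[c] for c in generic_clusters]
    rng = random.Random(seed)
    # Sampling with replacement; the result imitates the specialized cluster mix.
    return rng.choices(generic_docs, weights=doc_weights, k=num_samples)


def project_parameters(large_weights, mixing):
    """Linearly combine k same-shaped matrices into one small matrix.

    large_weights: list of k matrices (lists of rows), the "large" parameters.
    mixing: k coefficients, a hypothetical per-domain projection.
    """
    rows, cols = len(large_weights[0]), len(large_weights[0][0])
    small = [[0.0] * cols for _ in range(rows)]
    for coeff, matrix in zip(mixing, large_weights):
        for i in range(rows):
            for j in range(cols):
                small[i][j] += coeff * matrix[i][j]
    return small


if __name__ == "__main__":
    # Toy generic set: mostly news, with some code and legal text.
    generic_clusters = ["news"] * 700 + ["code"] * 200 + ["law"] * 100
    generic_docs = list(range(len(generic_clusters)))
    # Toy specialized set: mostly code.
    specialized_clusters = ["code"] * 90 + ["law"] * 10
    sample = resample_generic(generic_docs, generic_clusters,
                              specialized_clusters, num_samples=2000)
    print(Counter(generic_clusters[d] for d in sample))
```

In this toy setup the "code" cluster is upweighted because it is more frequent in the specialized set than in the generic set, so the resampled data is dominated by code documents even though they are a minority of the generic set. A small model would then be trained on the resampled data.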