Small Language Models are Good Too: An Empirical Study of Zero-Shot Classification

Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, Sophie Rosset
2024-04-17
Abstract: This study is part of the debate on the efficiency of large versus small language models for text classification by prompting. We assess the performance of small language models in zero-shot text classification, challenging the prevailing dominance of large models. Across 15 datasets, our investigation benchmarks language models from 77M to 40B parameters using different architectures and scoring functions. Our findings reveal that small models can effectively classify texts, getting on par with or surpassing their larger counterparts. We developed and shared a comprehensive open-source repository that encapsulates our methodologies. This research underscores the notion that bigger isn't always better, suggesting that resource-efficient small models may offer viable solutions for specific data classification challenges.
Artificial Intelligence
What problem does this paper attempt to address?
The core question this paper addresses is: **Does text classification by prompting require large language models (LLMs), or can small language models achieve similar or even better results?**

### Specific problem decomposition:

1. **Impact of model scale**: The paper explores the relationship between the number of model parameters and zero-shot classification performance. Does a larger model necessarily lead to better classification results?
2. **Impact of architecture selection**: How do different model architectures (such as encoder-decoder vs. decoder-only) affect zero-shot classification performance?
3. **Role of fine-tuning strategies**: Can instruction fine-tuning significantly improve the performance of small models? Does its effect depend on the specific model architecture or dataset?
4. **Choice of scoring functions**: Does the choice of scoring function have a significant impact on model performance?

### Main objectives of the paper:

By comparing the zero-shot classification performance of language models of different scales, architectures, and fine-tuning strategies on multiple datasets, the paper evaluates the potential of small language models for this task and challenges the prevailing view that "the bigger, the better".

---

### Summary of conclusions:

1. **Model scale is not a decisive factor**: On many datasets, there is no significant correlation between model size and performance. Some datasets (such as `cdr`) show a positive correlation, while others (such as `ethos` and `imdb`) show a negative correlation.
2. **Architecture selection is crucial**: For some datasets (such as `agnews`, `bbcnews`, `sms`, etc.), the model architecture has a significant impact on performance. For example, the encoder-decoder architecture may be more suitable for specific tasks.
3. **The effect of instruction fine-tuning varies by dataset**: Instruction fine-tuning significantly improves performance on some datasets (such as `agnews`, `ethos`, `imdb`, etc.), but has an insignificant or even slightly negative effect on others (such as `bbcnews`, `youtube`, `sms`).
4. **Scoring functions have limited impact**: Regardless of the model architecture, the choice of scoring function has no significant impact on performance.

---

### Significance in scientific research:

This paper offers a new perspective on text classification in resource-constrained scenarios, indicating that small language models can be an effective alternative to large models in some cases. This helps reduce computing costs and gives practical applications more flexibility. The paper also highlights the importance of model architecture and fine-tuning strategy, pointing to directions for further improving zero-shot classification.
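To make the notion of a "scoring function" in zero-shot classification by prompting more concrete, here is a minimal sketch. It is not the paper's implementation: the summary above does not name the scoring functions that were compared, so the two shown (summed log-probability and length-normalized mean log-probability of a label's tokens) are assumed, commonly used choices. The per-token log-probabilities are hypothetical placeholders for values a language model would return for each candidate label given the prompt.

```python
def score_sum(token_logprobs):
    """Assumed scoring function: total log-probability of the label's tokens."""
    return sum(token_logprobs)


def score_mean(token_logprobs):
    """Assumed scoring function: average log-probability per label token."""
    return sum(token_logprobs) / len(token_logprobs)


def classify(label_logprobs, scoring_fn):
    """Return the label whose tokens score highest under scoring_fn."""
    return max(label_logprobs, key=lambda label: scoring_fn(label_logprobs[label]))


# Hypothetical per-token log-probabilities for two candidate labels,
# standing in for model output on a prompt such as
# "Review: ... Sentiment:". These are illustrative numbers, not results.
label_logprobs = {
    "positive": [-0.9, -0.4],  # label split into two tokens
    "negative": [-1.2],        # label is a single token
}

print(classify(label_logprobs, score_sum))   # negative
print(classify(label_logprobs, score_mean))  # positive
```

On this toy input the two functions disagree because summing penalizes labels that span more tokens. That does not contradict the conclusion above, which concerns aggregate performance across datasets rather than individual predictions.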