Are the BERT Family Zero-Shot Learners? A Study on Their Potential and Limitations.
Yue Wang,Lijun Wu,Juntao Li,Xiaobo Liang,Min Zhang
DOI: https://doi.org/10.1016/j.artint.2023.103953
IF: 14.4
2023-01-01
Artificial Intelligence
Abstract:Starting from the resurgence of deep learning, language models (LMs) have never been so popular. Through simply increasing model scale and data size, large LMs pre-trained with self-supervision objectives demonstrate awe-inspiring results on both task performance and generalization. At the early stage, supervised fine-tuning is indispensable in adapting pre-trained language models (PLMs) to downstream tasks. Later on, the sustained growth of model capacity and data size, as well as newly presented pre-training techniques, make the PLMs perform well under the few-shot setting, especially in the recent paradigm of prompt-based learning. After witnessing the success of PLMs for few-shot tasks, we propose to further study the potential and limitations of PLMs for the zero-shot setting. We utilize 3 models from the most popular BERT family to launch the empirical study on 20 different datasets. We are surprised to find that some simple strategies (without the need of human efforts or unsupervised data) can yield very promising results on a few widely-used datasets, e.g., 88.34%(±0.60) accuracy on the IMDB dataset, and 84.88%(±2.83) accuracy on the Amazon dataset, which outperforms manually created prompts without engineering in achieving much better and stable performance with the accuracy of 74.06%(±13.04), 75.54%(±11.77) for comparison. However, we also observe some limitations of PLMs under the zero-shot setting, particularly for the language understanding tasks (e.g., GLUE, SuperGLUE).2