How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Peter L. Bartlett
2024-03-15
Abstract: Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters. In this paper, we study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression with a Gaussian prior. We establish a statistical task complexity bound for the attention model pretraining, showing that effective pretraining only requires a small number of independent tasks. Furthermore, we prove that the pretrained model closely matches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, by achieving nearly Bayes optimal risk on unseen tasks under a fixed context length. These theoretical findings complement prior experimental research and shed light on the statistical foundations of ICL.
Machine Learning
What problem does this paper attempt to address?
This paper studies in-context learning of linear regression with a Gaussian prior, using a pretrained, linearly parameterized single-layer linear attention model. It establishes a statistical task complexity bound showing that effective pretraining requires only a small number of independent tasks. It also shows that the pretrained model closely matches the Bayes optimal algorithm, which in this setting is optimally tuned ridge regression: under a fixed context length, the model achieves nearly Bayes optimal risk on unseen tasks. At inference time, this match holds when the context length is close to the one used in pretraining. When the two lengths differ significantly, the pretrained single-layer linear attention model can be suboptimal. The analysis introduces new techniques for handling high-order tensors, which may be of independent interest for similar problems.
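The comparison described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's procedure: covariates are taken to be isotropic Gaussian, the attention model is assumed to reduce to the one-matrix predictor `<x_query, Gamma X^T y / n>`, and `Gamma` is fitted by ordinary least squares over independent tasks as a stand-in for the pretraining algorithm. All function names and the values of `d`, `n`, `psi`, and `sigma` are hypothetical choices made for illustration.

```python
import numpy as np

def sample_task(rng, d, n, psi, sigma):
    """One task: w ~ N(0, psi^2 I), x ~ N(0, I), y = <x, w> + N(0, sigma^2) noise."""
    w = rng.normal(scale=psi, size=d)
    X = rng.normal(size=(n + 1, d))          # n context rows plus one query row
    y = X @ w + rng.normal(scale=sigma, size=n + 1)
    return X[:n], y[:n], X[n], y[n]

def ridge_predict(X, y, x_query, lam):
    """Ridge regression fitted on the context only."""
    d = X.shape[1]
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return x_query @ w_hat

def attention_predict(X, y, x_query, Gamma):
    """Assumed reduced form of the attention model: <x_query, Gamma X^T y / n>."""
    return x_query @ (Gamma @ (X.T @ y) / X.shape[0])

def pretrain_gamma(rng, num_tasks, d, n, psi, sigma):
    """Fit Gamma by least squares over independent tasks (a stand-in for SGD)."""
    feats = np.empty((num_tasks, d * d))
    targets = np.empty(num_tasks)
    for t in range(num_tasks):
        X, y, xq, yq = sample_task(rng, d, n, psi, sigma)
        s = X.T @ y / n
        feats[t] = np.outer(xq, s).ravel()   # prediction equals <Gamma, xq s^T>
        targets[t] = yq
    gamma_vec, *_ = np.linalg.lstsq(feats, targets, rcond=None)
    return gamma_vec.reshape(d, d)

# Hypothetical problem sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, n, psi, sigma = 5, 20, 1.0, 2.0
Gamma = pretrain_gamma(rng, 5000, d, n, psi, sigma)

# Estimate the risk on fresh tasks with the same context length n.
errs = {"ridge": [], "attention": [], "zero": []}
for _ in range(2000):
    X, y, xq, yq = sample_task(rng, d, n, psi, sigma)
    # lam = sigma^2 / psi^2 makes ridge the posterior mean under this prior.
    errs["ridge"].append((yq - ridge_predict(X, y, xq, sigma**2 / psi**2)) ** 2)
    errs["attention"].append((yq - attention_predict(X, y, xq, Gamma)) ** 2)
    errs["zero"].append(yq ** 2)             # trivial baseline: always predict 0
risk = {k: float(np.mean(v)) for k, v in errs.items()}
print(risk)
```

Under these assumptions, the population-risk minimizer among scalar multiples of the identity is `n / (n + 1 + d + sigma**2 / psi**2)` times the identity, so the fitted `Gamma` should land near that matrix. Changing `n` at evaluation time while keeping the pretrained `Gamma` fixed is one way to probe the context-length mismatch mentioned in the summary.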