What problem does this paper attempt to address?

This paper re-examines whether small-scale language models based on the Transformer architecture can learn linear functions through in-context learning (ICL) under different training and testing settings. Specifically, the author explores how these models perform ICL, especially how they handle new data outside the training distribution, and reveals some limitations of existing methods.

### Main problems

1. **Limitations of in-context learning**:
   - Existing research shows that some Transformer models can learn linear functions in context, but these studies usually focus on simple tasks and small-scale models. The author aims to understand whether these models can perform ICL effectively in a broader range of scenarios.
2. **Insufficient generalization ability**:
   - The author finds that although some models perform well on data from a specific distribution, their performance drops significantly on data outside the training distribution. For example, none of the Transformer models tested can correctly learn strictly monotonically increasing or decreasing linear functions, especially over larger intervals.
3. **Boundary value phenomenon**:
   - The model's predictions exhibit a "boundary value" phenomenon: when the input value exceeds a certain range, prediction performance drops sharply. This indicates that the model has not really learned the linear regression algorithm but instead relies on the projection of similar sequences in the training data.
4. **Importance of the attention mechanism**:
   - The research also explores the role of the attention mechanism in ICL. The results show that at least two attention layers are required to achieve effective ICL, and that more attention heads help improve performance.
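One way to probe for the kind of boundary behaviour described above is to compare the error on inputs drawn from a narrow range against the error on inputs from progressively wider ranges. The following is a minimal sketch under stated assumptions: `clipped_predictor` is a hypothetical toy stand-in that is exact inside an assumed bound and saturates outside it. It is not one of the paper's models, and the bound of 3.0 is arbitrary.

```python
import numpy as np

def clipped_predictor(a, b, x, bound=3.0):
    """Toy stand-in (hypothetical): exact for |x| <= bound, saturated beyond it."""
    return a * np.clip(x, -bound, bound) + b

def mse_on_range(a, b, half_width, n=1000, seed=0):
    """Mean squared error against f(x) = a*x + b for x uniform in [-half_width, half_width]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-half_width, half_width, size=n)
    target = a * x + b
    return float(np.mean((target - clipped_predictor(a, b, x)) ** 2))

if __name__ == "__main__":
    # Error stays at zero inside the assumed bound and grows as the test range widens.
    for half_width in (1.0, 3.0, 5.0, 10.0):
        print(half_width, mse_on_range(a=2.0, b=1.0, half_width=half_width))
```

A real evaluation would replace `clipped_predictor` with a trained model's in-context predictions; the sweep over `half_width` is the part that carries over.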
### Formula and experimental design

To evaluate the performance of the model, the author uses the following formula to define the autoregressive objective function:

\[
\theta^* = \arg \min_{\theta} \mathbb{E}_{x_i \in D_I, f \in D_F} \left[ \sum_{i = 0}^{k} l\left(f(x_{i + 1}), L_\theta\left((x_1, f(x_1), \ldots, f(x_i), x_{i + 1})\right)\right) \right]
\]

where:

- \( L_\theta \) is the learner,
- \( l(y, \hat{y}) = \| y - \hat{y} \|^2 \) is the squared-error loss function,
- \( f(x) = ax + b \) is a linear function, and \( a \) and \( b \) are randomly selected according to the training distribution.

### Conclusions

Through the analysis of experimental results under different distributions, the author draws the following conclusions:

- The model has not really achieved linear regression but makes projection adjustments based on the training data.
- The attention mechanism is crucial for ICL, but even models with multi-layer attention mechanisms show obvious limitations when facing data outside the training distribution.
- There is a "boundary value" phenomenon, indicating that the model's learning method depends on the data it has seen rather than real mathematical calculations.

In general, this paper reveals the limitations of current Transformer models in in-context learning and provides an important reference direction for future research.
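The autoregressive objective from the formula section can be made concrete with a small numerical sketch. Everything here is illustrative: `least_squares_learner` is a hypothetical stand-in for \( L_\theta \) (a least-squares fit, not a Transformer), and sampling \( a \), \( b \), and the inputs from a standard normal is an assumption, since the summary does not specify \( D_I \) or \( D_F \).

```python
import numpy as np

def least_squares_learner(prompt_x, prompt_y, query_x):
    """Hypothetical stand-in for L_theta: fit y = a*x + b on the prompt, then predict."""
    if len(prompt_x) == 0:
        return 0.0  # no examples yet, so fall back to a constant guess
    design = np.stack([prompt_x, np.ones_like(prompt_x)], axis=1)
    # lstsq returns the minimum-norm solution when the fit is underdetermined.
    coef, *_ = np.linalg.lstsq(design, prompt_y, rcond=None)
    return float(coef[0] * query_x + coef[1])

def autoregressive_loss(learner, f, xs):
    """Sum of squared errors when each point is predicted from the points before it."""
    ys = f(xs)
    total = 0.0
    for i in range(len(xs)):
        pred = learner(xs[:i], ys[:i], xs[i])
        total += (ys[i] - pred) ** 2
    return float(total)

def expected_loss(learner, rng, n_functions=200, k=10):
    """Monte Carlo estimate of the expectation over sampled functions and inputs."""
    losses = []
    for _ in range(n_functions):
        a, b = rng.normal(size=2)       # assumed stand-in for D_F
        xs = rng.normal(size=k + 1)     # assumed stand-in for D_I
        losses.append(autoregressive_loss(learner, lambda x: a * x + b, xs))
    return float(np.mean(losses))

if __name__ == "__main__":
    print(expected_loss(least_squares_learner, np.random.default_rng(0)))
```

Training would minimise this quantity over \( \theta \). The sketch only evaluates it for a fixed learner, which is enough to see that for a least-squares baseline all of the loss comes from the first two predictions, before the line is determined.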

Re-examining learning linear functions in context

Why Larger Language Models Do In-context Learning Differently?

How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations

In-Context Language Learning: Architectures and Algorithms

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

LLMs Are In-Context Reinforcement Learners

In-Context Learning Functions with Varying Number of Minima

Context-Scaling versus Task-Scaling in In-Context Learning

"In-Context Learning" or: How I learned to stop worrying and love "Applied Information Retrieval"

Do pretrained Transformers Learn In-Context by Gradient Descent?

Trained Transformers Learn Linear Models In-Context

Does learning the right latent variables necessarily improve in-context learning?

Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond

What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

Can In-context Learning Really Generalize to Out-of-distribution Tasks?

Provable In-Context Learning of Linear Systems and Linear Elliptic PDEs with Transformers

Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?

From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When

Exact Conversion of In-Context Learning to Model Weights in Linearized-Attention Transformers

The Developmental Landscape of In-Context Learning