Abstract: Neural collapse ($\mathcal{NC}$) is a phenomenon observed in classification tasks where top-layer representations collapse into their class means, which become equinorm, equiangular and aligned with the classifiers. These behaviors, which are associated with generalization and robustness, manifest under specific conditions: models are trained towards zero loss, with noise-free labels belonging to balanced classes, which do not outnumber the model's hidden dimension. Recent studies have explored $\mathcal{NC}$ in the absence of one or more of these conditions to extend and capitalize on the associated benefits of ideal geometries. Language modeling presents a curious frontier, as \textit{training by token prediction} constitutes a classification task where none of these conditions hold: the vocabulary is imbalanced and exceeds the embedding dimension; different tokens might correspond to similar contextual embeddings; and large language models (LLMs) in particular are typically trained for only a few epochs. This paper empirically investigates how scaling the architectures and training of causal language models (CLMs) affects their progression towards $\mathcal{NC}$. We find that $\mathcal{NC}$ properties that develop with scaling are linked to generalization. Moreover, there is evidence of some relationship between $\mathcal{NC}$ and generalization independent of scale. Our work therefore underscores the generality of $\mathcal{NC}$ as it extends to the novel and more challenging setting of language modeling. Downstream, we seek to inspire further research on the phenomenon to deepen our understanding of LLMs (and neural networks at large) and improve existing architectures based on $\mathcal{NC}$-related properties.
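The geometric properties named in the abstract (collapse into class means, equinorm means, equiangular means) can be made concrete with a few simple diagnostics. The following is a minimal sketch under stated assumptions, not the paper's actual measurement procedure: the function name `nc_metrics`, the choice of statistics, and the use of the plain feature mean as the global center are all illustrative choices made here. In a language model, `features` would be top-layer contextual embeddings and `labels` the next-token ids.

```python
import numpy as np

def nc_metrics(features, labels):
    """Simplified neural-collapse diagnostics (illustrative, hypothetical helper).

    features: (n_samples, dim) array of top-layer representations.
    labels:   (n_samples,) array of integer class ids.
    """
    classes = np.unique(labels)
    # Assumption: center with the mean over all samples. With imbalanced
    # classes one might prefer the mean of class means instead.
    global_mean = features.mean(axis=0)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = means - global_mean

    # Within-class variability relative to between-class spread
    # (tends to 0 as representations collapse onto their class means).
    within = np.mean([
        ((features[labels == c] - m) ** 2).sum(axis=1).mean()
        for c, m in zip(classes, means)
    ])
    between = (centered ** 2).sum(axis=1).mean()

    # Equinormness: coefficient of variation of the centered mean norms.
    norms = np.linalg.norm(centered, axis=1)

    # Equiangularity: spread of pairwise cosines between centered means.
    unit = centered / norms[:, None]
    cos = unit @ unit.T
    off_diag = cos[~np.eye(len(classes), dtype=bool)]

    return {
        "variability_ratio": within / between,
        "norm_cv": norms.std() / norms.mean(),
        "cosine_std": off_diag.std(),
        "cosine_mean": off_diag.mean(),
    }

# Usage on a perfectly collapsed toy example: class means form a simplex
# equiangular tight frame, and every sample equals its class mean.
C = 4
etf_means = np.eye(C) - np.full((C, C), 1.0 / C)
features = np.repeat(etf_means, 5, axis=0)
labels = np.repeat(np.arange(C), 5)
metrics = nc_metrics(features, labels)
# For this ideal geometry, variability_ratio, norm_cv and cosine_std are
# (numerically) zero and cosine_mean is close to -1 / (C - 1).
```

In the language-modeling setting the abstract describes, the vocabulary exceeds the embedding dimension, so a simplex arrangement of all class means is not attainable; the statistics above then only indicate how far the geometry is from the ideal case.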

Linguistic Collapse: Neural Collapse in (Large) Language Models