Abstract: Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where test accuracy starts improving long after the model reaches 100% training accuracy. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that grokking is specific neither to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near-zero test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is specific neither to neural architectures nor to gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.
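To make the idea behind the Fourier Multiplication Algorithm concrete, the following is a minimal sketch for modular addition, based on the standard convolution theorem rather than on the paper's own construction. The abstract does not state which operation or which exact formulation is used, so the choice of addition, the modulus `p = 7`, and the helper names `dft`, `one_hot`, and `fourier_mod_add` are all illustrative assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(p^2) discrete Fourier transform of a list of numbers."""
    p = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(p):
        s = sum(
            x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / p)
            for n in range(p)
        )
        # The inverse transform carries the 1/p normalization.
        out.append(s / p if inverse else s)
    return out


def one_hot(a, p):
    """One-hot encoding of the residue a modulo p (hypothetical helper)."""
    return [1.0 if i == a else 0.0 for i in range(p)]


def fourier_mod_add(a, b, p):
    """Compute (a + b) mod p by multiplying Fourier coefficients.

    Assumption: this is the textbook convolution-theorem version for
    addition, used here only as an illustration.
    """
    fa = dft(one_hot(a, p))
    fb = dft(one_hot(b, p))
    # Pointwise product in frequency space equals circular convolution
    # of the two one-hot vectors in the original space.
    prod = [u * v for u, v in zip(fa, fb)]
    y = dft(prod, inverse=True)
    # y is numerically the one-hot vector of (a + b) mod p.
    return max(range(p), key=lambda i: y[i].real)


p = 7
assert all(
    fourier_mod_add(a, b, p) == (a + b) % p
    for a in range(p)
    for b in range(p)
)
```

The link to circulant structure is that circular convolution with a fixed one-hot vector acts as a cyclic shift, which is a circulant matrix, and circulant matrices are diagonalized by the discrete Fourier transform.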

Average gradient outer product as a mechanism for deep neural collapse

Neural Collapse versus Low-rank Bias: Is Deep Neural Collapse Really Optimal?

Deep Neural Collapse Is Provably Optimal for the Deep Unconstrained Features Model

Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse

Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data

Beyond Unconstrained Features: Neural Collapse for Shallow Neural Networks with General Data

The Prevalence of Neural Collapse in Neural Multivariate Regression

The Persistence of Neural Collapse Despite Low-Rank Bias: An Analytic Perspective Through Unconstrained Features

Generalizing and Decoupling Neural Collapse Via Hyperspherical Uniformity Gap

Neural (Tangent Kernel) Collapse

Neural Collapse in the Intermediate Hidden Layers of Classification Neural Networks

Limitations of Neural Collapse for Understanding Generalization in Deep Learning

Low-Rank Learning by Design: the Role of Network Architecture and Activation Linearity in Gradient Rank Collapse

Emergence in non-neural models: grokking modular arithmetic via average gradient outer product

Perturbation Analysis of Neural Collapse

Prevalence of Neural Collapse during the terminal phase of deep learning training

The Exploration of Neural Collapse under Imbalanced Data

Towards Understanding Neural Collapse: The Effects of Batch Normalization and Weight Decay

An Unconstrained Layer-Peeled Perspective on Neural Collapse

Subdomain contraction in deep networks for robust representation learning

A Neural Collapse Perspective on Feature Evolution in Graph Neural Networks