Abstract: Human language is a unique form of communication in the natural world, distinguished by its structured nature. Most fundamentally, it is systematic, meaning that signals can be broken down into component parts that are individually meaningful (roughly, words), which are combined in a regular way to form sentences. Furthermore, the way in which these parts are combined maintains a kind of locality: words are usually concatenated together, and they form contiguous phrases, keeping related parts of sentences close to each other. We address the challenge of understanding how these basic properties of language arise from broader principles of efficient communication under information processing constraints. Here we show that natural-language-like systematicity arises in codes that are constrained by predictive information, a measure of the amount of information that must be extracted from the past of a sequence in order to predict its future. In simulations, we show that such codes approximately factorize their source distributions, and then express the resulting factors systematically and locally. Next, in a series of cross-linguistic corpus studies, we show that human languages are structured to have low predictive information at the levels of phonology, morphology, syntax, and semantics. Our result suggests that human language performs a sequential, discrete form of Independent Components Analysis on the statistical distribution over meanings that need to be expressed. It establishes a link between the statistical and algebraic structure of human language, and reinforces the idea that the structure of human language is shaped by communication under cognitive constraints.
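To make the notion of predictive information concrete, the following is a minimal sketch and not the estimator used in the paper. It assumes a toy source of two independent binary features, each expressed by two perfectly correlated letters, and uses a simplified proxy: the mutual information between prefix and suffix, summed over every cut point of a fixed-length string. All names (`mutual_information`, `summed_cut_information`, the letter inventories) are hypothetical and chosen for illustration only.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(counts):
    """Shannon entropy in bits of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def mutual_information(strings, cut):
    """I(prefix; suffix) in bits, treating each listed string as equally likely."""
    prefixes = Counter(s[:cut] for s in strings)
    suffixes = Counter(s[cut:] for s in strings)
    joint = Counter(strings)
    return entropy(prefixes) + entropy(suffixes) - entropy(joint)


def summed_cut_information(strings):
    """Proxy for predictive information: past/future information summed over cuts."""
    length = len(strings[0])
    return sum(mutual_information(strings, cut) for cut in range(1, length))


# Toy source (an assumption): two independent, uniform binary features.
meanings = list(product([0, 1], repeat=2))
A = {0: "xx", 1: "yy"}  # letters expressing the first feature
B = {0: "pp", 1: "qq"}  # letters expressing the second feature

# Local code: letters for the same feature are adjacent, e.g. "xxpp".
local_code = [A[a] + B[b] for a, b in meanings]
# Interleaved code: letters for the same feature are separated, e.g. "xpxp".
interleaved_code = [A[a][0] + B[b][0] + A[a][1] + B[b][1] for a, b in meanings]

print("local:", summed_cut_information(local_code))
print("interleaved:", summed_cut_information(interleaved_code))
```

Under these assumptions the local code carries 2 bits across its cut points and the interleaved code carries 4 bits, which illustrates why a constraint on predictive information would favor keeping related parts of a signal contiguous.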

Language Design as Information Renormalization

Rule-Based and Word-Level Statistics-Based Processing of Language: Insights from Neuroscience

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

Language is Physical

An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models

Representations as Language: An Information-Theoretic Framework for Interpretability

Meta-Designing Quantum Experiments with Language Models

Linguistic Structure from a Bottleneck on Sequential Information Processing

Reranking Laws for Language Generation: A Communication-Theoretic Perspective

A Mathematical Model for Linguistic Universals

Universal Complex Structures in Written Language

Language Model Evaluation Beyond Perplexity

Mathematical Structure of Syntactic Merge

Multilinear Grammar: Ranks and Interpretations

Beyond Zipf's law: Modeling the structure of human language

Vectoring Languages

Natural language syntax complies with the free-energy principle

A Random Matrix Approach to Language Acquisition

Quantum Physics and Human Language

Quantization Games on Social Networks and Language Evolution

Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution