Abstract: The formation of sentences is a highly structured and history-dependent process. The probability of using a specific word in a sentence strongly depends on the 'history' of word usage earlier in that sentence. We study a simple history-dependent model of text generation which assumes that, on average, the sample space of word usage shrinks as a sentence is formed. We first show that the model explains the approximate Zipf law found in word frequencies as a direct consequence of sample-space reduction. We then empirically quantify the amount of sample-space reduction in the sentences of ten famous English books by analysing the corresponding word-transition tables, which capture which words can follow any given word in a text. We find a highly nested structure in these transition tables and show that this 'nestedness' is tightly related to the power-law exponents of the observed word frequency distributions. With the proposed model it is possible to understand that the nestedness of a text can be the origin of the actual scaling exponent, and that deviations from the exact Zipf law can be understood as variations in the degree of nestedness on a book-by-book basis. On a theoretical level we are able to show that, in the case of weak nesting, Zipf's law breaks down in a fast transition. Unlike previous attempts to understand Zipf's law in language, the sample-space reducing model is not based on assumptions of multiplicative, preferential, or self-organised critical mechanisms behind language formation, but simply uses the empirically quantifiable parameter 'nestedness' to understand the statistics of word frequencies.
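The link between sample-space reduction and Zipf's law can be illustrated with a minimal simulation. The sketch below is not the model from the paper: it assumes the simplest noiseless variant, in which each "sentence" starts at the highest of `n_states` ranked states and repeatedly jumps to a uniformly chosen lower state until it reaches state 1. The function name `ssr_visit_counts` and all parameter values are hypothetical choices made for illustration only. Under these assumptions, the visit frequency of state i should come out roughly proportional to 1/i.

```python
import random
from collections import Counter


def ssr_visit_counts(n_states, n_runs, seed=0):
    """Count state visits in a toy sample-space reducing process.

    Assumption: every run starts at state n_states and jumps to a
    uniformly random strictly lower state until state 1 is reached.
    The starting state itself is not counted as a visit.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        state = n_states
        while state > 1:
            # The sample space shrinks: only states below the current one remain.
            state = rng.randint(1, state - 1)
            counts[state] += 1
    return counts


counts = ssr_visit_counts(n_states=100, n_runs=20000)

# Compare relative visit frequencies with the Zipf prediction 1/i.
for i in (1, 2, 4, 8):
    print(i, round(counts[i] / counts[1], 3), "expected approx.", round(1 / i, 3))
```

Every run ends in state 1, so `counts[1]` equals the number of runs, and the ratios `counts[i] / counts[1]` can be read directly against 1/i. The nestedness parameter discussed in the abstract is not modelled here.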

Exploring Regularity in Source Code: Software Science and Zipf's Law

Discovering power laws in computer programs

Zipf's Law Leads to Heaps' Law: Analyzing Their Relation in Finite-Size Systems

Deviation of Zipf's and Heaps' Laws in Human Languages with Limited Dictionary Sizes

Co-occurrence of the Benford-like and Zipf Laws Arising from the Texts Representing Human and Artificial Languages

Scaling Laws in Human Language

Software Libraries and Their Reuse: Entropy, Kolmogorov Complexity, and Zipf's Law

Beyond Zipf's law: Modeling the structure of human language

An Empirical Study of Class Sizes for Large Java Systems

The common patterns of abundance: the log series and Zipf's law

Optimal coding and the origins of Zipfian laws

A scaling law beyond Zipf's law and its relation to Heaps' law

Large-Scale Analysis of Zipf’s Law in English Texts

N-tuple Zipf Analysis and Modeling for Language, Computer Program and DNA

The Scale-Free Feature and Evolving Model of Large-Scale Software Systems

Understanding Zipf's law of word frequencies through sample-space collapse in sentence formation

Compression and the origins of Zipf's law of abbreviation

How scale affects structure in Java programs

Compression and the origins of Zipf's law for word frequencies

The Complexity Nature of Large-Scale Software Systems

An Empirical Investigation into a Large-Scale Java Open Source Code Repository