Big Data and Large Numbers. Interpreting Zipf's Law

Horia-Nicolai L. Teodorescu
2023-05-08
Abstract: It turns out that some empirical facts in Big Data are the effects of properties of large numbers. Zipf's law 'noise' is an example of such an artefact. We expose several properties of the power law distributions and of similar distributions that occur when the population is finite and the rank and counts of elements in the population are natural numbers. We are particularly concerned with the low-rank end of the graph of the law, the potential of noise in the law, and with the approximation of the number of types of objects at various ranks. Approximations instead of exact solutions are the center of attention. Consequences in the interpretation of Zipf's law are discussed.
Physics and Society, Computation and Language, Statistics Theory
What problem does this paper attempt to address?
The paper argues that some empirical facts observed in big data are consequences of the properties of large numbers, and that the "noise" seen in Zipf's law is one such artefact. The author examines power-law distributions, and similar distributions, for the case where the population is finite and both the ranks and the counts of elements are natural numbers. The focus is on the low-rank end of the graph of the law, on the potential for noise in the law, and on approximating the number of object types at various ranks. Approximate solutions, rather than exact ones, are the center of attention, and the consequences for interpreting Zipf's law are discussed.

Specifically, by analyzing power-law distributions such as Zipf's law over discrete variables, the paper explores the following aspects:

1. **Properties of power-law distributions**: The characteristics these distributions exhibit when ranks and counts are restricted to natural numbers.
2. **Noise at the low-rank end**: Why noise appears at the low-rank end of the power-law graph, and whether it really reflects uncertainty in the data.
3. **Approximation of the number of object types**: How to estimate the number of object types at different ranks.
4. **The impact of merging two power-law populations**: How the Zipf's law graph changes when two power-law populations with the same or different exponents are merged.
5. **Noise in power-law distributions**: How noise can be introduced into a power-law distribution, and how it affects ranks and counts.
6. **Hapax legomena and related indicators**: The significance and limitations of hapax legomena (words that appear only once) and of the Honoré, Sichel, and other indicators in text analysis.
Overall, the paper uses mathematical and statistical methods to clarify how power-law distributions behave in big data, and in particular where Zipf's law is effective and where it is limited in practical applications.
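The effect of restricting counts to natural numbers can be illustrated with a small sketch. It is not the paper's procedure: the function names, the parameter values, and the choice of rounding down with `math.floor` are all assumptions made for illustration. The sketch assigns each rank `r` the ideal count `top_count / r**exponent`, truncates it to an integer, and then tallies how many types end up sharing the same count.

```python
import math
from collections import Counter


def integer_zipf_counts(num_types, top_count, exponent=1.0):
    """Return integer counts for ranks 1..num_types.

    Assumption: the ideal count top_count / rank**exponent is rounded
    down, with a minimum of 1 so that every type occurs at least once.
    """
    return [
        max(1, math.floor(top_count / rank ** exponent))
        for rank in range(1, num_types + 1)
    ]


def types_per_count(counts):
    """Map each count value to the number of types that share it."""
    return Counter(counts)


# Hypothetical population: 1000 types, the most frequent seen 1000 times.
counts = integer_zipf_counts(num_types=1000, top_count=1000)
ties = types_per_count(counts)

print("counts of the first five ranks:", counts[:5])
print("types occurring exactly once:", ties[1])
print("types occurring exactly twice:", ties[2])
print("distinct count values:", len(ties))
```

Under these assumptions, many different ranks collapse onto the same small count, so the tail of a rank versus count plot becomes a staircase of ties rather than a smooth line. This is one simple way to see how irregularities in that region can come from integer arithmetic alone, without any randomness in the data.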