Abstract: By the method of Poissonization we confirm some existing results concerning consistent estimation of the structural distribution function in the situation of a large number of rare events. Inconsistency of the so-called natural estimator is proved. The method of grouping in cells of equal size is investigated and its consistency derived. A bound on the mean squared error is derived.
What problem does this paper attempt to address?
This paper addresses the estimation of the structural distribution function when there is a large number of rare events. Specifically, the authors study the estimation problem in the following asymptotic regime:
\[ n, M \to \infty \quad \text{and} \quad \frac{n}{M} \to \lambda, \quad \text{where} \quad 0 < \lambda < \infty. \]
Here, \( n \) represents the sample size (for example, the number of words in a text), \( M \) represents the number of categories (for example, the size of the vocabulary), and \( \lambda \) is a finite positive number. This setting in the context of linguistics means that both the text and the vocabulary are very large, and the size of the text is proportional to the size of the vocabulary.
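To make the "rare events" regime concrete, here is a minimal sketch, assuming the simplest possible case of equal category probabilities \( 1/M \) (an assumption chosen for illustration, not taken from the paper). With \( n/M = \lambda \) held fixed, each category has expected count \( \lambda \) no matter how large \( M \) is, so the share of unobserved categories stays near \( e^{-\lambda} \) instead of vanishing. The function name `empty_cell_fraction` is hypothetical.

```python
import math
import numpy as np

def empty_cell_fraction(M, lam, rng):
    """Draw n = lam * M observations over M equally likely categories
    and return the share of categories that were never observed."""
    n = int(lam * M)
    counts = rng.multinomial(n, np.full(M, 1.0 / M))
    return float(np.mean(counts == 0))

rng = np.random.default_rng(0)
lam = 1.0  # illustrative value of the limit n / M

# The sample grows with M, yet the per-category counts do not grow.
fractions = {M: empty_cell_fraction(M, lam, rng) for M in (1_000, 10_000, 100_000)}
for M, frac in fractions.items():
    print(f"M={M:>7}  empty share={frac:.3f}  exp(-lam)={math.exp(-lam):.3f}")
```

Because individual counts remain of order \( \lambda \), observed relative frequencies never become accurate estimates of the individual category probabilities, which is the root of the difficulties discussed in this summary.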
### Main problems
1. **Inconsistency of the natural estimator**:
- The paper first proves that the so-called natural estimator is inconsistent. The natural estimator is built from the observed frequency of each category, but in the regime above it does not converge to the true structural distribution function \( F \).
- Specifically, the natural estimator \( \hat{F}_M(x) \) converges to \( F_{Y/\lambda}(x) \), the distribution function of \( Y/\lambda \), which differs from \( F(x) \). Here \( Z \) has distribution function \( F \), and the conditional distribution of \( Y \) given \( Z = z \) is \( \text{Poisson}(\lambda z) \).
2. **Consistency of the grouping method**:
- To overcome the inconsistency of the natural estimator, the authors use a grouping method: the \( M \) categories are divided into \( m \) groups of \( k \) categories each (i.e., \( M = km \)). Pooling the counts within a group replaces many sparse category counts with fewer, larger group counts, which is what makes consistent estimation possible.
- The grouped estimator \( \hat{F}_m(x) \) is proven to be consistent, that is, under appropriate conditions it converges to the true structural distribution function \( F(x) \).
3. **Bounds on the mean squared error**:
- The paper also derives upper bounds on the mean squared error (MSE) of the grouped estimator. These bounds depend on the number of groups \( m \), and the bound can be minimized by choosing \( m \) appropriately.
- Specifically, when \( m \gg n^{1/3} \), the upper bound on the mean squared error is:
\[
\text{MSE}(\hat{F}_m(x)) \leq \frac{9}{4\pi^2} \left( \frac{24\tau}{6\pi^3} \right)^{2/5} n^{-2/5} + o(n^{-2/5}),
\]
and when \( m \ll n^{1/3} \), the upper bound on the mean squared error is:
\[
\text{MSE}(\hat{F}_m(x)) \leq \frac{1}{4m} + o\left( \frac{1}{m} \right).
\]
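The contrast between the two estimators can be sketched in a small simulation. Everything specific here is an assumption made for illustration and is not taken from the paper: category probabilities \( p_i \propto f(i/M) \) with \( f(u) = 2u \), so that \( F(x) = x/2 \) on \( [0, 2] \); independent Poisson counts (the Poissonized model); groups formed from adjacent categories; and the estimator forms \( \frac{1}{M}\sum_i \mathbf{1}\{ (M/n)\,\nu_i \le x \} \) and \( \frac{1}{m}\sum_j \mathbf{1}\{ (m/n)\,\bar{\nu}_j \le x \} \), where \( \bar{\nu}_j \) is the total count in group \( j \). The function names are hypothetical.

```python
import numpy as np

def simulate_counts(M, lam, rng):
    """Poissonized sample: independent counts nu_i ~ Poisson(n * p_i),
    with the assumed probabilities p_i = f(u_i) / M, f(u) = 2u."""
    n = int(lam * M)
    u = (np.arange(M) + 0.5) / M   # midpoints of the M cells in [0, 1]
    p = 2.0 * u / M                # sums to 1 over the M cells
    return n, rng.poisson(n * p)

def natural_estimator(counts, n, x):
    """Share of categories whose rescaled count (M / n) * nu_i is <= x."""
    M = counts.size
    return float(np.mean(M * counts / n <= x))

def grouped_estimator(counts, n, m, x):
    """Pool adjacent categories into m equal groups, then take the share
    of groups whose rescaled count (m / n) * group_count is <= x."""
    group_counts = counts.reshape(m, -1).sum(axis=1)  # needs m to divide M
    return float(np.mean(m * group_counts / n <= x))

rng = np.random.default_rng(1)
M, lam, m, x = 1_000_000, 1.0, 1_000, 0.5   # illustrative values only
n, counts = simulate_counts(M, lam, rng)

true_F = x / 2.0                            # F(x) = x / 2 for this f
nat = natural_estimator(counts, n, x)
grp = grouped_estimator(counts, n, m, x)
print(f"true F(x)={true_F:.3f}  natural={nat:.3f}  grouped={grp:.3f}")
```

In this setup the natural estimator at \( x = 0.5 \) simply reports the share of empty categories, which settles near \( \int_0^1 e^{-2u}\,du \approx 0.43 \) rather than \( F(0.5) = 0.25 \), while the grouped estimator lands close to \( 0.25 \). Grouping adjacent categories only helps here because the assumed probabilities vary smoothly with the category index.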
### Summary
The main contributions of this paper are:
- Proving the inconsistency of the natural estimator;
- Proposing a grouping method and proving its consistency;
- Deriving upper bounds on the mean squared error of the grouped estimator and indicating how the number of groups should be chosen.
These results are relevant to problems with a large number of rare events, such as word frequencies in linguistics, where they show how the structural distribution function can be estimated consistently.