Fixed points fix estimates: The accuracy of an effect size estimate is described by its sample’s conditional algorithmic information, as computed across rank permutations
Michael Cader Nelson
DOI: https://doi.org/10.31234/osf.io/wqnbu
2021-11-23
Abstract: Every statistical estimate is equal to the sum of a nonrandom component, due to parameter values and bias, and a random component, due to sampling error. Estimation theory suggests that the two components are hopelessly confounded in the estimate. We would like to estimate the sign and magnitude of a statistic’s random deviation from its parameter--its accuracy--in the same way we quantify a statistic’s random variability around its parameter--its precision--by estimating the standard error. However, because the random component is an attribute of the sample data, it cannot be described with parametric or Fisher information. In information theory, on the other hand, every information type--entropy, complexity--is understood as describing the extent of randomness in manifest data. This suggests that integrating the two conceptions of information could allow us to describe the two components of a statistical estimate, if only we could identify a common link between the two paradigms.
The matching statistic, m, is such a link. For paired, ranked vectors X and Y of length n, m is the total number of paired observations in X and Y with matching ranks, m = Σ_i 1[R(X_i) = R(Y_i)], where R(·) denotes the rank of an observation within its vector and 1[·] equals 1 when the ranks match and 0 otherwise. That is, m is the number of fixed points between vectors. m has a long history in statistics, having served as the test statistic of a little-known null hypothesis statistical test (NHST) for the correlation coefficient, dating to around the turn of the twentieth century, called the matching method. Subtracting m from n yields a metric with a long history in information theory, the Hamming distance, a classic metric of the conditional complexity K(Y|X). Thus, m simultaneously contains both the Fisher information in a bivariate sample about the latent correlation and the conditional complexity or algorithmic information about the manifest observations.
This paper shows that the presence of these two conflicting information types in m manifests a peculiar attribute in the statistic: m has an asymptotic efficiency less than or equal to zero relative to conventional correlation estimators computed on the same data. This means its Fisher information content decreases with increasing sample size, so that m’s random component is disproportionately large. Furthermore, when m and Pearson’s r are computed on the same sample, the two share a random component, and the value of m is indicative of the accuracy of r with respect to that component. Having proven this utility of m, by means theoretical and empirical (Monte Carlo simulations), additional matching statistics are constructed, including one composite statistic that is even more informative of the accuracy of r, and another that is indicative of the accuracy of Cohen’s d. Potential applications for computing accuracy-adjusted r are described, and implications are discussed.
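As an illustration of the definition of m given in the abstract, the following is a minimal Python sketch, not the author's implementation. The function names (`ranks`, `matching_statistic`, `hamming_distance`) and the sample data are hypothetical, and the sketch assumes there are no tied values within either vector.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties; an assumption of this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def matching_statistic(x, y):
    """Count the pairs whose rank in x equals their rank in y (the fixed points)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(1 for rx, ry in zip(ranks(x), ranks(y)) if rx == ry)


def hamming_distance(x, y):
    """n - m: the number of positions at which the two rank vectors differ."""
    return len(x) - matching_statistic(x, y)


# Hypothetical paired sample, for illustration only.
x = [2.1, 3.5, 1.0, 4.8, 2.9]  # ranks: 2, 4, 1, 5, 3
y = [1.9, 4.0, 0.5, 3.7, 5.2]  # ranks: 2, 4, 1, 3, 5

m = matching_statistic(x, y)    # first three pairs share a rank, so m = 3
d = hamming_distance(x, y)      # n - m = 5 - 3 = 2
```

Here the first three pairs have matching ranks, so m = 3 and the Hamming distance between the rank vectors is n - m = 2.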