Multivariate, Heteroscedastic Empirical Bayes via Nonparametric Maximum Likelihood

Jake A. Soloff, Adityanand Guntuboyina, Bodhisattva Sen
2023-12-30
Abstract: Multivariate, heteroscedastic errors complicate statistical inference in many large-scale denoising problems. Empirical Bayes is attractive in such settings, but standard parametric approaches rest on assumptions about the form of the prior distribution which can be hard to justify and which introduce unnecessary tuning parameters. We extend the nonparametric maximum likelihood estimator (NPMLE) for Gaussian location mixture densities to allow for multivariate, heteroscedastic errors. NPMLEs estimate an arbitrary prior by solving an infinite-dimensional, convex optimization problem; we show that this convex optimization problem can be tractably approximated by a finite-dimensional version. The empirical Bayes posterior means based on an NPMLE have low regret, meaning they closely target the oracle posterior means one would compute with the true prior in hand. We prove an oracle inequality implying that the empirical Bayes estimator performs at nearly the optimal level (up to logarithmic factors) for denoising without prior knowledge. We provide finite-sample bounds on the average Hellinger accuracy of an NPMLE for estimating the marginal densities of the observations. We also demonstrate the adaptive and nearly optimal properties of NPMLEs for deconvolution. We apply our method to two denoising problems in astronomy, constructing a fully data-driven color-magnitude diagram of 1.4 million stars in the Milky Way and investigating the distribution of 19 chemical abundance ratios for 27 thousand stars in the red clump. We also apply our method to hierarchical linear models, illustrating the advantages of nonparametric shrinkage of regression coefficients on an education data set and on a microarray data set.
Statistics Theory
What problem does this paper attempt to address?
This paper addresses statistical inference under multivariate, heteroscedastic errors, particularly in large-scale denoising problems. It studies how nonparametric maximum likelihood estimation (NPMLE) can handle multivariate, heteroscedastic data without assuming a form for the prior distribution. Traditional parametric methods rely on specific assumptions about the prior, which can be hard to justify and which introduce unnecessary tuning parameters. By extending the NPMLE to allow for multivariate, heteroscedastic errors, the paper provides a way to denoise without knowing the prior in advance.

### Main contributions of the paper include:
1. **Extension of NPMLE**: The paper extends the NPMLE from the one-dimensional, homoscedastic case to the multivariate, heteroscedastic case, so that it applies to a wider range of statistical problems, especially those with complex error structures.
2. **Theoretical guarantees**: The authors prove several properties of the NPMLE in the multivariate, heteroscedastic setting, including conditions for existence, uniqueness, and non-uniqueness. They also give finite-sample risk bounds for the NPMLE in density estimation, denoising, and deconvolution.
3. **Practical applications**: The paper applies the method to astronomical data, constructing a color-magnitude diagram (CMD) of 1.4 million stars in the Milky Way and analyzing 19 chemical abundance ratios for 27 thousand stars in the red clump. It also applies the method to hierarchical linear models, using an education data set and a microarray data set.
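
The setting summarized above can be written compactly. The following is a sketch of the standard formulation consistent with the abstract; the symbols ($X_i$, $\theta_i$, $\Sigma_i$, $G^*$, $\varphi_{\Sigma}$) are notation assumed here for illustration and may differ from the paper's own. Each observation $X_i \in \mathbb{R}^d$ has a known error covariance $\Sigma_i$, the unknown prior is $G^*$, and $\varphi_{\Sigma}$ denotes the $\mathcal{N}(0, \Sigma)$ density.

```latex
\begin{align*}
X_i \mid \theta_i &\sim \mathcal{N}(\theta_i, \Sigma_i) \ \text{independently}, \qquad
\theta_i \overset{\text{iid}}{\sim} G^*, \qquad i = 1, \dots, n, \\
f_{G, \Sigma_i}(x) &= \int \varphi_{\Sigma_i}(x - \theta)\, dG(\theta), \\
\hat{G} &\in \operatorname*{arg\,max}_{G} \ \frac{1}{n} \sum_{i=1}^{n} \log f_{G, \Sigma_i}(X_i), \\
\hat{\theta}_i &= \frac{\int \theta\, \varphi_{\Sigma_i}(X_i - \theta)\, d\hat{G}(\theta)}{f_{\hat{G}, \Sigma_i}(X_i)}.
\end{align*}
```

The last line is the empirical Bayes posterior mean, that is, the oracle posterior mean with $G^*$ replaced by $\hat{G}$. If $G$ is restricted to a fixed finite set of atoms, $G = \sum_j w_j \delta_{a_j}$, the objective becomes a concave function of the weights $w$ on the probability simplex, which is the sense in which the infinite-dimensional problem admits a finite-dimensional approximation.
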
### Specific problems addressed:
- **Multivariate, heteroscedastic errors**: The paper handles heteroscedastic errors in multivariate data, a common challenge in practice. In astronomy, for example, each observation typically comes with its own known measurement error distribution, and these distributions differ across observations.
- **Nonparametric prior estimation**: The method makes no specific assumptions about the prior distribution and instead estimates it directly from the data via the NPMLE. This makes the method flexible enough to accommodate complex prior structures.
- **Efficient computational methods**: The paper proposes methods for computing the NPMLE, including a gridding method and an exemplar method, which make the optimization finite-dimensional.

### Conclusion:
Through theoretical analysis and practical applications, the paper shows that the NPMLE is well suited to multivariate, heteroscedastic data. The method has strong theoretical guarantees and also performs well in practice, notably on the astronomy data and in the hierarchical linear models for the education and microarray data sets.
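
The gridding idea mentioned above can be illustrated with a minimal sketch, under several assumptions that are not taken from the paper: the prior's support is fixed to a regular grid over the range of the data, and the mixing weights are fitted with plain EM (fixed-point) iterations rather than whatever convex solver the authors use. The function names `make_grid`, `gaussian_likelihoods`, `fit_weights_em`, and `posterior_means` are hypothetical, and the simulated two-point prior exists only for the demo.

```python
import numpy as np


def make_grid(X, points_per_axis):
    """Regular grid of candidate atoms spanning the range of the data."""
    axes = [np.linspace(X[:, k].min(), X[:, k].max(), points_per_axis)
            for k in range(X.shape[1])]
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack([g.ravel() for g in mesh], axis=1)  # (m, d)


def gaussian_likelihoods(X, Sigmas, atoms):
    """L[i, j] = N(X[i]; atoms[j], Sigmas[i]), one covariance per observation."""
    d = X.shape[1]
    diffs = X[:, None, :] - atoms[None, :, :]            # (n, m, d)
    prec = np.linalg.inv(Sigmas)                         # (n, d, d)
    quad = np.einsum("imd,ide,ime->im", diffs, prec, diffs)
    _, logdet = np.linalg.slogdet(Sigmas)                # (n,)
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + logdet)
    return np.exp(log_norm[:, None] - 0.5 * quad)


def fit_weights_em(L, n_iter=500):
    """EM updates for mixing weights on fixed atoms (assumed solver, for simplicity)."""
    m = L.shape[1]
    w = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        mix = L @ w                                      # marginal density at each X[i]
        w = w * (L / mix[:, None]).mean(axis=0)          # weights stay on the simplex
    return w


def posterior_means(L, w, atoms):
    """Empirical Bayes posterior means under the fitted discrete prior."""
    post = L * w
    post /= post.sum(axis=1, keepdims=True)              # posterior over atoms, per row
    return post @ atoms


def demo(seed=0, n=300, d=2):
    rng = np.random.default_rng(seed)
    # Hypothetical two-point prior, used only to simulate data.
    theta = rng.choice([-2.0, 2.0], size=(n, 1)) * np.ones((1, d))
    scales = rng.uniform(0.5, 1.5, size=(n, d))          # heteroscedastic noise levels
    Sigmas = np.stack([np.diag(s ** 2) for s in scales])
    X = theta + rng.normal(size=(n, d)) * scales

    atoms = make_grid(X, points_per_axis=15)
    L = gaussian_likelihoods(X, Sigmas, atoms)
    w = fit_weights_em(L)
    theta_hat = posterior_means(L, w, atoms)

    print("MSE of raw observations:", np.mean((X - theta) ** 2))
    print("MSE of posterior means: ", np.mean((theta_hat - theta) ** 2))


if __name__ == "__main__":
    demo()
```

A grid grows exponentially with the dimension, so this sketch is only practical for small `d`. Using the observations themselves as atoms is one alternative that keeps the number of atoms at `n`.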