Abstract: Compositional data, which are data consisting of fractions or probabilities, are common in many fields including ecology, economics, physical science and political science. If these data would otherwise be normally distributed, their spread can be conveniently represented by a multivariate normal distribution truncated to the non-negative space under a unit simplex. Here this distribution is called the simplex-truncated multivariate normal distribution. For calculations on truncated distributions, it is often useful to obtain rapid estimates of their integral, mean and covariance; these quantities generally differ from those of the corresponding non-truncated distribution. In this paper, three approaches that can estimate the integral, mean and covariance of any simplex-truncated multivariate normal distribution are described and compared. These approaches are (1) naive rejection sampling, (2) a method described by Gessner et al. that unifies subset simulation and the Holmes-Diaconis-Ross algorithm with an analytical version of elliptical slice sampling, and (3) a semi-analytical method that expresses the integral, mean and covariance in terms of integrals of hyperrectangularly-truncated multivariate normal distributions, which are readily computed in modern mathematical and statistical packages. Strong agreement is demonstrated between all three approaches, but the most computationally efficient approach depends strongly on both implementation details and the dimension of the simplex-truncated multivariate normal distribution. For low-dimensional distributions, the semi-analytical method is fast and thus should be considered. As the dimension increases, the Gessner et al. method becomes the only practically efficient approach of the methods tested here.
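
As an illustration of approach (1), naive rejection sampling can be sketched in a few lines of Python with NumPy. This is a minimal sketch rather than the paper's implementation. It assumes the truncation region is the set of points with all coordinates non-negative and coordinate sum at most one, and the function name `simplex_truncated_mvn_moments` and its parameters are hypothetical. The integral is estimated as the fraction of untruncated draws that fall inside the region, and the mean and covariance are estimated from the accepted draws.

```python
import numpy as np


def simplex_truncated_mvn_moments(mu, sigma, n_samples=200_000, seed=0):
    """Estimate integral, mean and covariance by naive rejection sampling.

    Assumed region (not taken from the paper's code): x_i >= 0 for all i
    and sum(x) <= 1.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # Draw from the untruncated multivariate normal distribution.
    draws = rng.multivariate_normal(mu, sigma, size=n_samples)

    # Keep only the draws that lie in the truncation region.
    inside = np.all(draws >= 0.0, axis=1) & (draws.sum(axis=1) <= 1.0)
    accepted = draws[inside]
    if accepted.shape[0] < 2:
        raise ValueError("too few accepted samples; increase n_samples")

    # Acceptance fraction estimates the probability mass inside the region.
    integral = inside.mean()
    mean = accepted.mean(axis=0)
    cov = np.atleast_2d(np.cov(accepted, rowvar=False))
    return integral, mean, cov


if __name__ == "__main__":
    # Hypothetical two-dimensional example values.
    integral, mean, cov = simplex_truncated_mvn_moments(
        mu=[0.3, 0.3], sigma=[[0.04, 0.0], [0.0, 0.04]]
    )
    print("integral:", integral)
    print("mean:", mean)
    print("covariance:")
    print(cov)
```

The acceptance fraction shrinks as less of the probability mass lies inside the region, so the number of accepted draws can become too small to give usable estimates, which is consistent with the abstract's remark that efficiency depends strongly on dimension.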

Statistical Query Lower Bounds for Learning Truncated Gaussians

SQ Lower Bounds for Learning Bounded Covariance GMMs

Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians

SQ Lower Bounds for Non-Gaussian Component Analysis with Weaker Assumptions

Detecting Low-Degree Truncation

SQ Lower Bounds for Learning Mixtures of Linear Classifiers

Non-Stochastic CDF Estimation Using Threshold Queries

Query lower bounds for log-concave sampling

Better and Simpler Lower Bounds for Differentially Private Statistical Estimation

A Non-Parametric Shrinkage Mean Estimation for Arbitrary Quadratic Loss Functions and Unknown Covariance Matrices

Testing Convex Truncation

Gaussian universality for approximately polynomial functions of high-dimensional data

Robust Sparse Mean Estimation via Sum of Squares

Slow rates of approximation of U-statistics and V-statistics by quadratic forms of Gaussians

Learning multivariate Gaussians with imperfect advice

Integral, mean and covariance of the simplex-truncated multivariate normal distribution

Computational-Statistical Gaps for Improper Learning in Sparse Linear Regression

Tight Bounds for Local Glivenko-Cantelli

Efficient Parameter Estimation of Truncated Boolean Product Distributions

Exactly Tight Information-Theoretic Generalization Error Bound for the Quadratic Gaussian Problem

Locally Private Gaussian Estimation