Determining the Sampling Size with Maintaining the Probability Distribution.

Jiaoyun Yang, Zhenyu Ren, Junda Wang, Lian Li
DOI: https://doi.org/10.1007/978-981-19-8152-4_4
2022-01-01
Abstract: Sampling is a fundamental method in data science that reduces dataset size and lowers computational complexity. A basic requirement is identically distributed sampling, meaning the sample must maintain the probability distribution of the original data. Numerous sampling methods have been proposed. However, how to estimate the sampling boundary under the constraint of the probability distribution is still unclear. In this paper, we formulate a Probably Approximately Correct (PAC) problem for sampling, which limits the distribution difference to within a given error boundary at a given confidence level. We further apply Hoeffding’s inequality to estimate the sampling size by decomposing the joint probability distribution into conditional distributions based on Bayesian networks. In the experiments, we simulate 5 Bayesian datasets of size 1,000,000 and report the sampling size for different error boundaries and confidence levels. When the error boundary is 0.05 and the confidence level is 0.99, at least 80% of the samples could be excluded according to the estimated sampling size.
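To make the role of Hoeffding's inequality concrete, the sketch below computes the textbook sample size for estimating a single probability to within an error boundary ε at confidence level 1 − δ, namely n ≥ ln(2/δ) / (2ε²). This is not the paper's procedure: the abstract does not give the formula obtained after decomposing the joint distribution over a Bayesian network. The function name `hoeffding_sample_size` and the `num_estimates` parameter (a plain union bound over several probabilities estimated at once) are assumptions introduced here for illustration only.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float, num_estimates: int = 1) -> int:
    """Smallest n with 2 * num_estimates * exp(-2 * n * epsilon**2) <= delta.

    epsilon       -- error boundary on each estimated probability
    delta         -- allowed failure probability (1 - confidence level)
    num_estimates -- hypothetical: number of probabilities bounded
                     simultaneously via a union bound (an assumption,
                     not the decomposition used in the paper)
    """
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must be in (0, 1)")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    if num_estimates < 1:
        raise ValueError("num_estimates must be at least 1")
    # Solve the Hoeffding tail bound for n and round up to an integer.
    return math.ceil(math.log(2 * num_estimates / delta) / (2 * epsilon ** 2))


if __name__ == "__main__":
    # Error boundary 0.05 and confidence level 0.99 (delta = 0.01).
    print(hoeffding_sample_size(0.05, 0.01))
    # Same setting with ten probabilities bounded at once.
    print(hoeffding_sample_size(0.05, 0.01, num_estimates=10))
```

The bound depends only on ε and δ, not on the dataset size, which is consistent with large datasets allowing most records to be excluded. The sampling sizes reported in the paper come from its own Bayesian network decomposition and will differ from this single-probability figure.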