Subdata selection based on orthogonal array for big data

Min Ren, Sheng-Li Zhao
DOI: https://doi.org/10.1080/03610926.2021.2012196
2021-12-13
Abstract: Many branches of contemporary science generate large amounts of data. Because of limits on computation time and cost, traditional statistical methods are no longer applicable to large data sets. For a very large data set containing N points, an effective approach is to extract n (n ≪ N) points for analysis, so that the n subsampled points represent the full sample as well as possible and little of the information in the full data is lost. This requires an algorithm for selecting the sample points. Orthogonal subsampling for big data, based on two-level orthogonal arrays, is a popular approach. Building on the projection properties of orthogonal arrays, this paper defines a new discrepancy function to evaluate the quality of the selected subdata and proposes three algorithms for selecting subdata in different situations. Simulation studies show that the new algorithms have higher A-efficiency and D-efficiency and perform well in minimizing the mean squared errors of the estimated parameters.
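To make the A-efficiency and D-efficiency criteria concrete, here is a minimal sketch of how two candidate subsamples might be compared under a first-order linear model with intercept. It is not the paper's algorithm or its discrepancy function; the function names and the exact efficiency definitions (ratios taken relative to a reference subsample) are assumptions chosen for illustration.

```python
import numpy as np

def information_matrix(X):
    """Return Z^T Z for a linear model with intercept, where Z = [1, X]."""
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([np.ones(X.shape[0]), X])
    return Z.T @ Z

def relative_efficiencies(X_sub, X_ref):
    """Relative D- and A-efficiency of X_sub with respect to X_ref.

    Assumed definitions (illustrative, not taken from the paper):
      D-eff = (det(M_sub) / det(M_ref)) ** (1 / p)
      A-eff = trace(M_ref^{-1}) / trace(M_sub^{-1})
    where p is the number of model parameters.
    """
    M_sub = information_matrix(X_sub)
    M_ref = information_matrix(X_ref)
    p = M_sub.shape[0]
    d_eff = (np.linalg.det(M_sub) / np.linalg.det(M_ref)) ** (1.0 / p)
    a_eff = np.trace(np.linalg.inv(M_ref)) / np.trace(np.linalg.inv(M_sub))
    return d_eff, a_eff

# Reference: the four corner points of [-1, 1]^2, which form a
# two-level orthogonal array (a full 2^2 factorial).
corners = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])

# Competitor: the same pattern shrunk toward the center.
shrunk = 0.5 * corners

d_eff, a_eff = relative_efficiencies(shrunk, corners)
print(d_eff, a_eff)
```

In this toy case the shrunk points have a D-efficiency of about 0.40 and an A-efficiency of about 0.33 relative to the corner points, which illustrates why subdata resembling a two-level orthogonal array at the extremes of the covariate range carry more information about the model parameters.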