A Dataset Representativeness Metric and A Slicing Sampling Strategy for the Kennard-Stone Algorithm

Qingying Wu, Zhenyu Zhu, Jianming Wu, Xu Xin
DOI: https://doi.org/10.7503/cjcu20220397
2022-01-01
Abstract: In machine learning with big data, it is essential to prepare a representative dataset for training a model. The Kennard-Stone (KS) algorithm and its derivatives form a large class of excellent dataset splitting methods. However, they rely heavily on empirical selection or modeling results to determine the sampling ratio and the number of samples. In addition, the computational complexity of the KS algorithm is O(K^3) according to the original literature, which makes it difficult to apply to massive data. In this paper, we design a metric based on dataset completeness to quantify how representative an extracted subset is of the whole dataset. An amendment based on dynamic programming is proposed to reduce the algorithm complexity to O(K^2). Furthermore, a slicing sampling strategy is proposed that divides the whole dataset into several subsets and applies KS sampling to each of them, which further reduces the complexity to O(K). Partial least squares regression test results show that the method improves sampling efficiency while still ensuring the representativeness of the final extracted dataset.
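The two efficiency ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `kennard_stone` and `sliced_kennard_stone`, the random partition into slices, and the even split of the sample quota across slices are all assumptions made for illustration. The first function keeps a running minimum distance to the selected set instead of recomputing it at every step, which is one common way to avoid the cubic cost; the paper's dynamic programming amendment may differ in detail.

```python
import numpy as np


def kennard_stone(X, k):
    """Return k row indices of X chosen by the Kennard-Stone max-min rule."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of rows")
    # Full pairwise Euclidean distance matrix (O(n^2) memory).
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Start from the two mutually farthest points.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    # Running minimum distance from every point to the selected set.
    min_dist = np.minimum(d[i], d[j])
    min_dist[selected] = -np.inf  # never re-select a chosen point
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Only the newly added point can lower the running minimum.
        min_dist = np.minimum(min_dist, d[nxt])
        min_dist[selected] = -np.inf
    return selected


def sliced_kennard_stone(X, k, n_slices, seed=0):
    """Hypothetical slicing variant: run KS inside each random slice."""
    X = np.asarray(X, dtype=float)
    if k > X.shape[0] or k < 2 * n_slices:
        raise ValueError("need 2 * n_slices <= k <= number of rows")
    rng = np.random.default_rng(seed)
    # Assumption: slices are a random partition of the row indices.
    slices = np.array_split(rng.permutation(X.shape[0]), n_slices)
    # Assumption: the quota k is split as evenly as possible across slices.
    quotas = [len(q) for q in np.array_split(np.arange(k), n_slices)]
    picked = []
    for idx, quota in zip(slices, quotas):
        local = kennard_stone(X[idx], quota)
        picked.extend(int(idx[m]) for m in local)  # map back to global rows
    return picked
```

Because each slice is processed independently, the distance matrix built per call shrinks with the slice size, which is where the saving comes from in this sketch. Whether the selected points remain representative of the whole dataset is exactly what a metric such as the one proposed in the paper is meant to check.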