Noisy Data Collection Towards Diversity Maximization

Xiang Xiao, Lan Zhang, Xiang-Yang Li
DOI: https://doi.org/10.1109/bigcom.2019.00048
2019-01-01
Abstract: In the big data era, machine learning, knowledge discovery, and related techniques let us take advantage of large datasets. A high-quality dataset improves the performance of these applications. Dataset quality can be assessed along many dimensions, such as completeness, consistency, and diversity. In this paper, we investigate diversity-driven data collection. Most previous work on data collection considers only accurate, noise-free data. However, our surveys and experiments show that data generally carries noise that follows a certain probability distribution. Based on this finding, we take the distribution of data noise in the data space into account. We construct a comprehensive model to calculate the probability density function (PDF) of the distance between two noisy data points. Using the mainstream diversity metric, i.e., the average distance, we propose a data collection method that is more time-efficient than the existing generic greedy algorithm.
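For orientation, the average-distance metric and a generic greedy baseline of the kind the abstract compares against can be sketched as follows. This is a minimal illustration under assumptions, not the authors' method: it uses Euclidean distance on noise-free points, seeds the selection with the farthest pair, and the function names `average_distance` and `greedy_select` are hypothetical.

```python
import math
from itertools import combinations


def average_distance(points):
    """Mean Euclidean distance over all unordered pairs (0.0 for fewer than two points)."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def greedy_select(candidates, k):
    """Generic greedy baseline (assumed form): grow a set of k points by
    repeatedly adding the candidate with the largest total distance to the
    points already selected. For a fixed set size, this is the candidate
    that maximizes the average pairwise distance of the enlarged set."""
    remaining = list(candidates)
    if k <= 0 or not remaining:
        return []
    if k == 1 or len(remaining) == 1:
        return [remaining[0]]
    # Seed with the farthest pair (an assumption; other seeds are possible).
    first, second = max(combinations(remaining, 2), key=lambda pq: math.dist(*pq))
    selected = [first, second]
    remaining.remove(first)
    remaining.remove(second)
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda c: sum(math.dist(c, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    candidates = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 0)]
    chosen = greedy_select(candidates, 3)
    print(chosen, average_distance(chosen))
```

Each greedy step rescans all remaining candidates against the selected set, which is the cost a faster collection method would aim to reduce. The sketch also treats every point as exact; it does not model the noise distribution that the paper accounts for.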