An Algorithmic View of Streaming Submodular Data Summarization with A Knapsack Constraint

Enpei Zhang, Kai Han, Benwei Wu
DOI: https://doi.org/10.1109/dsit55514.2022.9943820
2022-01-01
Abstract: Data summarization, which extracts a representative subset (i.e., a data summary) from a massive data set, is often used in big data processing. A good summary not only significantly reduces information redundancy but also provides a better understanding of the original data. The utility function used to evaluate the quality of a summary usually has a natural diminishing-returns property, also known as submodularity. Due to the rapid growth of data scale, traditional offline processing increasingly struggles to handle massive data, and streaming methods that require less space have started to attract attention, leading to many related studies. In this paper, we first give an algorithmic overview of the methods widely used in streaming submodular maximization with a knapsack constraint. After analyzing the ideas behind them, we propose a new algorithm, called RSStream, for the same problem. RSStream is an innovative combination of the traditional sieve approach, the multi-candidate set method, and an augmentation strategy with data sampling. It achieves the state-of-the-art approximation ratio with near-linear time and space complexity. Finally, we run our algorithm on two real data summarization applications to demonstrate its effectiveness and efficiency.
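To make the terms in the abstract concrete, the following is a minimal Python sketch of a single-threshold streaming rule under a knapsack budget, using set coverage as a stand-in submodular utility. It is not RSStream and is not taken from the paper; it only illustrates the kind of density-threshold test that sieve-style methods are generally built on. All names (`coverage`, `threshold_stream`, `covers`, `cost`, `tau`) and the toy data are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def coverage(summary: List[str], covers: Dict[str, Set[int]]) -> int:
    """Toy monotone submodular utility: number of distinct topics covered."""
    covered: Set[int] = set()
    for item in summary:
        covered |= covers[item]
    return len(covered)


def threshold_stream(
    stream: Iterable[str],
    covers: Dict[str, Set[int]],
    cost: Dict[str, float],
    budget: float,
    tau: float,
) -> Tuple[List[str], int]:
    """One pass over the stream with a fixed density threshold tau.

    An item is kept only if it still fits in the budget and its marginal
    gain per unit cost is at least tau. Rejected items are never revisited.
    """
    summary: List[str] = []
    spent = 0.0
    value = 0
    for item in stream:
        if spent + cost[item] > budget:
            continue  # would violate the knapsack constraint
        gain = coverage(summary + [item], covers) - value
        if gain / cost[item] >= tau:
            summary.append(item)
            spent += cost[item]
            value += gain
    return summary, value


# Hypothetical toy instance: items cover topic ids and have costs.
covers = {"a": {1, 2, 3}, "b": {2, 3}, "c": {4}, "d": {1, 4, 5, 6}}
cost = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 3.0}

# Diminishing returns: "b" adds less once "a" is already in the summary.
gain_b_alone = coverage(["b"], covers) - coverage([], covers)
gain_b_after_a = coverage(["a", "b"], covers) - coverage(["a"], covers)

result = threshold_stream(["a", "b", "c", "d"], covers, cost, budget=4.0, tau=1.0)
print(gain_b_alone, gain_b_after_a, result)
```

In this sketch a single fixed `tau` is used for simplicity. Practical sieve-style algorithms typically maintain several candidate summaries in parallel, one per guessed threshold, because the right threshold depends on the unknown optimal value.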