Camel: Managing Data for Efficient Stream Learning

Yiming Li,Yanyan Shen,Lei Chen
DOI: https://doi.org/10.1145/3514221.3517836
2022-01-01
Abstract:Many real-world applications rely on predictive models that are incrementally learned online. Specifically, models are updated with a single pass over continuously arriving data batches in a typical stream learning framework. However, this framework has three shortcomings: high training cost, low data effectiveness, and catastrophic forgetting. We describe Camel, a system that addresses the above issues. Camel includes two independent data management components: coreset selection and buffer update. To accelerate model training, Camel selects a coreset from each streaming data batch for model update. Selecting a coreset with worst-case guarantees is NP-hard. To solve this problem, we reformulate coreset selection as a submodular maximization problem by deriving an upper bound on the objective function. To mitigate catastrophic forgetting, Camel maintains a buffer of past representative samples as new data arrive. Moreover, Camel quantizes numerical data in buffer via a quantile sketch to reduce the memory footprint. Finally, extensive experiments validate the effectiveness and efficiency of Camel. In particular, our coreset selection algorithm can achieve a linear speedup with a marginal accuracy loss on redundant datasets. Furthermore, our buffer update algorithms can outperform the state-of-the-art methods for anti-forgetting on various data distributions.
What problem does this paper attempt to address?