Abstract: Operations over data streams typically hinge on efficient mechanisms to aggregate or summarize history on a rolling basis. For high-volume data streams, it is critical to manage state in a manner that is fast and memory efficient -- particularly in resource-constrained or real-time contexts. Here, we address the problem of extracting a fixed-capacity, rolling subsample from a data stream. Specifically, we explore ``data stream curation'' strategies to fulfill requirements on the composition of sample time points retained. Our ``DStream'' suite of algorithms targets three temporal coverage criteria: (1) steady coverage, where retained samples should spread evenly across elapsed data stream history; (2) stretched coverage, where early data items should be proportionally favored; and (3) tilted coverage, where recent data items should be proportionally favored. For each algorithm, we prove worst-case bounds on rolling coverage quality. We focus on the more practical, application-driven case of maximizing coverage quality given a fixed memory capacity. As a core simplifying assumption, we restrict algorithm design to a single update operation: writing from the data stream to a calculated buffer site -- with data never being read back, no metadata stored (e.g., sample timestamps), and data eviction occurring only implicitly via overwrite. Because it draws only on primitive, low-level operations and makes full, overhead-free use of available memory, the ``DStream'' framework is well suited to domains that are resource-constrained, performance-critical, and fine-grained (e.g., individual data items as small as single bits or bytes). The proposed approach supports $\mathcal{O}(1)$ data ingestion via concise bit-level operations. To further practical applications, we provide plug-and-play open-source implementations targeting both scripted and compiled application domains.
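To make the single-update-operation constraint concrete, the following minimal Python sketch shows the general shape of such an interface: each arriving item is written to a buffer site computed from its ingest index alone, nothing is read back, no timestamps are stored, and eviction happens only by overwrite. The site rule used here is a plain power-of-two ring buffer, chosen purely as a stand-in. It is not one of the DStream algorithms and does not satisfy the steady, stretched, or tilted criteria (it simply keeps the most recent items). The names `assign_site_ring`, `ingest`, and `lookup_ring` are hypothetical and are not taken from the released implementations.

```python
def assign_site_ring(T: int, S: int) -> int:
    """Stand-in site selection (NOT a DStream algorithm): plain ring buffer.

    T is the zero-based ingest index of the arriving item and S is the
    buffer capacity, assumed to be a power of two so that the modulo
    reduces to a single bit-mask operation.
    """
    assert S > 0 and S & (S - 1) == 0, "capacity must be a power of two"
    return T & (S - 1)


def ingest(buffer: list, T: int, item) -> None:
    """The single update operation: write the item to its computed site.

    Nothing is read back and no metadata is stored; whatever previously
    occupied the site is evicted implicitly by the overwrite.
    """
    buffer[assign_site_ring(T, len(buffer))] = item


def lookup_ring(T: int, S: int) -> list:
    """Reconstruct which ingest index each site holds after T ingests.

    Because the site rule is deterministic, the stored items' ingest
    indices can be recomputed from T alone, without stored timestamps.
    Sites that have not been written yet are reported as None.
    """
    result = []
    for k in range(S):
        if k >= T:
            result.append(None)  # site k not yet written
        else:
            last = T - 1  # ingest index of the most recent item
            # Largest index t <= last with t congruent to k modulo S.
            result.append(last - ((last - k) & (S - 1)))
    return result


if __name__ == "__main__":
    buf = [None] * 4
    for T in range(6):
        ingest(buf, T, T)  # store each item's own ingest index as its payload
    print(buf)
    print(lookup_ring(6, 4))
```

In this sketch the buffer contents and the output of `lookup_ring` agree by construction, which illustrates why per-item timestamps are unnecessary under a deterministic site rule. A real curation algorithm would replace `assign_site_ring` with a rule tailored to the desired coverage criterion.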

Feature-Selected and -Preserved Sampling for High-Dimensional Stream Data Summary

Concept Drift Based Multi-dimensional Data Streams Sampling Method

An Algorithm for Data Stream Sampling Based on Ring Circular Sliding Window Tightly-Coupled with Buffer

FPCS: Feature Preserving Compensated Sampling of Streaming Time Series Data

Adaptive-Size Reservoir Sampling over Data Streams

Online Feature Selection for Streaming Features with High Redundancy Using Sliding-Window Sampling

Detecting Change in Data Stream: Using Sampling Technique

Progressive online aggregation in a distributed stream system

Structured Downsampling for Fast, Memory-efficient Curation of Online Data Streams

Continuously Distinct Sampling over Centralized and Distributed High Speed Data Streams

Cluster-preserving sampling from fully-dynamic streaming graphs

Optimal Sampling Designs for Multi-dimensional Streaming Time Series with Application to Power Grid Sensor Data

Feature Selection in the Data Stream Based on Incremental Markov Boundary Learning

Fair Streaming Feature Selection

A Time-Series-Based Sample Amplification Model for Data Stream with Sparse Samples

RPS: A Generic Reservoir Patterns Sampler

Continuously Extracting High-Quality Representative Set from Massive Data Streams

Clustering-Structure Representative Sampling from Graph Streams

Stream Aggregation Through Order Sampling

Summarizing Stream Data for Memory-Constrained Online Continual Learning

Improved Streaming Quotient Filter: A Duplicate Detection Approach for Data Streams