Abstract: The generation and collection of big data series are becoming an integral part of many emerging applications in the sciences, IoT, finance, and the web, among others. The terabyte scale of these data series has motivated recent efforts to design fully distributed techniques for supporting operations such as approximate kNN similarity search, which is a building-block operation in most analytics services on data series. Unfortunately, these techniques are heavily geared towards achieving scalability at the cost of result accuracy. State-of-the-art systems report accuracy below 10% and 40%, respectively, which is not practical for many real-world applications. In this paper, we investigate the root problems in these existing techniques that limit their ability to achieve a better trade-off between scalability and accuracy. Then, we propose a framework, called CLIMBER, that encompasses a novel feature extraction mechanism, an indexing scheme, and query processing algorithms for supporting approximate similarity search in big data series. For CLIMBER, we propose a new loss-resistant dual representation composed of rank-sensitive and ranking-insensitive signatures capturing data series objects. Based on this representation, we devise a distributed two-level index structure supported by an efficient data partitioning scheme. Our similarity metrics, tailored for this dual representation, enable meaningful comparison and distance evaluation between the rank-sensitive and ranking-insensitive signatures. Finally, we propose two efficient query processing algorithms, CLIMBER-kNN and CLIMBER-kNN-Adaptive, for answering approximate kNN similarity queries. Our experimental study on real-world and benchmark datasets demonstrates that CLIMBER, unlike existing techniques, achieves result accuracy above 80% while retaining the desired scalability to terabytes of data.
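To make the idea of a dual representation more concrete, the following is a minimal sketch and not CLIMBER's actual feature extraction, which the abstract does not specify. It assumes, purely for illustration, that signatures are derived from a series' nearest reference points ("pivots"): the rank-sensitive signature keeps the pivots in order of increasing distance, while the ranking-insensitive signature keeps only the set of pivots. The function names, the pivot construction, and the use of Euclidean distance are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dual_signature(series, pivots, m):
    """Return (rank_sensitive, rank_insensitive) over the m nearest pivots.

    Assumption for illustration only: signatures are built from pivot ids.
    - rank_sensitive: tuple of pivot ids ordered by increasing distance
    - rank_insensitive: frozenset of the same pivot ids (order discarded)
    """
    order = sorted(range(len(pivots)), key=lambda i: euclidean(series, pivots[i]))
    nearest = order[:m]
    return tuple(nearest), frozenset(nearest)


# Hypothetical pivots and two short series.
pivots = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [10.0, 10.0]]
a = [1.0, 0.5]
b = [3.0, 0.5]

sig_a = dual_signature(a, pivots, 2)
sig_b = dual_signature(b, pivots, 2)

# The two series share the same two nearest pivots but in opposite order,
# so they match on the order-free signature and differ on the ordered one.
print(sig_a[0], sig_b[0])
print(sig_a[1] == sig_b[1])
```

In this toy setup, the order-free signature gives a coarse grouping that tolerates small perturbations, while the ordered signature separates objects within a group more finely. That is one plausible reading of why a two-level index could be built on such a pair, under the assumptions stated above.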