Abstract: Scalable subsequence matching is critical for supporting analytics on big time series, from mining and prediction to hypothesis testing. However, state-of-the-art subsequence matching techniques do not scale well to TB-scale datasets. Not only does index construction become prohibitively expensive, but query response time also deteriorates quickly once the length of the query subsequence exceeds several hundred data points. Although Locality Sensitive Hashing (LSH) has emerged as a promising solution for indexing long time series, it relies on expensive hash functions that perform multiple passes over the data and is thus impractical for big time series. In this work, we propose a lightweight distributed indexing framework, called ChainLink, that supports approximate kNN queries over TB-scale time series data. As a foundation of ChainLink, we design a novel hashing technique, called Single Pass Signature (SPS), that tackles this problem. In particular, we prove theoretically and demonstrate experimentally that the similarity proximity of the indexed subsequences is preserved by our proposed single-pass SPS scheme. Leveraging SPS, ChainLink then adopts a three-step approach for scalable index building: (1) in-place data re-organization within each partition to enable efficient record-level random access to all subsequences, (2) parallel building of hash-based local indices on top of the re-organized data using our SPS scheme for efficient search within each partition, and (3) efficient aggregation of the local indices to construct a centralized yet highly compact global index for effective pruning of irrelevant partitions during query processing. ChainLink performs all three steps in a single map-reduce process. Our experimental evaluation shows that ChainLink indices are compact, at less than 2% of the dataset size, whereas state-of-the-art indices tend to be almost as large as the dataset itself. Better still, ChainLink is up to 2 orders of magnitude faster in index construction than state-of-the-art techniques, while improving query response time by up to 10-fold and result accuracy by 15%.
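
The local-index and global-index steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `signature` function is a simple stand-in (one bit per segment, computed in a single pass) and is explicitly not SPS, whose definition the abstract does not give. All function names, the window length, and the toy partitions are assumptions made for illustration.

```python
from collections import defaultdict


def signature(seq, n_segments=4):
    """Stand-in one-pass signature (NOT the paper's SPS): one bit per
    segment, set when the segment mean exceeds the overall mean."""
    seg_len = len(seq) // n_segments
    sums = [0.0] * n_segments
    for i, x in enumerate(seq[: seg_len * n_segments]):  # single pass over the data
        sums[i // seg_len] += x
    overall = sum(sums) / (seg_len * n_segments)
    return tuple(int(s / seg_len > overall) for s in sums)


def build_local_index(series, window, n_segments=4):
    """Hash every length-`window` subsequence of one partition to its offsets."""
    index = defaultdict(list)
    for start in range(len(series) - window + 1):
        sig = signature(series[start:start + window], n_segments)
        index[sig].append(start)
    return index


def build_global_index(local_indices):
    """Aggregate local indices: signature -> set of partitions containing it."""
    global_index = defaultdict(set)
    for pid, local in local_indices.items():
        for sig in local:
            global_index[sig].add(pid)
    return global_index


def candidates(query, local_indices, global_index, n_segments=4):
    """Prune partitions via the global index, then look up local offsets."""
    sig = signature(query, n_segments)
    return {pid: local_indices[pid][sig] for pid in global_index.get(sig, ())}


# Toy data: partition 0 holds a rising ramp, partition 1 a falling ramp.
partitions = {0: list(range(10)), 1: list(range(9, -1, -1))}
local_indices = {pid: build_local_index(s, window=8) for pid, s in partitions.items()}
global_index = build_global_index(local_indices)

query = [10, 12, 14, 16, 18, 20, 22, 24]  # rising query subsequence
print(candidates(query, local_indices, global_index))  # only partition 0 survives pruning
```

In this sketch the global index stores only signatures and partition identifiers, not offsets, which is one plausible way a global index can stay much smaller than the data while still ruling out partitions that cannot contain a match.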

ChainLink: Indexing Big Time Series Data For Long Subsequence Matching
