Abstract: Columnar storage is now an industry-standard design in most open-source and commercial time series database products, making them HTAP systems. In the single-column storage scheme, the time column of a time series serves as the key identifying its one value column. When multiple time series share a similar set of timestamps, as is very likely within a module of multiple sensors, it is natural to group them together, i.e., one time column identifies multiple value columns in a single-group storage scheme. While sharing one time column among multiple value columns reduces the space cost of repeated timestamps, it may introduce extra space cost for recording null values. The reason is that time series may not be exactly aligned on every timestamp, owing to missing values, distinct data collection frequencies, unsynchronized clocks, and so on. The column groups storage scheme therefore divides the columns into multiple groups, within each of which the value columns share the same time column. Unfortunately, finding the optimal column groups for the minimum space cost is highly challenging: the problem is NP-hard according to our analysis. We therefore propose a heuristic algorithm for automatically grouping time series for efficient columnar storage. The column groups storage has been deployed in Apache IoTDB, an open-source time series database. An extensive performance analysis over real-world data from our industrial partners demonstrates that the proposed column groups achieve near-optimal storage, more concise than that of the single-column or single-group schemes. Interestingly, both the flushing and querying time costs of column groups are comparable to those of single-column or single-group storage, i.e., no extra time cost is incurred.
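The trade-off between repeated timestamps and recorded nulls can be illustrated with a toy cost model. The sketch below is not the paper's cost model or its heuristic: the byte sizes, the one-bit-per-slot null bitmap, and the function names `group_cost`, `total_cost`, and `greedy_grouping` are all assumptions made for illustration. It compares the three schemes on three hypothetical series and applies a simple greedy pairwise merge as a stand-in grouping heuristic.

```python
from itertools import combinations

# Assumed sizes, chosen only for illustration.
TS_BYTES = 8    # bytes per stored timestamp
VAL_BYTES = 8   # bytes per stored (non-null) value
NULL_BITS = 1   # bitmap bits per (timestamp, column) slot in a multi-column group


def group_cost(group, series):
    """Bytes for one group: shared time column + values + null bitmap."""
    times = set().union(*(series[name] for name in group))
    values = sum(len(series[name]) for name in group)
    # A single-column group has no nulls, so it needs no bitmap.
    bitmap_bits = len(times) * len(group) * NULL_BITS if len(group) > 1 else 0
    return len(times) * TS_BYTES + values * VAL_BYTES + (bitmap_bits + 7) // 8


def total_cost(groups, series):
    """Total bytes over all groups of a grouping."""
    return sum(group_cost(g, series) for g in groups)


def greedy_grouping(series):
    """Repeatedly merge the pair of groups with the largest positive saving.

    This is a generic greedy stand-in, not the algorithm proposed in the paper.
    """
    groups = [(name,) for name in sorted(series)]
    while True:
        best = None
        for a, b in combinations(groups, 2):
            saving = (group_cost(a, series) + group_cost(b, series)
                      - group_cost(a + b, series))
            if saving > 0 and (best is None or saving > best[0]):
                best = (saving, a, b)
        if best is None:
            return groups
        _, a, b = best
        groups = [g for g in groups if g not in (a, b)] + [a + b]


# Hypothetical data: s1 and s2 are aligned, s3 covers a disjoint time range.
series = {
    "s1": set(range(0, 100)),
    "s2": set(range(0, 100)),
    "s3": set(range(1000, 1100)),
}

single_column = [("s1",), ("s2",), ("s3",)]
single_group = [("s1", "s2", "s3")]
column_groups = greedy_grouping(series)

print("single-column:", total_cost(single_column, series))
print("single-group: ", total_cost(single_group, series))
print("column groups:", column_groups, total_cost(column_groups, series))
```

Under these assumptions, the aligned series s1 and s2 end up in one group and s3 stays alone, because merging s3 would add bitmap slots without saving any timestamps. With a costlier null encoding the gap between single-group and column groups would widen.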

Wide Table Layout Optimization Based on Column Ordering and Duplication

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

Column ordering for input/output optimization in tabular data

Cost-Based Optimization Of Logical Partitions For A Query Workload In A Hadoop Data Warehouse

An Optimized Learning-Based Directory Placement Policy with Two-Rounds Selection in Distributed File Systems

Dynamic Data Layout Optimization with Worst-case Guarantees

Layout-Conscious Optimization: Beyond Hybrid Row-Column Storage Model

An Empirical Evaluation of Columnar Storage Formats

Column-Oriented Storage Techniques for MapReduce

Columnar Formats for Schemaless LSM-based Document Stores

KCGS-Store: A Columnar Storage Based on Group Sorting of Key Columns

Towards Optimizing Storage Costs on the Cloud

A Novel Optimization Method to Improve De-duplication Storage System Performance

Query Optimization and Rebalancing Methods based on CMD

Leveraging Column Family to Improve Multidimensional Query Performance in HBase

Heterogeneous Replicas for Multi-dimensional Data Management

Qd-tree: Learning Data Layouts for Big Data Analytics

SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis

Optimizing Parallel I/O Accesses Through Pattern-Directed and Layout-Aware Replication

Grouping Time Series for Efficient Columnar Storage

Optimize Multidimensional Arrays Queries with Heterogeneous Replica Method