Separation or Not: on Handing Out-of-Order Time-Series Data in Leveled LSM-Tree

Yuyuan Kang, Xiangdong Huang, Shaoxu Song, Lingzhe Zhang, Jialin Qiao, Chen Wang, Jianmin Wang, Julian Feinauer
DOI: https://doi.org/10.1109/icde53745.2022.00315
2022-01-01
Abstract: LSM-Tree is widely adopted for storing time-series data in the Internet of Things. Under the conventional policy (denoted by $\pi_{c}$), written data is first buffered in a MemTable in memory. When the MemTable is full, its data is written to disk to form SSTables. Compaction is then triggered to sort the data in each layer of the LSM-Tree on disk. However, data can arrive out of order for reasons such as transmission delay. Apache IoTDB uses separate in-order and out-of-order MemTables to buffer in-order and out-of-order data, which accelerates queries; this is the separation policy (denoted by $\pi_{s}$). Given a fixed memory budget for buffering data, however, $\pi_{s}$ also influences the write amplification (WA) of the leveled LSM-Tree. Whether separation increases or decreases WA, and by how much, depends on the properties of the workload and on the capacities of the in-order and out-of-order MemTables. There is therefore strong demand for robust models that estimate the expected amount of data rewritten in each compaction and predict the WA under $\pi_{c}$ and $\pi_{s}$. Note that, as an industrial paper, rather than proposing novel techniques for research problems, we focus on the practical question of whether or not to separate in order to lower write amplification. Experiments on synthetic and real-world datasets show that the models for estimating WA are accurate under various delay distributions. In addition, based on the estimation models, we implement an analyzer module in the open-source Apache IoTDB that chooses the policy with the lower WA. We apply the method in the use case of our industrial partner, a service provider of engineering machinery. The use case verifies the effectiveness of deciding whether or not to separate based on WA estimation.
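To make the separation policy $\pi_{s}$ concrete, the following is a minimal Python sketch of two MemTables that route each arriving timestamp to an in-order or an out-of-order buffer. It is a toy model under stated assumptions, not the implementation in Apache IoTDB or the paper's estimation model: the class name `SeparationBuffer`, its fields, the capacities counted in points, and the routing rule (a point counts as in-order when its timestamp exceeds the largest timestamp already flushed from the in-order MemTable) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SeparationBuffer:
    """Toy model of a separation policy with two MemTables (hypothetical names)."""

    capacity_seq: int    # assumed capacity of the in-order MemTable, in points
    capacity_unseq: int  # assumed capacity of the out-of-order MemTable, in points
    last_flushed_time: int = -1  # largest timestamp flushed from the in-order MemTable
    seq: list = field(default_factory=list)            # in-order MemTable
    unseq: list = field(default_factory=list)          # out-of-order MemTable
    flushed_seq: list = field(default_factory=list)    # "SSTables" from in-order data
    flushed_unseq: list = field(default_factory=list)  # "SSTables" from out-of-order data

    def write(self, timestamp: int) -> None:
        """Route one point to a MemTable and flush that MemTable when it is full."""
        if timestamp > self.last_flushed_time:
            # Assumed rule: newer than everything flushed so far, so in-order.
            self.seq.append(timestamp)
            if len(self.seq) >= self.capacity_seq:
                self.seq.sort()
                self.last_flushed_time = self.seq[-1]
                self.flushed_seq.append(self.seq)
                self.seq = []
        else:
            # Late arrival: buffer it separately from the in-order data.
            self.unseq.append(timestamp)
            if len(self.unseq) >= self.capacity_unseq:
                self.unseq.sort()
                self.flushed_unseq.append(self.unseq)
                self.unseq = []


def replay(arrivals, capacity_seq, capacity_unseq):
    """Feed a sequence of arrival timestamps through the toy buffer."""
    buf = SeparationBuffer(capacity_seq, capacity_unseq)
    for t in arrivals:
        buf.write(t)
    return buf
```

For example, `replay([1, 2, 4, 5, 3, 6, 7, 8, 0, 9], 4, 2)` sends the late points 3 and 0 to the out-of-order MemTable, so the files flushed from the in-order MemTable cover disjoint time ranges. How the fixed memory budget is split between the two capacities is exactly the kind of parameter whose effect on WA the abstract says must be estimated.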