On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDB

Chenguang Fang, Zijie Chen, Shaoxu Song, Xiangdong Huang, Chen Wang, Jianmin Wang
DOI: https://doi.org/10.14778/3681954.3681977
IF: 2.5
2024-07-01
Proceedings of the VLDB Endowment
Abstract: Log-structured merge trees (LSM-trees) are commonly employed as the storage engines for write-intensive workloads in modern time series databases, including Apache IoTDB. Following the append-only principle, LSM-trees can handle intensive writes and updates, but consequently suffer from high space amplification (SA). To reduce SA in LSM-trees, compaction is triggered periodically to reorganize a large number of immutable files on disk and eliminate redundancy. The problem is further complicated in Internet of Things (IoT) scenarios, where frequent out-of-order insertions and updates introduce duplicated keys, obsolete values, and overlapping bitmaps in multi-column data, thereby exacerbating SA. To mitigate SA in such contexts, this paper presents a Multi-Column Compaction (MCC) strategy in Apache IoTDB, an open-source time series database that uses an LSM-tree architecture and supports multi-column storage. We take into consideration both the separate insertions (out-of-order data) and the updates of multi-column data, and analyze the hardness of selecting the files that yield the maximum space reduction in compaction. We then propose a heuristic method to improve file selection and thus reduce SA. To enhance the efficiency of this approach, we further devise a File Prefetcher and a Compaction Cache. The proposed MCC has been implemented in Apache IoTDB. Experimental results demonstrate that MCC achieves better performance in reducing space amplification.
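To make the notion of space amplification in multi-column data concrete, the following is a minimal sketch and not the paper's MCC algorithm. It assumes a toy model in which each file maps a timestamp to a row of column values (`None` meaning the column was not written), the stored size of a file is its number of timestamps plus its number of non-null cells, and later files in the list are newer. All names (`stored_size`, `compact`, `space_amplification`, `best_pair`) are hypothetical, and the brute-force pair search only stands in for the file-selection problem the abstract describes.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

# Toy model (an assumption): a file maps timestamp -> row of column values.
File = Dict[int, List[Optional[float]]]


def stored_size(files: List[File]) -> int:
    """Size = one unit per stored timestamp plus one per non-null cell."""
    return sum(
        1 + sum(v is not None for v in row)
        for f in files
        for row in f.values()
    )


def compact(files: List[File]) -> File:
    """Merge files in list order; the newest non-null value wins per cell."""
    merged: File = {}
    for f in files:
        for ts, row in f.items():
            cur = merged.setdefault(ts, [None] * len(row))
            for i, v in enumerate(row):
                if v is not None:
                    cur[i] = v
    return merged


def space_amplification(files: List[File]) -> float:
    """Ratio of the physically stored size to the size of the live data."""
    return stored_size(files) / stored_size([compact(files)])


def best_pair(files: List[File]) -> Tuple[int, int]:
    """Brute force: the pair of files whose merge frees the most space."""
    def reduction(pair: Tuple[int, int]) -> int:
        chosen = [files[pair[0]], files[pair[1]]]
        return stored_size(chosen) - stored_size([compact(chosen)])
    return max(combinations(range(len(files)), 2), key=reduction)


# Hypothetical two-column data: f2 fills in a missing column at timestamp 1
# (separate insertion) and overwrites column 0 at timestamp 2 (update).
f1: File = {1: [1.0, None], 2: [2.0, 20.0]}
f2: File = {1: [None, 10.0], 2: [2.5, None]}
f3: File = {3: [3.0, 30.0]}
files = [f1, f2, f3]
```

In this toy data, `f1` and `f2` share both timestamps, so merging them removes two duplicated keys and one obsolete value, while `f3` overlaps with nothing. Enumerating all pairs is only feasible for tiny inputs, which is one way to see why a heuristic for file selection is of interest.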
computer science, information systems, theory & methods