CAMAL: Optimizing LSM-trees via Active Learning

Weiping Yu, Siqiang Luo, Zihao Yu, Gao Cong
2024-09-23
Abstract: We use machine learning to optimize LSM-tree structure, aiming to reduce the cost of processing various read/write operations. We introduce a new approach, Camal, which has the following features: (1) ML-Aided: Camal is the first attempt to apply active learning to tune LSM-tree based key-value stores. The learning process is coupled with traditional cost models to improve the training process; (2) Decoupled Active Learning: backed by rigorous analysis, Camal adopts an active learning paradigm based on a decoupled tuning of each parameter, which further accelerates the learning process; (3) Easy Extrapolation: Camal adopts an effective mechanism to incrementally update the model as the data size grows; (4) Dynamic Mode: Camal is able to tune the LSM-tree online under dynamically changing workloads; (5) Significant System Improvement: By integrating Camal into a full system, RocksDB, the system performance improves by 28% on average and up to 8x compared to a state-of-the-art RocksDB design.
Databases, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper asks how machine learning, specifically active learning, can be used to optimize the structure of an LSM-tree so that the cost of processing various read and write operations is reduced. The authors propose a new method, Camal, with the following features:

1. **ML-Aided**: Camal is the first attempt to apply active learning to tuning LSM-tree-based key-value stores. The learning process is coupled with traditional cost models to improve training.
2. **Decoupled Active Learning**: Backed by rigorous analysis, Camal adopts an active learning paradigm in which each parameter is tuned in a decoupled fashion, which further accelerates learning.
3. **Easy Extrapolation**: Camal uses an effective mechanism to update the model incrementally as the data size grows.
4. **Dynamic Mode**: Camal can tune the LSM-tree online under dynamically changing workloads.
5. **Significant System Improvement**: With Camal integrated into the full RocksDB system, performance improves by 28% on average and by up to 8x compared to a state-of-the-art RocksDB design.

### Main Challenges and New Designs

To achieve these goals, the paper addresses the following main challenges:

- **Random Initialization Problem**: Techniques driven by complexity analysis replace random initialization, so that the initial samples start closer to the optimal solution.
- **Parameter Decoupling**: A novel hierarchical sampling technique decouples each parameter from the complex I/O model, so that a good solution is found more quickly.
- **Extrapolation Strategy for Data Growth**: An extrapolation strategy moves quickly to new tuning parameters when the amount of data increases, without retraining.
- **Dynamic Mode**: A dynamic LSM-tree is designed for dynamically changing workloads, enabling it to adapt to parameter changes.
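The benefit of tuning parameters one at a time can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `toy_cost` is a made-up analytic cost function, the two parameters (`size_ratio`, `filter_share`) and their candidate values are hypothetical, and the search is plain coordinate-wise minimization. It is not Camal's learned model or its hierarchical sampling procedure. It only shows that holding the other parameters fixed while varying one reduces the number of samples from the product of the candidate counts to roughly their sum per round.

```python
import math

# Hypothetical candidate values for two tunable parameters (illustrative names).
SIZE_RATIOS = list(range(2, 21))                # growth factor between levels
FILTER_SHARES = [i / 20 for i in range(1, 20)]  # share of memory given to Bloom filters


def toy_cost(size_ratio, filter_share, read_frac):
    """Made-up analytic stand-in for a measured per-operation cost (an assumption)."""
    total_entries = 1e8
    buffer_entries = (1.0 - filter_share) * 1e6
    levels = max(1.0, math.log(total_entries / buffer_entries) / math.log(size_ratio))
    bits_per_key = filter_share * 16.0
    false_positive = math.exp(-bits_per_key * math.log(2) ** 2)
    write_cost = (1.0 - read_frac) * size_ratio * levels / 32.0
    read_cost = read_frac * (1.0 + levels * false_positive)
    return write_cost + read_cost


def tune_decoupled(read_frac, seed, rounds=3):
    """Tune one parameter at a time while holding the other fixed."""
    size_ratio, filter_share = seed
    evaluations = 0
    for _ in range(rounds):
        # Vary only the size ratio.
        costs = {t: toy_cost(t, filter_share, read_frac) for t in SIZE_RATIOS}
        evaluations += len(costs)
        size_ratio = min(costs, key=costs.get)
        # Vary only the filter share.
        costs = {f: toy_cost(size_ratio, f, read_frac) for f in FILTER_SHARES}
        evaluations += len(costs)
        filter_share = min(costs, key=costs.get)
    return (size_ratio, filter_share), evaluations


# Usage: start from a fixed seed (standing in for a cost-model-derived starting point).
best, evaluations = tune_decoupled(read_frac=0.5, seed=(10, 0.5))
grid_size = len(SIZE_RATIOS) * len(FILTER_SHARES)
print(best, toy_cost(*best, 0.5), evaluations, grid_size)
```

In this toy setting the decoupled search never ends with a higher cost than its seed, because each step's candidate set includes the current value. With more than two parameters, the gap between the per-round sample count and the full grid widens further.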
### Contribution Summary

The main contributions of the paper are:

- It proposes a new model, Camal, which is the first to apply active learning to LSM-tree instance optimization and which combines a complexity model with active learning.
- It designs a novel hierarchical sampling technique that reduces the sampling space, significantly shortens training time, and improves practicality.
- Camal can reasonably extrapolate the required settings without retraining, and so copes better with data growth.
- It introduces DLSM, an LSM-tree variant designed specifically to adapt to dynamic workloads.
- It evaluates three widely used machine learning models and discusses their advantages and disadvantages.
- It integrates the method into RocksDB, a widely used LSM-based key-value database, demonstrating its practical effect and significantly reducing latency.

Through these designs and methods, Camal reduces the end-to-end latency of the LSM-tree under different workloads and improves the overall performance of the system.