Abstract: We use machine learning to optimize LSM-tree structure, aiming to reduce the cost of processing various read/write operations. We introduce a new approach, Camal, which boasts the following features: (1) ML-Aided: Camal is the first attempt to apply active learning to tune LSM-tree based key-value stores. The learning process is coupled with traditional cost models to improve the training process; (2) Decoupled Active Learning: backed by rigorous analysis, Camal adopts an active learning paradigm based on decoupled tuning of each parameter, which further accelerates the learning process; (3) Easy Extrapolation: Camal adopts an effective mechanism to incrementally update the model as the data size grows; (4) Dynamic Mode: Camal is able to tune the LSM-tree online under dynamically changing workloads; (5) Significant System Improvement: By integrating Camal into a full system, RocksDB, the system performance improves by 28% on average and up to 8x compared to a state-of-the-art RocksDB design.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to optimize the LSM-tree structure through machine learning, especially active learning, in order to reduce the cost of processing various read and write operations. Specifically, the authors propose a new method, Camal, which optimizes the LSM-tree in the following ways:

1. **ML-Aided**: Camal is the first attempt to apply active learning to tune LSM-tree-based key-value stores. The learning process is combined with traditional cost models to improve training.
2. **Decoupled Active Learning**: Backed by rigorous analysis, Camal adopts an active learning paradigm that tunes each parameter separately, which further accelerates the learning process.
3. **Easy Extrapolation**: Camal adopts an effective mechanism to incrementally update the model as the amount of data grows.
4. **Dynamic Mode**: Camal can tune the LSM-tree online under dynamically changing workloads.
5. **Significant System Improvement**: With Camal integrated into the full system RocksDB, system performance improves by 28% on average and by up to 8x compared to a state-of-the-art RocksDB design.

### Main Challenges and New Designs

To achieve these goals, the paper addresses the following main challenges:

- **Random Initialization Problem**: Avoid random initialization by using techniques driven by complexity analysis, so that the initial samples start closer to the optimal solution.
- **Parameter Decoupling**: Propose a novel hierarchical sampling technique that decouples each parameter from the complex I/O model, so that a good solution is found more quickly.
- **Extrapolation Strategy for Data Growth**: Design an extrapolation strategy that transitions quickly to new tuning parameters, without retraining, when the amount of data increases.
- **Dynamic Mode**: Design a dynamic LSM-tree for dynamically changing workloads, so that the tree can adapt to parameter changes.
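The first two challenges above (seeding from a cost model instead of a random start, then tuning one parameter at a time) can be sketched roughly as follows. This is a minimal illustration and not the paper's algorithm: the cost formulas are textbook-style approximations for a leveled LSM-tree, `measure` is a hypothetical stand-in for benchmarking a real store, and all constants, the parameter names `T` (size ratio) and `bits` (Bloom filter bits per key), and the `window` search radius are assumptions.

```python
import math

N = 10_000_000        # entries stored (assumed)
ENTRY_BITS = 1024     # size of one entry in bits (assumed)
MEM_BITS = 16 * N     # memory shared by write buffer and Bloom filters (assumed)
PAGE_ENTRIES = 32     # entries per disk page (assumed)

def model_cost(T, bits, read_frac):
    """Analytical I/O estimate per operation (textbook-style, not the paper's model)."""
    buffer_entries = (MEM_BITS - bits * N) / ENTRY_BITS  # memory left for the buffer
    levels = max(1.0, math.log(N / buffer_entries, T))   # approximate tree depth
    fpr = 0.6185 ** bits                  # standard Bloom filter false-positive estimate
    read = 1.0 + fpr * levels             # one useful I/O plus false-positive probes
    write = T * levels / PAGE_ENTRIES     # amortized merge cost per insert
    return read_frac * read + (1.0 - read_frac) * write

def measure(T, bits, read_frac):
    """Hypothetical benchmark stand-in: adds an overhead the model does not capture."""
    return model_cost(T, bits, read_frac) + 0.01 * T

def tune(read_frac, t_grid, bits_grid, window=2):
    """Seed from the model, then refine each parameter separately near the seed."""
    # Step 1: the seed costs no measurements, only model evaluations.
    seed_T, seed_bits = min(
        ((t, b) for t in t_grid for b in bits_grid),
        key=lambda c: model_cost(c[0], c[1], read_frac),
    )
    # Step 2: measure size-ratio candidates near the seed, holding bits fixed.
    t_near = [t for t in t_grid if abs(t - seed_T) <= window]
    best_T = min(t_near, key=lambda t: measure(t, seed_bits, read_frac))
    # Step 3: measure bits-per-key candidates near the seed, holding T fixed.
    b_near = [b for b in bits_grid if abs(b - seed_bits) <= window]
    best_bits = min(b_near, key=lambda b: measure(best_T, b, read_frac))
    return best_T, best_bits, len(t_near) + len(b_near)
```

With a grid of 19 size ratios and 15 filter settings, an exhaustive search would need 285 measurements, while this decoupled search spends at most 10, because the per-parameter sample counts add rather than multiply. Whether such a shortcut is safe depends on how strongly the parameters interact, which is the part the paper's analysis is said to address.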
### Contribution Summary

The main contributions of the paper include:

- Propose a new model named Camal, which is the first to apply active learning to LSM-tree instance optimization and which combines a complexity model with active learning.
- Design a novel hierarchical sampling technique that reduces the sampling space, significantly shortens training time, and improves practicality.
- Show that Camal can reasonably extrapolate the required settings without retraining, and so copes better with data growth.
- Introduce a new design named DLSM, an LSM-tree variant specifically designed to adapt to dynamic workloads.
- Evaluate three widely used machine learning models and discuss their advantages and disadvantages.
- Integrate the method into the widely used LSM-tree key-value database RocksDB, demonstrating its practical effect and significantly reducing latency.

Through these designs and methods, Camal reduces the end-to-end latency of the LSM-tree under different workloads and improves the overall performance of the system.
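The summary does not describe how DLSM or Camal's dynamic mode decides when to retune, so the following is only a generic sketch of one way an online tuner could notice a workload shift: track the read fraction over a sliding window and signal a retune when it drifts from the mix the current settings were tuned for. The class name `WorkloadMonitor`, the window size, and the drift threshold are all hypothetical.

```python
from collections import deque

class WorkloadMonitor:
    """Tracks the read fraction over a sliding window of recent operations."""

    def __init__(self, window=1000, threshold=0.2, tuned_read_frac=0.5):
        self.ops = deque(maxlen=window)         # 1 for a read, 0 for a write
        self.reads = 0                          # running count of reads in the window
        self.threshold = threshold              # drift that triggers a retune (assumed)
        self.tuned_read_frac = tuned_read_frac  # mix the current settings target

    def record(self, is_read):
        """Record one operation; return True when a retune should be triggered."""
        if len(self.ops) == self.ops.maxlen:
            self.reads -= self.ops[0]           # the oldest entry is about to be evicted
        self.ops.append(int(is_read))
        self.reads += int(is_read)
        if len(self.ops) < self.ops.maxlen:
            return False                        # wait until the window is full
        current = self.reads / len(self.ops)
        if abs(current - self.tuned_read_frac) > self.threshold:
            self.tuned_read_frac = current      # caller would retune for this mix
            return True
        return False
```

A caller would invoke `record` on every operation and, when it returns `True`, ask the tuner for settings that match the new read fraction. A larger window reacts more slowly but avoids retuning on short bursts.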

CAMAL: Optimizing LSM-trees via Active Learning

BushStore: Efficient B+Tree Group Indexing for LSM-Tree in Non-Volatile Memory

SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis

A Scalable 2T-1Fefet Based Content Addressable Memory Design for Energy Efficient Data Search

Learning Autoregressive Model in LSM-Tree Based Store

Autumn: A Scalable Read Optimized LSM-tree based Key-Value Stores with Fast Point and Range Read Speed

Towards flexibility and robustness of LSM trees

LearnedKV: Integrating LSM and Learned Index for Superior Performance on SSD

SplitDB: Closing the Performance Gap for LSM-Tree-Based Key-Value Stores

An Update-intensive LSM-based R-tree Index

MTDB: an LSM-tree-based key-value store using a multi-tree structure to improve read performance

FPGA-Accelerated Compactions for LSM-based Key-Value Store

vLSM: Low tail latency and I/O amplification in LSM-based KV stores

Accelerating LSM-Tree with the Dentry Management of File System

Optimizing LSM-based indexes for disaggregated memory

Closing the B-tree vs. LSM-tree Write Amplification Gap on Modern Storage Hardware with Built-in Transparent Compression

Endure: A Robust Tuning Paradigm for LSM Trees Under Workload Uncertainty

Breaking Down Memory Walls: Adaptive Memory Management in LSM-based Storage Systems (Extended Version)

On Integration of Appends and Merges in Log-Structured Merge Trees