MetaHive: A Cache-Optimized Metadata Management for Heterogeneous Key-Value Stores

Alireza Heidari, Amirhossein Ahmadi, Zefeng Zhi, Wei Zhang
2024-07-27
Abstract: Cloud key-value (KV) stores provide businesses with a cost-effective and adaptive alternative to traditional on-premise data management solutions. KV stores frequently consist of heterogeneous clusters, characterized by varying hardware specifications of the deployment nodes, with each node potentially running a distinct version of the KV store software. This heterogeneity is accompanied by the diverse metadata that they need to manage. In this study, we introduce MetaHive, a cache-optimized approach to managing metadata in heterogeneous KV store clusters. MetaHive disaggregates the original data from its associated metadata to promote independence between them, while maintaining their interconnection during usage. This makes the metadata opaque to the downstream processes and the other KV stores in the cluster. MetaHive also ensures that the KV and metadata entries are stored in the vicinity of each other in memory and storage. This allows MetaHive to optimally utilize the caching mechanism without extra storage read overhead for metadata retrieval. We deploy MetaHive to ensure data integrity in RocksDB and demonstrate its rapid data validation with minimal effect on performance.
Databases, Information Retrieval
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to manage and optimize metadata efficiently and securely in a heterogeneous key-value (KV) storage cluster, so as to ensure data integrity and access performance.

### Specific problems and challenges:

1. **Performance issues**:
   - In many application scenarios, an application needs to access the corresponding metadata immediately after reading KV data (for example, to verify the correctness of the data). To improve cache efficiency and reduce cache misses and memory page lookups, the metadata should be placed in the same memory block as the corresponding KV data.
2. **Heterogeneity issues**:
   - Cloud databases usually adopt a distributed KV storage architecture in which each KV storage shard is hosted on a separate node. These nodes may have different hardware specifications, and the KV storage software version may differ from node to node. In such a diverse cluster, introducing KV metadata should not affect any version of the KV storage application, which requires the design to be both backward and forward compatible.
3. **Privacy issues**:
   - In a cluster environment, the KV shard on each node contains a subset of all KVs, and these KVs may be private data specific to that node. Therefore, metadata containing information about a node's key-value data should be stored on the same edge node and should not be migrated to other shards, to ensure data privacy.

### Shortcomings of existing methods:

1. **Adding checksums to the KV payload**:
   - This method is incompatible with a heterogeneous RocksDB cluster because other versions of RocksDB cannot interpret the changed data structure. Moreover, for some types of metadata (such as statistical information required for ETL), this method causes unnecessary overhead and potential security risks.
2. **Adding checksums to the Footer Block**:
   - Although this method keeps the KV pairs unchanged and is suitable for heterogeneous RocksDB clusters, it is not cache-friendly: the metadata sits on a different memory page from its target KV, so every metadata lookup requires two memory page accesses, which increases the processing cost.

### Design goals of MetaHive:

MetaHive aims to solve the above problems in the following ways:

- **Cache optimization**: Ensure that KV data and its corresponding metadata are stored in the same memory page to improve cache efficiency.
- **Heterogeneous compatibility**: Ensure that the system works correctly across different versions of the KV storage software, achieving backward and forward compatibility.
- **Privacy protection**: Ensure that the metadata of each node is stored only locally and is not migrated to other nodes or shards.

Through these designs, MetaHive can manage metadata efficiently and securely in a heterogeneous KV storage cluster, ensuring data integrity and access performance.
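
The co-location goal can be illustrated with a minimal Python sketch. It is not MetaHive's implementation and does not use RocksDB's API. It assumes a toy sorted in-memory store (`SortedStore`), a hypothetical key suffix (`META_SUFFIX`), and user keys that contain no NUL byte. Under those assumptions, a checksum stored as its own entry sorts directly after the data entry, so the value payload stays unchanged while the metadata sits next to it.

```python
import bisect
import zlib

# Hypothetical suffix (not from the paper). Assuming user keys contain no
# NUL byte, key + META_SUFFIX sorts immediately after key.
META_SUFFIX = b"\x00meta"


class SortedStore:
    """Toy in-memory store that keeps keys in sorted order."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._vals = {}   # key -> value

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, key):
        return self._vals[key]

    def keys(self):
        return list(self._keys)


def put_with_meta(store, key, value):
    """Write the value unchanged, plus a separate checksum entry next to it."""
    store.put(key, value)
    store.put(key + META_SUFFIX, zlib.crc32(value).to_bytes(4, "big"))


def get_validated(store, key):
    """Read a value and verify it against its neighbouring checksum entry."""
    value = store.get(key)
    expected = int.from_bytes(store.get(key + META_SUFFIX), "big")
    if zlib.crc32(value) != expected:
        raise ValueError(f"checksum mismatch for {key!r}")
    return value


store = SortedStore()
for k in (b"user10", b"user1", b"user2"):
    put_with_meta(store, k, b"value-of-" + k)

ordered = store.keys()
i = ordered.index(b"user1")
print(ordered[i + 1] == b"user1" + META_SUFFIX)  # metadata key is adjacent
print(get_validated(store, b"user1"))
```

In this sketch, a reader that knows nothing about the metadata entry can still call `store.get(key)` and receive the original payload, which mirrors the compatibility goal. In a real block-based store, adjacency in key order only makes it likely that both entries share a block, and entries near a block boundary would need extra handling.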