DeepMapping: Learned Data Mapping for Lossless Compression and Efficient Lookup

Lixi Zhou, K. Selçuk Candan, Jia Zou
DOI: https://doi.org/10.1109/ICDE60146.2024.00008
2024-09-26
Abstract: Storing tabular data to balance storage and query efficiency is a long-standing research question in the database community. In this work, we argue and show that a novel DeepMapping abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide better storage cost, better latency, and better run-time memory footprint, all at the same time. Such unique properties may benefit a broad class of use cases in capacity-limited devices. Our proposed DeepMapping abstraction transforms a dataset into multiple key-value mappings and constructs a multi-tasking neural network model that outputs the corresponding values for a given input key. To deal with memorization errors, DeepMapping couples the learned neural network with a lightweight auxiliary data structure capable of correcting mistakes. The auxiliary structure design further enables DeepMapping to efficiently deal with insertions, deletions, and updates even without retraining the mapping. We propose a multi-task search strategy for selecting the hybrid DeepMapping structures (including model architecture and auxiliary structure) with a desirable trade-off among memorization capacity, size, and efficiency. Extensive experiments with a real-world dataset, synthetic and benchmark datasets, including TPC-H and TPC-DS, demonstrated that the DeepMapping approach can better balance the retrieving speed and compression ratio against several cutting-edge competitors.
Databases
What problem does this paper attempt to address?
This paper addresses how to achieve efficient compression and fast queries when storing and querying tabular data on edge devices with limited computing and storage resources. Specifically, it proposes a novel data abstraction named DeepMapping, which uses the memorization ability of deep neural networks to integrate compression and indexing, in order to better balance storage cost, query latency, and run-time memory footprint.

### Core Problems of the Paper

1. **Real-time and Resource Constraints**: As real-time computing is increasingly pushed to edge servers with limited computing and storage capabilities, balancing storage (such as disk and memory) against computing costs (such as query execution latency) on such platforms to achieve real-time response has become a key issue.
2. **Limitations of Existing Methods**:
   - **Regression Approximation**: For example, ModelarDB uses regression to approximate piecewise numerical data, but it needs to scan each segment, resulting in high query latency.
   - **Ordered Compression**: For example, separation coding compresses by forcing ordering, but it requires binary search, also resulting in high query latency.
3. **Importance of Queries**: For many emerging edge applications (such as self-service retail, quality control in large-scale manufacturing, and autonomous robots), random queries and updates are essential functions. However, existing solutions do not integrate compression and indexing effectively enough to achieve low storage cost and low query latency at the same time.

### DeepMapping's Solution

DeepMapping uses the memorization ability of deep neural networks: it converts the dataset into multiple key-value mappings and constructs a multi-task neural network model that outputs the value corresponding to a given input key.
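
As a minimal sketch of the first step, the snippet below splits a small table into one key-value mapping per non-key column and encodes each column's values as integer class labels that a classifier could be trained to predict. The function names, the sample rows, and the label-encoding scheme are illustrative assumptions, not the paper's implementation.

```python
def to_key_value_mappings(rows, key_column):
    """Split a table (list of dict rows) into one key -> value mapping per non-key column."""
    mappings = {}
    for row in rows:
        key = row[key_column]
        for column, value in row.items():
            if column == key_column:
                continue
            mappings.setdefault(column, {})[key] = value
    return mappings


def encode_labels(mapping):
    """Replace each distinct value with a small integer class id (assumed encoding)."""
    classes = sorted(set(mapping.values()))
    class_id = {value: i for i, value in enumerate(classes)}
    return {key: class_id[value] for key, value in mapping.items()}, classes


# Hypothetical sample table; column names are made up for illustration.
rows = [
    {"order_id": 1, "status": "F", "priority": "1-URGENT"},
    {"order_id": 2, "status": "O", "priority": "3-MEDIUM"},
    {"order_id": 3, "status": "F", "priority": "3-MEDIUM"},
]

mappings = to_key_value_mappings(rows, "order_id")
status_labels, status_classes = encode_labels(mappings["status"])
```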
To handle memorization errors, DeepMapping couples the learned neural network with a lightweight auxiliary data structure that corrects them. The auxiliary structure is also designed so that DeepMapping can handle insertions, deletions, and updates efficiently without retraining the mapping.

### Key Contributions

1. **Novel Hybrid Data Representation**:
   - **Compact Multi-task Neural Network Model**: Captures the correlation between keys (input features) and values (labels).
   - **Auxiliary Precision-guaranteeing Structure**: Stores, in compressed form, the data that the model misclassifies, and records which data exists, to ensure query accuracy.
2. **Multi-task Hybrid Architecture Search (MHAS)**:
   - Uses deep reinforcement learning to adaptively adjust the number and size of shared and private layers, minimizing the overall size of the hybrid architecture.
3. **Workflow Supporting Insertion, Deletion and Update**:
   - Proposes a lazy update process: modifications are applied in the auxiliary structure, and retraining of the neural network model is triggered only when the size of the auxiliary structure exceeds a threshold.

### Experimental Results

The experimental results show that DeepMapping outperforms existing baseline methods on TPC-H, TPC-DS, synthetic datasets, and real-world datasets, achieving a speedup of up to 15 times in scenarios with limited memory capacity and significantly reducing I/O and decompression costs.

### Summary

By combining deep learning with auxiliary data structures, DeepMapping offers a novel and effective way to achieve efficient data compression and fast queries on edge devices, addressing the shortcomings of existing methods in accuracy and efficiency.
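
The hybrid design and lazy update workflow summarized above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `HybridMapping` class and its method names are hypothetical, a plain Python function stands in for the trained network, an uncompressed `dict` and `set` stand in for the compressed auxiliary structure, and the retraining threshold counts corrections rather than bytes.

```python
class HybridMapping:
    """Toy hybrid of an imperfect model and an exact auxiliary structure."""

    def __init__(self, data, model, retrain_threshold):
        self.model = model                      # stand-in for the learned network
        self.retrain_threshold = retrain_threshold
        self.existing = set(data)               # records which keys exist
        # Keep only the entries that the model predicts incorrectly.
        self.corrections = {k: v for k, v in data.items() if model(k) != v}

    def lookup(self, key):
        if key not in self.existing:
            return None                         # key was never stored or was deleted
        if key in self.corrections:
            return self.corrections[key]        # auxiliary structure overrides the model
        return self.model(key)

    def upsert(self, key, value):
        """Insert or update without retraining: touch only the auxiliary structure."""
        self.existing.add(key)
        if self.model(key) == value:
            self.corrections.pop(key, None)
        else:
            self.corrections[key] = value

    def delete(self, key):
        self.existing.discard(key)
        self.corrections.pop(key, None)

    def needs_retraining(self):
        # Lazy update: retrain only once the auxiliary structure grows too large.
        return len(self.corrections) > self.retrain_threshold


def toy_model(key):
    return key % 3                              # pretend this was learned from the data


data = {k: k % 3 for k in range(10)}
data[4] = 2                                     # one entry the model gets wrong
store = HybridMapping(data, toy_model, retrain_threshold=1)

value_corrected = store.lookup(4)               # served from the corrections
value_predicted = store.lookup(5)               # served by the model alone
value_missing = store.lookup(42)                # key does not exist

store.upsert(5, 0)                              # update that contradicts the model
store.upsert(7, 0)                              # a second contradicting update
retrain_now = store.needs_retraining()
```

In this sketch, queries stay exact because every model mistake is overridden by a stored correction, and the model itself is only revisited when `needs_retraining()` reports that the corrections have outgrown the chosen threshold.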