Delta Tensor: Efficient Vector and Tensor Storage in Delta Lake

Zhiwei Bao, Liu Liao-Liao, Zhiyu Wu, Yifan Zhou, Dan Fan, Michal Aibin, Yvonne Coady, Andrew Brownsword
2024-05-13
Abstract: The exponential growth of artificial intelligence (AI) and machine learning (ML) applications has necessitated the development of efficient storage solutions for vector and tensor data. This paper presents a novel approach for tensor storage in a Lakehouse architecture using Delta Lake. By adopting the multidimensional array storage strategy from array databases and sparse encoding methods to Delta Lake tables, experiments show that this approach has demonstrated notable improvements in both space and time efficiencies when compared to traditional serialization of tensors. These results provide valuable insights for the development and implementation of optimized vector and tensor storage solutions in data-intensive applications, contributing to the evolution of efficient data management practices in AI and ML domains in cloud-native environments.
Distributed, Parallel, and Cluster Computing; Databases; Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper explores how to store vector and tensor data efficiently for artificial intelligence (AI) and machine learning (ML) applications. Specifically:

1. **Background and Motivation**:
   - Training and operating large language models (LLMs) and other foundation models requires handling massive datasets.
   - These datasets often contain multimodal data such as text, speech, images, and videos, and can reach petabyte scale.
   - The data takes the form of vectors and tensors, and simple serialization is an inefficient way to store it.

2. **Research Objectives**:
   - Explore and develop efficient tensor storage techniques, particularly for cloud computing environments, to improve both space and time efficiency.
   - Investigate how to store tensors efficiently on the Delta Lake storage layer, leveraging the advantages of cloud object storage (e.g., Amazon S3).
   - Propose a new method for optimizing tensor storage and validate its effectiveness through experiments.

3. **Specific Issues**:
   - Tensor data is currently typically stored in databases as serialized binary files, which use storage space inefficiently.
   - For sparse tensors, existing methods waste a significant amount of storage space.
   - This research aims to apply existing storage and encoding techniques to tensors to improve storage and processing efficiency.

4. **Main Contributions**:
   - Designed and implemented 5 different tensor storage methods and evaluated their compression ratios and read/write performance.
   - Focused on efficient tensor storage in cloud object storage environments, a relatively unexplored area in previous research.

Through this research, the authors hope to provide more efficient data management practices for future AI and ML applications, especially in cloud computing environments.
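To make the point about sparse tensors concrete, the following is a minimal sketch comparing plain serialization with a coordinate-list (COO) sparse encoding. It is not the paper's implementation: the summary does not name the five storage methods, and COO is used here only as one well-known sparse encoding. The helper names (`dense_bytes`, `to_coo`, `coo_bytes`, `from_coo`), the float32/int32 widths, and the example tensor are all assumptions made for illustration, and nothing here touches Delta Lake itself.

```python
import struct
from itertools import product

def dense_bytes(values):
    """Serialize every element, zeros included, as little-endian float32."""
    return struct.pack(f"<{len(values)}f", *values)

def to_coo(values, shape):
    """Split a row-major flattened tensor into (coordinates, nonzero values)."""
    coords, nonzeros = [], []
    # itertools.product varies the last index fastest, matching row-major order.
    for flat_index, coord in enumerate(product(*(range(n) for n in shape))):
        value = values[flat_index]
        if value != 0.0:
            coords.append(coord)
            nonzeros.append(value)
    return coords, nonzeros

def coo_bytes(coords, nonzeros):
    """Serialize coordinates as int32 and nonzero values as float32."""
    flat_coords = [i for coord in coords for i in coord]
    return (struct.pack(f"<{len(flat_coords)}i", *flat_coords)
            + struct.pack(f"<{len(nonzeros)}f", *nonzeros))

def from_coo(coords, nonzeros, shape):
    """Rebuild the row-major flattened tensor from its COO parts."""
    size = 1
    for n in shape:
        size *= n
    out = [0.0] * size
    for coord, value in zip(coords, nonzeros):
        flat_index = 0
        for i, n in zip(coord, shape):
            flat_index = flat_index * n + i
        out[flat_index] = value
    return out

# Hypothetical example: a 4 x 8 x 8 tensor with only three nonzero entries.
shape = (4, 8, 8)
values = [0.0] * (4 * 8 * 8)
values[0] = 1.5      # coordinate (0, 0, 0)
values[83] = 2.0     # coordinate (1, 2, 3): 1*64 + 2*8 + 3
values[255] = -0.25  # coordinate (3, 7, 7)

coords, nonzeros = to_coo(values, shape)
dense = dense_bytes(values)
sparse = coo_bytes(coords, nonzeros)

print(len(dense))   # 1024 bytes: 256 elements * 4 bytes
print(len(sparse))  # 48 bytes: 3 nonzeros * (3 int32 indices + 1 float32 value)
```

Under these assumed widths, COO costs `(ndim + 1) * 4` bytes per nonzero entry while plain serialization costs 4 bytes per element, so COO only pays off when fewer than roughly `1 / (ndim + 1)` of the entries are nonzero. For dense tensors the encoding would be larger than the plain serialization, which is one reason a storage layer might offer several formats.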