A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge

Yikun Han, Chunjiang Liu, Pengfei Wang
2023-10-18
Abstract: A vector database is used to store high-dimensional data that cannot be characterized by traditional DBMS. Although there are not many articles describing existing vector database architectures or introducing new ones, the approximate nearest neighbor search (ANNS) problem behind vector databases has been studied for a long time, and a considerable number of related algorithmic articles can be found in the literature. This article attempts to comprehensively review relevant algorithms to provide a general understanding of this booming research area. Our framework categorises these studies by their approach to solving the ANNS problem: hash-based, tree-based, graph-based, and quantization-based. We then present an overview of existing challenges for vector databases. Lastly, we sketch how vector databases can be combined with large language models to provide new possibilities.
Databases, Artificial Intelligence
What problem does this paper attempt to address?
The paper surveys research progress and open challenges in vector databases and considers how vector databases can be combined with large language models. Specifically:

1. **Research Background**:
   - Vector databases store high-dimensional data that traditional Database Management Systems (DBMS) cannot manage effectively.
   - Although few articles describe existing or new vector database architectures, the underlying Approximate Nearest Neighbor Search (ANNS) problem has been studied for a long time, and the related algorithms are well documented in the literature.
2. **Research Purpose**:
   - The paper aims to comprehensively review the related algorithms and give an overall understanding of this rapidly developing research field.
   - It classifies the studies by how they solve the ANNS problem: hash-based, tree-based, graph-based, and quantization-based methods.
   - It also outlines the existing challenges of vector databases and discusses how combining them with large language models can open new possibilities.
3. **Main Content**:
   - **Storage Technologies**: Sharding, partitioning, caching, and replication, which help improve the scalability and performance of vector databases.
   - **Search Technologies**: A detailed introduction to algorithms for Nearest Neighbor Search (NNS) and Approximate Nearest Neighbor Search (ANNS), such as tree-based methods (KD-tree, Ball-tree, R-tree, M-tree) and hash-based methods (Locality-Sensitive Hashing, Spectral Hashing, Deep Hashing).
4. **Core Contributions**:
   - Through a comprehensive literature review, the paper offers a thorough understanding of vector databases and related technologies.
   - It explores the potential of integrating vector databases with large language models and points out directions for the development of the field.
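To make the NNS/ANNS distinction above concrete, here is a minimal sketch in pure Python. It is an illustration, not code from the paper: the names `exact_nearest` and `HyperplaneLSH`, the choice of cosine similarity, and the parameters `num_planes` and `seed` are all assumptions. It contrasts an exact linear scan with random-hyperplane hashing, one common form of Locality-Sensitive Hashing.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def exact_nearest(query, vectors):
    """Exact NNS: scan every stored vector and return the index of the most similar one."""
    return max(range(len(vectors)), key=lambda i: cosine_similarity(query, vectors[i]))


class HyperplaneLSH:
    """Hypothetical random-hyperplane LSH index (illustrative only).

    Each vector is hashed to a tuple of sign bits, one per random hyperplane.
    Vectors separated by a small angle are likely to share the same tuple.
    """

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_planes)]
        self.buckets = {}   # signature -> list of vector indices
        self.vectors = []

    def _signature(self, vec):
        return tuple(dot(plane, vec) >= 0.0 for plane in self.planes)

    def add(self, vec):
        idx = len(self.vectors)
        self.vectors.append(vec)
        self.buckets.setdefault(self._signature(vec), []).append(idx)
        return idx

    def query(self, vec):
        """Approximate NNS: rank only the candidates in the query's bucket.

        Returns None when the bucket is empty, and may miss the true nearest
        neighbor if it was hashed into a different bucket.
        """
        candidates = self.buckets.get(self._signature(vec), [])
        if not candidates:
            return None
        return max(candidates, key=lambda i: cosine_similarity(vec, self.vectors[i]))
```

The trade-off is the one the survey's taxonomy is built around: `exact_nearest` always finds the best match but touches every vector, while `HyperplaneLSH.query` inspects a single bucket and accepts the risk of a miss. Practical systems usually reduce that risk by probing several hash tables or neighboring buckets, which this sketch omits.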
In summary, this paper aims to provide a comprehensive framework for research on vector databases by systematically reviewing existing technologies and proposing new development directions.