Abstract: A vector database is used to store high-dimensional data that cannot be characterized by traditional DBMS. Although few articles describe existing vector database architectures or introduce new ones, the approximate nearest neighbor search (ANNS) problem behind vector databases has been studied for a long time, and a considerable number of related algorithmic articles can be found in the literature. This article attempts to comprehensively review relevant algorithms to provide a general understanding of this booming research area. Our framework categorises these studies by their approach to solving the ANNS problem: hash-based, tree-based, graph-based, and quantization-based. We then present an overview of existing challenges for vector databases. Lastly, we sketch how vector databases can be combined with large language models to provide new possibilities.

What problem does this paper attempt to address?

The paper primarily explores the research progress and challenges of vector databases and proposes combining vector databases with large language models. Specifically:

1. **Research Background**:
   - Vector databases are used to store high-dimensional data, which cannot be effectively managed by traditional Database Management Systems (DBMS).
   - Although few articles currently describe existing or new vector database architectures, the underlying Approximate Nearest Neighbor Search (ANNS) problem has been studied for a long time, and related algorithms are well documented in the literature.
2. **Research Purpose**:
   - The paper aims to comprehensively review related algorithms and provide an overall understanding of this rapidly developing research field.
   - The paper classifies these studies by how they solve the ANNS problem: hash-based, tree-based, graph-based, and quantization-based methods.
   - The paper also outlines the existing challenges of vector databases and discusses how combining vector databases with large language models can provide new possibilities.
3. **Main Content**:
   - **Storage Technologies**: Sharding, partitioning, caching, and replication, which help improve the scalability and performance of vector databases.
   - **Search Technologies**: A detailed introduction to algorithms for Nearest Neighbor Search (NNS) and ANNS, such as tree-based methods (KD-tree, Ball-tree, R-tree, M-tree) and hash-based methods (Locality-Sensitive Hashing, Spectral Hashing, Deep Hashing).
4. **Core Contributions**:
   - Through a comprehensive literature review, the paper offers a thorough understanding of vector databases and related technologies.
   - It explores the potential for future integration of vector databases with large language models, pointing out a direction for the development of this field.

In summary, this paper aims to provide a comprehensive framework for the research of vector databases by systematically reviewing existing technologies and proposing new development directions.
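
To make the distinction between exact NNS and hash-based ANNS concrete, the following is a minimal sketch and not the paper's own method. It assumes random-hyperplane hashing as one simple instance of Locality-Sensitive Hashing, and the names `HyperplaneLSH`, `brute_force_nn`, `num_bits`, and `seed` are hypothetical choices made for illustration. Random hyperplanes group vectors by angle; re-ranking the candidates by squared Euclidean distance is a simplifying assumption here.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_nn(vectors, query):
    """Exact NNS: scan every stored vector and return the closest index."""
    return min(range(len(vectors)),
               key=lambda i: squared_distance(vectors[i], query))


class HyperplaneLSH:
    """Illustrative hash-based ANNS index (hypothetical, single hash table)."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit of the hash code.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = {}   # hash code -> list of vector indices
        self.vectors = []

    def _hash(self, v):
        # A bit records which side of the hyperplane the vector falls on.
        return tuple(dot(p, v) >= 0.0 for p in self.planes)

    def add(self, v):
        idx = len(self.vectors)
        self.vectors.append(v)
        self.buckets.setdefault(self._hash(v), []).append(idx)
        return idx

    def query(self, q):
        """Approximate NNS: only vectors in the query's bucket are compared."""
        candidates = self.buckets.get(self._hash(q), [])
        if not candidates:
            return None  # empty bucket: this sketch does not probe neighbours
        return min(candidates,
                   key=lambda i: squared_distance(self.vectors[i], q))
```

The approximate query inspects a single bucket rather than the whole dataset, which is where the speed-up comes from, and it can miss the true nearest neighbor when that neighbor hashes to a different bucket. Practical systems typically reduce this risk with several hash tables or by probing nearby buckets.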

A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge

Survey of Vector Database Management Systems

Vector Database Management Techniques and Systems

Vector database management systems: Fundamental concepts, use-cases, and current challenges

When Large Language Models Meet Vector Databases: A Survey

Quantixar: High-performance Vector Data Management System

Approximate Vector Set Search: A Bio-Inspired Approach for High-Dimensional Spaces

Fast Search In Large-Scale Image Database Using Vector Quantization

Vector Spatial Big Data Storage and Optimized Query Based on the Multi-Level Hilbert Grid Index in HBase

Manu: A Cloud Native Vector Database Management System

An efficient encoding algorithm for vector quantization based on subvector technique

Vector Quantization for Recommender Systems: A Review and Outlook

Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search

Fast, Approximate Vector Queries on Very Large Unstructured Datasets

HBaseSpatial: A Scalable Spatial Data Storage Based on HBase

Foundations of Vector Retrieval

Indexing very high-dimensional sparse and quasi-sparse vectors for similarity searches

An Effective NoSQL-Based Vector Map Tile Management Approach

Vector and Line Quantization for Billion-scale Similarity Search on GPUs

Hash Learning with Variable Quantization for Large-scale Retrieval