Abstract: Approximate Nearest Neighbor Search (ANNS) plays a pivotal role in modern search and recommendation systems, especially in emerging LLM applications such as Retrieval-Augmented Generation. A growing body of work harnesses the parallel computing capabilities of GPUs to meet the substantial compute demands of ANNS. However, existing systems focus primarily on offline scenarios and overlook a distinct requirement of online applications: real-time insertion of new vectors. This limitation makes them inefficient in real-world deployments, and because earlier architectures rely on serial execution streams, they struggle to support real-time insertion effectively. In this paper, we introduce a novel Real-Time Adaptive Multi-Stream GPU ANNS System (RTAMS-GANNS). Our work makes three key contributions: 1) We examine the real-time insertion mechanisms of existing GPU ANNS systems and find that they rely on repeated copying and memory allocation, which significantly hinders real-time performance on GPUs. To address this, we introduce a dynamic vector insertion algorithm based on memory blocks, which includes in-place rearrangement. 2) To enable real-time vector insertion in parallel, we introduce a multi-stream parallel execution mode, in contrast to existing systems that operate serially within a single stream. Our system uses a dynamic resource pool that allows multiple streams to execute concurrently without additional execution blocking. 3) Extensive experiments and comparisons show that our approach handles varying QPS levels effectively across different datasets, reducing latency by 40% to 80%. The proposed system has also been deployed in real-world industrial search and recommendation systems that serve hundreds of millions of users daily, where it has achieved good results.
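
The abstract does not spell out the block-based insertion algorithm, so the following is only a minimal CPU-side sketch of the general idea it names: storing vectors in fixed-size memory blocks so that an insert never reallocates or copies existing data. The class `BlockedVectorStore`, the `BLOCK_SIZE` value, and the `allocations` counter are hypothetical names chosen for illustration, not part of RTAMS-GANNS, and the in-place rearrangement step is not modeled here.

```python
from typing import List, Sequence

BLOCK_SIZE = 4  # vectors per block; hypothetical value for illustration


class BlockedVectorStore:
    """Append-only vector store backed by fixed-size blocks (illustrative)."""

    def __init__(self, dim: int, block_size: int = BLOCK_SIZE) -> None:
        self.dim = dim
        self.block_size = block_size
        self.blocks: List[List[float]] = []  # each block is allocated exactly once
        self.count = 0                       # number of vectors stored so far
        self.allocations = 0                 # number of block allocations performed

    def insert(self, vec: Sequence[float]) -> int:
        """Write one vector into the next free slot and return its id."""
        if len(vec) != self.dim:
            raise ValueError("dimension mismatch")
        block_idx, slot = divmod(self.count, self.block_size)
        if block_idx == len(self.blocks):
            # Allocate one new block; existing blocks are never copied or moved.
            self.blocks.append([0.0] * (self.block_size * self.dim))
            self.allocations += 1
        start = slot * self.dim
        self.blocks[block_idx][start:start + self.dim] = list(vec)
        self.count += 1
        return self.count - 1

    def get(self, idx: int) -> List[float]:
        """Read back the vector with the given id."""
        if not 0 <= idx < self.count:
            raise IndexError(idx)
        block_idx, slot = divmod(idx, self.block_size)
        start = slot * self.dim
        return self.blocks[block_idx][start:start + self.dim]


store = BlockedVectorStore(dim=2)
ids = [store.insert([float(i), float(i) + 0.5]) for i in range(10)]
```

With ten inserts and four vectors per block, the sketch allocates three blocks in total, whereas a single contiguous buffer that grows on demand would have to copy all previously stored vectors each time it is resized. A GPU implementation would presumably draw such blocks from preallocated device memory, but that detail is an assumption rather than something the abstract states.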

DF-GAS: a Distributed FPGA-as-a-Service Architecture Towards Billion-Scale Graph-based Approximate Nearest Neighbor Search

FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search

A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighborhood Search

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

Processing-In-Hierarchical-Memory Architecture for Billion-Scale Approximate Nearest Neighbor Search

NDSEARCH: Accelerating Graph-Traversal-Based Approximate Nearest Neighbor Search through Near Data Processing

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs

Bridging Software-Hardware for CXL Memory Disaggregation in Billion-Scale Nearest Neighbor Search

Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform

Co-design Hardware and Algorithm for Vector Search

Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph

L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform

Proxima: Near-storage Acceleration for Graph-based Approximate Nearest Neighbor Search in 3D NAND

SA-GNAS: Seed Architecture Expansion for Efficient Large-scale Graph Neural Architecture Search

Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search

Optimizing Graph-based Approximate Nearest Neighbor Search: Stronger and Smarter

DGNN-Booster: A Generic FPGA Accelerator Framework For Dynamic Graph Neural Network Inference

NDPGNN: A Near-Data Processing Architecture for GNN Training and Inference Acceleration

ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms

NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models