Abstract: Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are more efficient in practice than the other methods proposed in the literature. The growing volume and dimensionality of data necessitate scalable techniques for ANNS. To this end, prior work has explored parallelizing graph-based ANNS on GPUs to leverage their massive parallelism. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the dataset and the generated graph index to reside entirely in GPU memory, or (ii) partition the dataset into small independent shards, each of which fits in GPU memory, and search these shards on the GPU. While the first approach fails to handle large datasets due to the limited memory available on the GPU, the second delivers poor performance on large datasets due to high data traffic over the low-bandwidth PCIe bus. We introduce BANG, a first-of-its-kind technique for graph-based ANNS on GPU for billion-scale datasets that cannot fit entirely in GPU memory. BANG stands out by harnessing a compressed form of the dataset on a single GPU to perform distance computations while efficiently accessing the graph index kept in host memory, enabling efficient ANNS on large graphs within the limited GPU memory. BANG incorporates highly optimized GPU kernels and proceeds in phases that run concurrently on the GPU and CPU. Notably, on billion-size datasets, we achieve throughput 40x-200x higher than the competing methods at a high recall of 0.9. Additionally, BANG is the best in cost- and power-efficiency among the competing methods from the recent Billion-Scale Approximate Nearest Neighbour Search Challenge.
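
The split described in the abstract, distances computed from compressed vectors while neighbour lists come from a separately stored graph, can be illustrated with a minimal CPU-only sketch. It is not BANG's implementation. The grid-based scalar quantizer, the toy chain graph, and the names `quantize`, `approx_dist`, and `greedy_search` are all assumptions made for illustration, and the abstract does not specify the compression scheme or the traversal details.

```python
import heapq

STEP = 0.5  # assumed grid spacing for the toy quantizer

def quantize(vec, step=STEP):
    """Lossy stand-in for compression: snap each coordinate to a grid."""
    return tuple(round(x / step) for x in vec)

def approx_dist(query, code, step=STEP):
    """Squared Euclidean distance between the query and a decoded code."""
    return sum((q - c * step) ** 2 for q, c in zip(query, code))

def greedy_search(graph, codes, query, start, k=1, beam=4):
    """Best-first graph traversal that ranks nodes by approximate distance."""
    visited = {start}
    d0 = approx_dist(query, codes[start])
    frontier = [(d0, start)]   # min-heap of nodes still to expand
    best = [(d0, start)]       # scored nodes, trimmed to the `beam` closest
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= beam and d > best[-1][0]:
            break              # closest unexpanded node cannot improve the beam
        for nbr in graph[node]:            # adjacency lookup (the graph index)
            if nbr not in visited:
                visited.add(nbr)
                dn = approx_dist(query, codes[nbr])  # uses compressed data only
                heapq.heappush(frontier, (dn, nbr))
                best.append((dn, nbr))
        best = sorted(best)[:beam]
    return [node for _, node in best[:k]]

# Toy data: five points on a line, linked as a chain.
points = {i: (float(i), 0.0) for i in range(5)}
codes = {i: quantize(p) for i, p in points.items()}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

nearest = greedy_search(graph, codes, (3.1, 0.0), start=0, k=2)
```

In this sketch only `codes` is touched during distance computation and only `graph` during neighbour expansion, which is the separation that lets the two structures live in different memories.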

Theano-based Large-Scale Visual Recognition with Multiple GPUs

InstantTrace: Fast Parallel Neuron Tracing on GPUs

Large Scale Recurrent Neural Network on GPU

Theano-MPI: a Theano-based Distributed Training Framework

Performance of Convolution Neural Network based on Multiple GPUs with Different Data Communication Models

Large Scale Artificial Neural Network Training Using Multi-GPUs

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

NeRF-XL: Scaling NeRFs with Multiple GPUs

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture

Towards Efficient Large-Scale Graph Neural Network Computing

Performance Analysis of GPU-Based Convolutional Neural Networks

CuLDA_CGS: Solving Large-scale LDA Problems on GPUs

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Survey on Large Scale Neural Network Training

Data-parallel distributed training of very large models beyond GPU capacity

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Partial FC: Training 10 Million Identities on a Single Machine

TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs