Abstract: Due to the high complexity of constructing exact k-nearest neighbor graphs, approximate construction has become a popular research topic. The NN-Descent algorithm is one of the representative in-memory algorithms. To effectively handle large datasets, existing state-of-the-art solutions combine the divide-and-conquer approach and the NN-Descent algorithm, where large datasets are divided into multiple partitions, and a subgraph is constructed for each partition before all the subgraphs are merged, reducing the memory pressure significantly. However, such solutions fail to address inefficiencies in large-scale k-nearest neighbor graph construction. In this paper, we propose L-FNNG, a novel solution for accelerating large-scale k-nearest neighbor graph construction on a CPU-FPGA heterogeneous platform. The CPU is responsible for dividing data and determining the order of partition processing, while the FPGA executes all construction tasks to utilize the acceleration capability fully. To accelerate the execution of construction tasks, we design an efficient FPGA accelerator, which includes the Block-based Scheduling (BS) and Useless Computation Aborting (UCA) techniques to address the problems of memory access and computation in the NN-Descent algorithm. We also propose an efficient scheduling strategy that includes a KD-tree-based data partitioning method and a hierarchical processing method to address scheduling inefficiency. We evaluate L-FNNG on a Xilinx Alveo U280 board hosted by a 64-core Xeon server. On multiple large-scale datasets, L-FNNG achieves, on average, 2.3× construction speedup over the state-of-the-art GPU-based solution.
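For readers unfamiliar with NN-Descent, the following is a minimal software sketch of its core idea: a neighbor of a neighbor is likely to be a neighbor. It is not the paper's FPGA design. The function names (`sq_dist`, `nn_descent`), the parameters, and the simplifications (no neighbor sampling, no new/old flags) are assumptions made for illustration. The `bound` argument shows plain partial-distance pruning, a generic way to stop a distance computation that can no longer matter; the abstract does not describe how UCA works internally, so this should not be read as the UCA technique itself.

```python
import random

def sq_dist(a, b, bound=float("inf")):
    """Squared Euclidean distance; returns None once the running sum reaches `bound`."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total >= bound:
            return None  # cannot beat the current worst neighbor, so stop early
    return total

def nn_descent(points, k, max_iters=10, seed=0):
    """Simplified NN-Descent: returns, for each point, k neighbor ids sorted by distance."""
    rng = random.Random(seed)
    n = len(points)
    # Start from k random neighbors per point, stored as {neighbor_id: distance}.
    graph = []
    for i in range(n):
        cand = rng.sample([j for j in range(n) if j != i], k)
        graph.append({j: sq_dist(points[i], points[j]) for j in cand})

    def try_insert(i, j):
        """Replace i's worst neighbor with j if j is strictly closer."""
        if j in graph[i]:
            return False
        worst = max(graph[i], key=graph[i].get)
        d = sq_dist(points[i], points[j], bound=graph[i][worst])
        if d is None:
            return False
        del graph[i][worst]
        graph[i][j] = d
        return True

    for _ in range(max_iters):
        # Reverse neighbors: u is a reverse neighbor of v if v is in u's list.
        reverse = [[] for _ in range(n)]
        for u in range(n):
            for v in graph[u]:
                reverse[v].append(u)
        updates = 0
        for v in range(n):
            # Local join: every pair around v is a candidate neighbor pair.
            around = list(set(graph[v]) | set(reverse[v]))
            for x in range(len(around)):
                for y in range(x + 1, len(around)):
                    a, b = around[x], around[y]
                    updates += try_insert(a, b)
                    updates += try_insert(b, a)
        if updates == 0:
            break  # no list changed in this pass, so the graph has converged
    return [sorted(g, key=g.get) for g in graph]
```

In this sketch the local join touches neighbor lists scattered across memory and computes many distances that are immediately discarded, which is the kind of irregular access and wasted work the abstract attributes to NN-Descent.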

What problem does this paper attempt to address?

The paper aims to address the efficiency issues in large-scale k-nearest neighbor (KNN) graph construction. Specifically, it proposes a new method called L-FNNG, which accelerates large-scale KNN graph construction on a CPU-FPGA heterogeneous platform. The main points are:

1. **Algorithm Bottleneck Analysis**:
   - The paper first analyzes the performance bottlenecks of the existing NN-Descent algorithm on the CPU. The computation phase occupies most of the execution time, which indicates that high-dimensional vector operations are the main bottleneck. In addition, irregular memory access patterns and a large amount of useless computation lower the algorithm's execution efficiency.
2. **Problems with Existing Solutions**:
   - Current methods that combine the divide-and-conquer strategy with the NN-Descent algorithm relieve the memory pressure of large-scale datasets, but they remain inefficient during subgraph merging. For example, randomly partitioning the dataset can place a node's actual neighbors in different partitions, so most edges within each subgraph connect distant nodes. Building these edges adds to the computational burden, and they are then replaced during the final merge.
3. **Core Contributions of L-FNNG**:
   - Proposes an efficient FPGA accelerator design, including the Block-based Scheduling (BS) and Useless Computation Aborting (UCA) techniques, to optimize memory access and reduce unnecessary computation.
   - Develops a new scheduling strategy that uses a KD-tree to partition the dataset and a hierarchical processing method, so that nodes connect to their nearest neighbors as early as possible and convergence is faster.
   - In experimental evaluations on a Xilinx Alveo U280 board hosted by a 64-core Xeon server, L-FNNG achieved an average construction speedup of 2.3× over the state-of-the-art GPU-based solution.

In summary, L-FNNG addresses the efficiency issues in large-scale KNN graph construction by optimizing the algorithm's execution, improving the data scheduling strategy, and fully leveraging the parallel computing capabilities of the FPGA.
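The partitioning argument above can be checked with a small experiment. The sketch below compares a random split with a median-split KD-tree partition and measures how often a point's exact nearest neighbor lands in the same partition. It is an illustrative assumption of how such a partitioner might look, not the paper's implementation; the names `kd_partition`, `random_partition`, and `same_partition_rate`, and the `max_size` parameter, are hypothetical.

```python
import random

def kd_partition(points, ids, max_size, depth=0):
    """Split `ids` at the median of one coordinate, cycling through the axes,
    until every partition holds at most `max_size` points."""
    if len(ids) <= max_size:
        return [ids]
    axis = depth % len(points[ids[0]])
    ordered = sorted(ids, key=lambda i: points[i][axis])
    mid = len(ordered) // 2
    return (kd_partition(points, ordered[:mid], max_size, depth + 1)
            + kd_partition(points, ordered[mid:], max_size, depth + 1))

def random_partition(ids, num_parts, seed=0):
    """Shuffle the ids and deal them round-robin into `num_parts` partitions."""
    shuffled = ids[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[p::num_parts] for p in range(num_parts)]

def same_partition_rate(points, parts):
    """Fraction of points whose exact nearest neighbor shares their partition."""
    part_of = {i: p for p, part in enumerate(parts) for i in part}
    n = len(points)
    hits = 0
    for i in range(n):
        # Brute-force nearest neighbor; fine for a small demonstration set.
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
        )
        hits += part_of[i] == part_of[nearest]
    return hits / n

if __name__ == "__main__":
    rng = random.Random(0)
    points = [(rng.random(), rng.random()) for _ in range(400)]
    ids = list(range(len(points)))
    kd_parts = kd_partition(points, ids, max_size=100)
    rand_parts = random_partition(ids, num_parts=len(kd_parts))
    print("KD-tree partitions:", same_partition_rate(points, kd_parts))
    print("random partitions: ", same_partition_rate(points, rand_parts))
```

With a spatial split, only points near a partition boundary lose their nearest neighbor to another partition, whereas a random split into p equal parts keeps it in the same partition only about 1/p of the time. That gap is why subgraphs built on random partitions contain many edges that the merge step later replaces.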

L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform
