Abstract: Large-scale graph processing is challenging because of graph size and irregular memory access patterns, which degrade performance on common architectures such as CPUs and GPUs. Recent research has explored accelerating graph processing with Field Programmable Gate Arrays (FPGAs). FPGAs can provide very efficient acceleration thanks to reconfigurable on-chip resources. Although limited, these resources offer a larger design space than CPUs and GPUs. We propose an approach in which data are preprocessed into small chunks with an optimized graph partitioning technique for execution on FPGA accelerators. The chunks, located on the host, are streamed directly into a customized memory layer implemented in the FPGA, which is tightly coupled with the processing elements responsible for executing the graph algorithm. This reduces application memory access latency, which is crucial to large-scale graph computing performance. This work presents a hardware design that, combined with graph partitioning, achieves high-performance and potentially scalable handling of large graphs (i.e., graphs with millions of vertices and billions of edges in current scenarios) when running popular graph algorithms. For the PageRank algorithm, the proposed framework is 56 times faster than a CPU (a multicore with 16 logical cores in our reference experiments), and 2.5 times and 4 times faster than state-of-the-art FPGA and GPU solutions, respectively (the FPGA has 15 compute units and the reference GPU has 128 streaming multiprocessors in our experiments). For the Single-Source Shortest Path (SSSP) algorithm, we achieve speedups of up to 65x, 26x, and 18x compared to CPU, GPU, and FPGA works, respectively. Lastly, for the Weakly Connected Component (WCC) algorithm, our framework achieves speedups of up to 403x compared to the CPU, 7.4x compared to the GPU, and 10.3x compared to the FPGA alternatives.
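To make the chunked processing idea concrete, the following is a minimal software sketch of PageRank computed over edge chunks, where each chunk stands in for one block streamed from the host. It is only an illustration under stated assumptions: the abstract does not specify the partitioning technique, so grouping edges by destination-vertex interval is an assumption, and the names `partition_edges` and `pagerank_chunked` are hypothetical and not part of the proposed framework.

```python
from collections import defaultdict


def partition_edges(edges, num_vertices, chunk_size):
    """Group edges into chunks by destination-vertex interval.

    Assumption: interval partitioning is used here only for illustration;
    it is not necessarily the partitioning described in the abstract.
    """
    chunks = defaultdict(list)
    for src, dst in edges:
        chunks[dst // chunk_size].append((src, dst))
    num_chunks = (num_vertices + chunk_size - 1) // chunk_size
    return [chunks[i] for i in range(num_chunks)]


def pagerank_chunked(edges, num_vertices, chunk_size=2,
                     damping=0.85, iterations=20):
    """Run PageRank by visiting one edge chunk at a time.

    Vertices without outgoing edges are not redistributed in this sketch,
    so rank mass is conserved only when every vertex has an out-edge.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    chunks = partition_edges(edges, num_vertices, chunk_size)
    rank = [1.0 / num_vertices] * num_vertices

    for _ in range(iterations):
        new_rank = [(1.0 - damping) / num_vertices] * num_vertices
        # Each chunk models one block streamed to the accelerator; all
        # writes of a chunk fall inside one destination interval.
        for chunk in chunks:
            for src, dst in chunk:
                new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Small usage example: a directed 4-cycle.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle_ranks = pagerank_chunked(cycle_edges, num_vertices=4)
```

Because every write within a chunk targets a bounded range of vertices, the working set per chunk stays small, which is the property a tightly coupled on-chip memory layer could exploit.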

Asynchronous Parallel Dijkstra's Algorithm on Intel Xeon Phi Processor - How to Accelerate Irregular Memory Access Algorithm.

Asynchronous Parallel Dijkstra’s Algorithm on Intel Xeon Phi Processor

A Parallel Dynamic Programming Algorithm on a Multi-Core Architecture

Enhanced OpenMP Algorithm to Compute All-Pairs Shortest Path on x86 Architectures

Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs

Energy-Aware Loop Parallelism Maximization for Multi-core DSP Architectures

Efficient Parallel D-Core Decomposition at Scale

Improving Performance of Dynamic Programming Via Parallelism and Locality on Multicore Architectures

Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor

Performance analysis and optimization of molecular dynamics simulation on Godson-T many-core processor

Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU

Preliminary Investigation Of Accelerating Molecular Dynamics Simulation On Godson-T Many-Core Processor

DIMMining: Pruning-Efficient and Parallel Graph Mining on Near-Memory-Computing

Performance Evaluation of Parallel Graphs Algorithms Utilizing Graphcore IPU

Training Large Scale Deep Neural Networks on the Intel Xeon Phi Many-Core Coprocessor

A Case for In-Memory Random Scatter-Gather for Fast Graph Processing

An optimized architecture for accelerating graph computing on FPGAs

Investigating Memory Optimization of Hash-Index for Next Generation Sequencing on Multi-Core Architecture

A real-time parallel implementation of Douglas-Peucker polyline simplification algorithm on shared memory multi-core processor computers

Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU