Abstract: In this paper, we build TC-Stream, a high-performance graph processing system specialized for triangle counting on graphs with up to tens of billions of edges, a scale that significantly exceeds the device memory capacity of Graphics Processing Units (GPUs). Triangle counting is a widely studied problem in data mining and social network analysis. As the scale of graph data grows, portions of the graph must be loaded iteratively. In the existing literature, graphs with billions of edges are processed on distributed systems, which is cost-intensive. Many disk-based triangle counting systems have also been proposed for CPU architectures, but their performance is poor. To solve these problems, we propose TC-Stream, which addresses three issues: 1) In power-law graphs, the amount of work per vertex or edge is uneven, so different tasks place different demands on computing and memory resources. For graph data that fits in GPU device memory, we propose a vertex-parallel approach combined with vertex reordering to balance the workload as evenly as possible; 2) We design a binary-search-based set intersection method to maximize parallelism on the GPU; 3) For graph data that exceeds the GPU device memory capacity, we develop a novel vertical partitioning algorithm that guarantees independent computation on each partition, so that three processes, i.e., computation on the GPU, data transfer between the CPU's main memory and the SSD, and communication between the CPU and the GPU, can be fully overlapped. Moreover, TC-Stream optimizes the edge-iterator model and benefits from multi-thread parallelism.
Extensive experiments on large-scale datasets show that TC-Stream running on a single Tesla V100 GPU is 2.4-6x and 1.8-4.4x faster than the state-of-the-art single-machine in-memory triangle counting system and GPU-based triangle counting system, respectively, and 2.4x faster than the state-of-the-art out-of-core distributed system PDTL running on an 8-node cluster when processing graph data with 42.5 billion edges, which demonstrates the high performance and cost-effectiveness of TC-Stream.
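
To make the edge-iterator model and the binary-search-based set intersection concrete, the following is a minimal sequential sketch in Python. It is not the TC-Stream implementation: the function names (`build_oriented_csr`, `contains`, `count_triangles`), the CSR layout, and the choice to orient each edge from the lower to the higher (degree, id) rank are assumptions made for illustration only. The degree-based orientation stands in for the vertex reordering mentioned in the abstract, and the per-edge loop is the part a GPU would spread across threads.

```python
from bisect import bisect_left


def build_oriented_csr(num_vertices, edges):
    """Build a CSR graph in which every undirected edge is kept once,
    directed from the endpoint with lower (degree, id) rank to the higher one.
    This orientation is an illustrative assumption, not the paper's scheme."""
    unique = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    degree = [0] * num_vertices
    for u, v in unique:
        degree[u] += 1
        degree[v] += 1

    adjacency = [[] for _ in range(num_vertices)]
    for u, v in unique:
        if (degree[u], u) < (degree[v], v):
            adjacency[u].append(v)
        else:
            adjacency[v].append(u)

    offsets, targets = [0], []
    for neighbors in adjacency:
        neighbors.sort()  # sorted lists are required for binary search
        targets.extend(neighbors)
        offsets.append(len(targets))
    return offsets, targets


def contains(targets, lo, hi, key):
    """Binary search for key in the sorted slice targets[lo:hi]."""
    i = bisect_left(targets, key, lo, hi)
    return i < hi and targets[i] == key


def count_triangles(offsets, targets):
    """Edge-iterator counting: for each directed edge (u, v), intersect the
    out-neighbor lists of u and v by probing the shorter list into the longer."""
    total = 0
    for u in range(len(offsets) - 1):
        for e in range(offsets[u], offsets[u + 1]):
            v = targets[e]
            a_lo, a_hi = offsets[u], offsets[u + 1]
            b_lo, b_hi = offsets[v], offsets[v + 1]
            if a_hi - a_lo > b_hi - b_lo:
                a_lo, a_hi, b_lo, b_hi = b_lo, b_hi, a_lo, a_hi
            for i in range(a_lo, a_hi):
                if contains(targets, b_lo, b_hi, targets[i]):
                    total += 1
    return total


if __name__ == "__main__":
    # Complete graph on four vertices.
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(count_triangles(*build_oriented_csr(4, k4)))
```

Because the orientation follows a total order on the vertices, each triangle is found exactly once, at the edge joining its two lowest-ranked vertices. Every binary-search probe is independent of the others, which is what makes this style of intersection a natural fit for fine-grained parallel execution.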

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Real-time viewing of large images based on multi-core

A Parallel GPU-Based Approach to Clustering Very Fast Data Streams

Parallel Processing of Dynamic Continuous Queries over Streaming Data Flows

Accelerating Geospatial Analysis on GPUs Using CUDA

Parallelization of the Kriging Algorithm in Stochastic Simulation with GPU Accelerators

GPUSCAN: GPU-Based Parallel Structural Clustering Algorithm for Networks

Parallel data mining techniques on Graphics Processing Unit with Compute Unified Device Architecture (CUDA)

GPU-based Dynamic Quad Stream for Forest Rendering

Accelerating Genome-Wide Association Studies Using CUDA Compatible Graphics Processing Units

Parallelization of Spectral Clustering Algorithm on Multi-Core Processors and GPGPU

Accelerating Fast Fourier Transforms Using Hadoop and CUDA

Balancing CPU and GPU: Real-Time Visualization of Large Scale 3D Scanning Models

An Incremental Iterative Acceleration Architecture in Distributed Heterogeneous Environments With GPUs for Deep Learning

TC-Stream: Large-Scale Graph Triangle Counting on a Single Machine Using GPUs

High-speed Visualization of Time-Varying Data in Large-Scale Structural Dynamic Analyses with a GPU

An efficient parallel ISODATA algorithm based on Kepler GPUs

GPUSCAN$^{++}$: Efficient Structural Graph Clustering on GPUs

Improving Barnes-Hut t-SNE Algorithm in Modern GPU Architectures with Random Forest KNN and Simulated Wide-Warp

A Parallel Scheme for Large-scale Polygon Rasterization on CUDA-enabled GPUs

Large-scale FFT on GPU clusters