Abstract: With the recent popularity of neural networks comes the need for efficient serving of inference workloads. A neural network inference workload can be represented as a computational graph with nodes as operators transforming multidimensional tensors. The tensors can be transposed and/or tiled in a combinatorially large number of ways, some configurations leading to accelerated inference. We propose TGraph, a neural graph architecture that allows screening for fast configurations of the target computational graph, thus representing an artificial intelligence (AI) tensor compiler in contrast to the traditional heuristics-based compilers. The proposed solution improves mean Kendall's $\tau$ across layout collections of TpuGraphs from 29.8% of the reliable baseline to 67.4% of TGraph. We estimate the potential CO$_2$ emission reduction associated with our work to be equivalent to over 50% of the total household emissions in the areas hosting AI-oriented data centers.

What problem does this paper attempt to address?

The paper aims to make neural network inference workloads run more efficiently. Specifically, the authors propose TGraph, an architecture based on Graph Neural Networks (GNNs) that screens for fast configurations of the target computational graph and thereby acts as a learned tensor compiler.

### Main Problem Description

As neural networks have grown in popularity, serving inference efficiently has become crucial. The inference workload of a neural network can be represented as a computational graph whose nodes are operators that transform multidimensional tensors. These tensors can be transposed and/or tiled in a combinatorially large number of ways, and some configurations significantly accelerate inference. Traditional compilers, however, rely mainly on heuristic rules to select a configuration. Heuristics are fast, but they do not guarantee the optimal running time.

### Solution

The authors propose TGraph, a Graph Neural Network architecture with a configuration cross-attention mechanism. By learning how different configurations affect the performance of the computational graph, TGraph ranks candidate configurations more accurately than heuristic methods. Specifically, TGraph makes improvements in the following aspects:

1. **Cross-channel self-attention**: captures correlations between different channels and emphasizes important features.
2. **Cross-configuration attention**: lets the model explicitly compare the different configurations within a batch, which improves performance on the ranking task.
3. **Data pre-processing and optimization**: node pruning, configuration deduplication, and compression, which reduce memory usage and speed up training.
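The cross-configuration attention idea in item 2 can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention over the configuration axis with a residual connection and no learned weights; this is one plausible form chosen for illustration, not necessarily the formulation used in TGraph. The function names `softmax` and `cross_config_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_config_attention(embeddings):
    """Mix information across configurations of the same graph.

    embeddings: list of n_configs vectors (each of length d), one vector
    per candidate configuration. Assumption: unparameterized scaled
    dot-product attention, used here only as an illustration.
    """
    d = len(embeddings[0])
    scale = math.sqrt(d)
    mixed = []
    for query in embeddings:
        # Similarity of this configuration to every configuration in the batch.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in embeddings
        ]
        weights = softmax(scores)  # attention weights sum to 1
        # Weighted average of all configuration embeddings.
        context = [
            sum(w * key[c] for w, key in zip(weights, embeddings))
            for c in range(d)
        ]
        # Residual connection keeps the configuration's own features.
        mixed.append([q + c for q, c in zip(query, context)])
    return mixed


if __name__ == "__main__":
    configs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in cross_config_attention(configs):
        print(row)
```

Because each output vector depends on every configuration in the batch, a downstream scoring layer sees each candidate relative to its competitors rather than in isolation, which suits a ranking objective.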
### Experimental Results

Experiments show that TGraph significantly outperforms the baseline on the TpuGraphs benchmark. Across its layout collections, mean Kendall's τ rises from 29.8% for the baseline to 67.4% for TGraph. The authors also estimate that the potential CO₂ emission reduction is equivalent to over 50% of the total household emissions in the areas hosting AI-oriented data centers.

### Social Impact

The research also emphasizes the importance of optimizing AI workloads in data centers. According to the authors' estimates, improving the execution efficiency of AI workloads can substantially reduce energy consumption and carbon dioxide emissions, which helps to mitigate climate change.

In conclusion, the paper addresses a key problem in optimizing neural network inference workloads by proposing the Graph Neural Network architecture TGraph, and it shows the architecture's potential in practical applications.
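For readers unfamiliar with the ranking metric quoted in the experimental results, the following sketch computes Kendall's τ between predicted scores and measured runtimes. It implements the simple tau-a variant, which does not correct for ties; which variant the benchmark uses is not stated here, so treat that choice as an assumption. The function name `kendall_tau` and the sample numbers are hypothetical.

```python
from itertools import combinations


def kendall_tau(predicted, measured):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    predicted: model scores, one per configuration.
    measured:  ground-truth runtimes for the same configurations.
    Tied pairs count as neither concordant nor discordant.
    """
    n = len(predicted)
    if n != len(measured):
        raise ValueError("inputs must have the same length")
    if n < 2:
        raise ValueError("need at least two configurations")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: the pair is ordered the same way in both lists.
        sign = (predicted[i] - predicted[j]) * (measured[i] - measured[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


if __name__ == "__main__":
    runtimes = [1.0, 2.0, 3.0, 4.0]   # made-up measured runtimes
    scores = [0.1, 0.3, 0.2, 0.9]     # made-up model predictions
    # 5 concordant pairs, 1 discordant pair out of 6 -> 4/6
    print(round(kendall_tau(scores, runtimes), 3))
```

A value of 1 means the predicted ordering matches the measured ordering exactly, 0 means no rank correlation, and -1 means the ordering is reversed. The 29.8% and 67.4% figures above correspond to τ values of 0.298 and 0.674.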

Graph neural networks with configuration cross-attention for tensor compilers
