Abstract: Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into *embedding approximation variance* in the forward stage and *stochastic gradient variance* in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails better generalization compared to existing methods.

What problem does this paper attempt to address?

This paper addresses two obstacles to training large-scale graph neural networks (GNNs): the excessive computational burden caused by dependencies between nodes, and the high variance of the gradients estimated by sampling methods. Specifically:

1. **Computational burden**: When training GNNs on large-scale graphs, the representation (embedding) of each node depends on the embeddings of its neighboring nodes. The number of nodes involved grows exponentially with the number of layers, a phenomenon known as "neighbor explosion", which makes training GNNs on large-scale graphs very difficult.
2. **High-variance problem**: Existing sampling methods accelerate GNN training by reducing the amount of computation, but they are usually based on graph-structure information and ignore the dynamics of the optimization process. The resulting stochastic gradient estimates have high variance, which slows convergence and hurts the generalization performance of the model.

To solve these problems, the authors propose a Minimal Variance Sampling (MVS) strategy. It uses (approximate) gradient information to adaptively sample nodes so that the variance of the stochastic gradient is minimized, and it explicitly reduces the variance introduced by embedding approximation, thereby improving the convergence speed and generalization ability of the model.

### Main contributions of the paper

- **Theoretical analysis**: The authors theoretically analyze the variance of existing sampling methods and show that the variance of any sampling method can be decomposed into the embedding approximation variance in the forward stage and the stochastic gradient variance in the backward stage. To achieve a faster convergence rate, both types of variance must be reduced.
- **Algorithm design**: A decoupled variance-reduction strategy is proposed. It uses (approximate) gradient information to adaptively sample nodes with minimal variance, and it explicitly reduces the variance introduced by embedding approximation.
- **Experimental verification**: The authors show theoretically and experimentally that the proposed MVS method achieves faster convergence and better generalization, even with a relatively small mini-batch size.

### Formula summary

- **Objective function**:
  \[ f(\theta)=\mathbb{E}_{\omega_L}\left[f^{(L)}_{\omega_L}\left(\mathbb{E}_{\omega_{L-1}}\left[f^{(L-1)}_{\omega_{L-1}}\left(\ldots\mathbb{E}_{\omega_1}\left[f^{(1)}_{\omega_1}(\theta)\right]\ldots\right)\right]\right)\right] \]
  where \(\omega_\ell\) is the random variable for node sampling at the \(\ell\)-th layer.
- **Gradient-variance decomposition**:
  \[ \mathbb{E}[\|\tilde{g}-\nabla f(\theta)\|^2]=\text{bias}(V)+\text{variance}(G) \]
  where \(\text{bias}(V)\) is the bias caused by the embedding approximation in the forward stage, and \(\text{variance}(G)\) is the standard variance caused by mini-batch sampling.
- **Optimal sampling probability**:
  \[ p_i=\min\left(1,\frac{\bar{g}_i}{\mu}\right) \]
  where \(\bar{g}_i\) is an upper bound on the gradient norm of the \(i\)-th sample, and \(\mu\) is a threshold determined by solving the following optimization problem over \(N\) samples with mini-batch size \(B\):
  \[ \min_{p_i}\sum_{i=1}^N\frac{\bar{g}_i^2}{p_i}\quad\text{subject to}\quad\sum_{i=1}^N p_i=B,\quad p_i\in(0,1]\quad\forall i \]

### Conclusion

By introducing the minimal-variance sampling strategy, this paper effectively solves the large-scale graph neural
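The optimal sampling probability from the formula summary can be computed with a short sort-and-scan. The sketch below is illustrative only and is not taken from the paper's code. The function names `mvs_probabilities` and `sample_nodes` are hypothetical, and two things are assumptions: \(B\) is read as the target mini-batch size, and nodes are drawn independently and re-weighted by \(1/p_i\). It finds the threshold \(\mu\) such that \(p_i=\min(1,\bar{g}_i/\mu)\) sums to \(B\), assuming every bound \(\bar{g}_i\) is strictly positive.

```python
import random


def mvs_probabilities(grad_bounds, batch_size):
    """Return (probs, mu) with probs[i] = min(1, grad_bounds[i] / mu)
    and sum(probs) equal to batch_size (up to rounding)."""
    n = len(grad_bounds)
    if not 0 < batch_size <= n:
        raise ValueError("batch_size must be between 1 and len(grad_bounds)")
    if any(g <= 0 for g in grad_bounds):
        raise ValueError("gradient-norm bounds must be strictly positive")

    # Scan the bounds from largest to smallest. The k largest bounds are
    # capped at probability 1; the rest share the remaining budget
    # (batch_size - k) in proportion to their bound.
    ordered = sorted(grad_bounds, reverse=True)
    tail_sum = sum(ordered)  # sum of the bounds that are not capped yet
    mu = tail_sum / batch_size
    for k in range(batch_size):
        mu = tail_sum / (batch_size - k)
        if ordered[k] <= mu:  # largest uncapped bound already gives p <= 1
            break
        tail_sum -= ordered[k]  # cap this bound at probability 1

    probs = [min(1.0, g / mu) for g in grad_bounds]
    return probs, mu


def sample_nodes(probs, rng):
    """Assumed sampling scheme: include node i independently with
    probability probs[i] and attach the importance weight 1 / probs[i]."""
    return [(i, 1.0 / p) for i, p in enumerate(probs) if rng.random() < p]


if __name__ == "__main__":
    probs, mu = mvs_probabilities([10.0, 1.0, 1.0], batch_size=2)
    print(probs, mu)
    print(sample_nodes(probs, random.Random(0)))
```

For the bounds `[10.0, 1.0, 1.0]` and a batch size of 2, the scan caps the first node at probability 1 and gives \(\mu = 2\), so the probabilities are `[1.0, 0.5, 0.5]`. Under the independent-inclusion assumption the realized batch size varies around \(B\), and the \(1/p_i\) weights keep the weighted sum of per-node gradients unbiased.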

Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks

Bandit Samplers for Training Graph Neural Networks

Stochastic Training of Graph Convolutional Networks with Variance Reduction

Provably Convergent Subgraph-wise Sampling for Fast GNN Training

Graph Sampling for Scalable and Expressive Graph Neural Networks on Homophilic Graphs

Feature-Oriented Sampling for Fast and Scalable GNN Training.

Hierarchical Estimation for Effective and Efficient Sampling Graph Neural Network

LMC: Fast Training of GNNs via Subgraph Sampling with Provable Convergence

A Local Graph Limits Perspective on Sampling-Based GNNs

ScatterSample: Diversified Label Sampling for Data Efficient Graph Neural Network Learning

A learnable sampling method for scalable graph neural networks

Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining

Adaptive Sampling Towards Fast Graph Representation Learning

Evaluating graph neural networks under graph sampling scenarios

Distributed Matrix-Based Sampling for Graph Neural Network Training

Sketch-GNN: Scalable Graph Neural Networks with Sublinear Training Complexity

Learning Stochastic Graph Neural Networks With Constrained Variance

Variants for Advanced Training Methods

Adaptive Sampling Temporal Graph Network

Layer-diverse Negative Sampling for Graph Neural Networks

Adaptive Sampling for Deep Learning via Efficient Nonparametric Proxies