Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks

Weilin Cong, Rana Forsati, Mahmut Kandemir, Mehrdad Mahdavi
DOI: https://doi.org/10.48550/arXiv.2006.13866
2021-09-06
Abstract: Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy to speed up training large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into *embedding approximation variance* in the forward stage and *stochastic gradient variance* in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails better generalization compared to the existing methods.
Machine Learning
What problem does this paper attempt to address?
The paper addresses two problems in training large-scale graph neural networks (GNNs): the computational burden caused by dependencies between nodes, and the high variance of the stochastic gradients produced by existing sampling methods. Specifically:

1. **Computational burden**: When training GNNs on large-scale graphs, the representation (embedding) of each node depends on the embeddings of its neighboring nodes. The number of nodes involved grows exponentially with the number of layers (the "neighbor explosion" phenomenon), which makes training GNNs on large-scale graphs very difficult.
2. **High-variance problem**: Existing sampling methods accelerate GNN training by reducing the amount of computation, but they are usually based on graph-structure information and ignore the dynamics of the optimization process. The resulting stochastic gradient estimates have high variance, which slows convergence and hurts the generalization performance of the model.

To address these problems, the authors propose a Minimal Variance Sampling (MVS) strategy. It uses gradient information and embedding information to adaptively sample nodes so that variance is minimized, which reduces the variance of the stochastic gradients during training and improves the convergence speed and generalization ability of the model.

### Main contributions of the paper

- **Theoretical analysis**: The authors theoretically analyze the variance of existing sampling methods and show that the variance of any sampling method can be decomposed into the embedding approximation variance in the forward stage and the stochastic-gradient variance in the backward stage. To achieve a faster convergence rate, both types of variance must be reduced.
- **Algorithm design**: A decoupled variance-reduction strategy is proposed. It uses (approximate) gradient information to adaptively sample nodes with minimal variance, and it explicitly reduces the variance introduced by embedding approximation.
- **Experimental verification**: The authors show theoretically and empirically that the proposed MVS method achieves faster convergence and better generalization performance even with a relatively small mini-batch size.

### Formula summary

- **Objective function**:
  \[ f(\theta)=\mathbb{E}_{\omega_L}\left[f^{(L)}_{\omega_L}\left(\mathbb{E}_{\omega_{L-1}}\left[f^{(L-1)}_{\omega_{L-1}}\left(\ldots\mathbb{E}_{\omega_1}\left[f^{(1)}_{\omega_1}(\theta)\right]\ldots\right)\right]\right)\right] \]
  where \(\omega_\ell\) represents the random variable for node sampling at the \(\ell\)-th layer.
- **Gradient-variance decomposition**:
  \[ \mathbb{E}[\|\tilde{g}-\nabla f(\theta)\|^2]=\text{bias}(V)+\text{variance}(G) \]
  where \(\text{bias}(V)\) is the bias caused by the embedding approximation in the forward stage, and \(\text{variance}(G)\) is the standard variance caused by mini-batch sampling.
- **Optimal sampling probability**:
  \[ p_i=\min\left(1,\frac{\bar{g}_i}{\mu}\right) \]
  where \(\bar{g}_i\) is an upper bound on the gradient norm of the \(i\)-th sample, and \(\mu\) is a threshold obtained by solving the following optimization problem:
  \[ \min_{p_i}\sum_{i=1}^N\frac{\bar{g}_i^2}{p_i}\quad\text{subject to}\quad\sum_{i=1}^N p_i=B,\quad p_i\in(0,1]\quad\forall i \]

### Conclusion

By introducing the minimal-variance sampling strategy, this paper effectively solves the large-scale graph neural
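The optimal sampling probability from the formula summary can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names `min_variance_probs` and `variance_objective` are hypothetical, \(B\) is assumed to be an integer expected mini-batch size, and the threshold \(\mu\) is found with a simple sort-and-cap search, which is one straightforward way to satisfy \(\sum_i p_i = B\) with \(p_i=\min(1,\bar{g}_i/\mu)\).

```python
import numpy as np


def min_variance_probs(g_bar, budget):
    """Return (p, mu) with p_i = min(1, g_bar_i / mu) and sum(p) == budget.

    g_bar  : strictly positive gradient-norm upper bounds, one per node
    budget : integer B with 1 <= B <= len(g_bar) (assumed: expected batch size)
    """
    g = np.asarray(g_bar, dtype=float)
    if g.ndim != 1 or np.any(g <= 0):
        raise ValueError("g_bar must be a 1-D array of positive values")
    if not 1 <= budget <= g.size:
        raise ValueError("budget must satisfy 1 <= budget <= len(g_bar)")

    g_sorted = np.sort(g)[::-1]             # largest bound first
    tail = np.cumsum(g_sorted[::-1])[::-1]  # tail[k] = sum(g_sorted[k:])

    # Cap the k largest entries at p = 1; the remaining entries share
    # the leftover budget (budget - k) in proportion to g_bar.
    for k in range(budget):
        mu = tail[k] / (budget - k)
        if g_sorted[k] <= mu:  # largest uncapped entry already has p <= 1
            break
    return np.minimum(1.0, g / mu), mu


def variance_objective(g_bar, p):
    """Objective from the formula summary: sum_i g_bar_i^2 / p_i."""
    return float(np.sum(np.asarray(g_bar, dtype=float) ** 2 / p))


# Toy example: one node with a large gradient bound, four with small ones.
g_bar = [10.0, 1.0, 1.0, 1.0, 1.0]
p, mu = min_variance_probs(g_bar, budget=2)
uniform = np.full(len(g_bar), 2 / len(g_bar))

print("p =", p, "mu =", mu)
print("objective (adaptive):", variance_objective(g_bar, p))
print("objective (uniform): ", variance_objective(g_bar, uniform))
```

In this toy case the node with the large bound is always sampled (\(p_1 = 1\)), the other four nodes get \(p_i = 0.25\), and the objective is 116, versus about 260 for uniform probabilities with the same budget.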