Abstract: Graph Neural Networks (GNNs) have been widely employed for multimodal fusion and embedding. To overcome the over-smoothing issue, residual connections, which were designed to alleviate the vanishing gradient problem in neural networks (NNs), have been adopted in GNNs to incorporate local node information. However, these simple residual connections are ineffective on networks with heterophily, since the roles of both convolutional operations and residual connections in GNNs differ significantly from their roles in classic NNs. Given the specific smoothing characteristic of the graph convolutional operation, deep layers in GNNs are expected to focus on the data that cannot be properly handled by shallow layers. To this end, a novel and universal Difference Residual Connection (DRC), which feeds the difference between the output and the input of the previous layer as the input of the next layer, is proposed. Essentially, DRC is equivalent to inserting layers with the opposite effect (e.g., sharpening) into the network to prevent the excessive effect (e.g., over-smoothing) induced by too many layers with a similar role (e.g., smoothing) in GNNs. From the perspective of optimization, DRC is a gradient descent method that minimizes an objective function with both smoothing and sharpening terms. The analytic solution to this objective function is determined by both graph topology and node attributes, which theoretically proves that DRC can prevent the over-smoothing issue. Extensive experiments demonstrate the superiority of DRC on real networks with both homophily and heterophily, and show that DRC can automatically determine the model depth and adapt to both shallow and deep models with two complementary components.
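
The core wiring described in the abstract (the next layer receives the previous layer's output minus its input) can be illustrated with a small sketch. This is not the authors' implementation: the parameter-free propagation step, the symmetric normalization with self-loops, and the names `normalized_adjacency` and `drc_forward` are assumptions chosen for illustration, and the abstract does not say how the per-layer outputs are combined, so the sketch simply returns them.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, a common GCN-style smoothing operator.

    Assumption: the abstract does not specify the propagation operator;
    this normalization is used here only as a familiar stand-in.
    """
    a = adj + np.eye(adj.shape[0])      # add self-loops, so every degree >= 1
    deg = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def drc_forward(adj, x, num_layers):
    """Hypothetical, weight-free illustration of difference residual wiring.

    Each layer applies one smoothing step. The input to the next layer is
    the difference between the current layer's output and its input.
    Returns the list of per-layer outputs (how they are aggregated is
    left open, since the abstract does not state it).
    """
    a_hat = normalized_adjacency(adj)
    layer_input = x
    outputs = []
    for _ in range(num_layers):
        layer_output = a_hat @ layer_input          # smoothing step
        outputs.append(layer_output)
        layer_input = layer_output - layer_input    # difference fed forward
    return outputs


if __name__ == "__main__":
    # Toy path graph with 3 nodes and 2 features per node.
    adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    for depth, out in enumerate(drc_forward(adj, x, num_layers=3), start=1):
        print(f"layer {depth} output norm: {np.linalg.norm(out):.4f}")
```

In this sketch, the second layer operates on what the first smoothing step changed rather than on the smoothed features themselves, which is one way to read the claim that deeper layers should focus on what shallow layers did not handle. A trainable version would add weight matrices and nonlinearities, which are omitted here.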

Rethinking Residual Connection with Layer Normalization

Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

Is the Skip Connection Provable to Reform the Neural Network Loss Landscape?

Understanding and Improving Layer Normalization

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Mitigating Gradient Overlap in Deep Residual Networks with Gradient Normalization for Improved Non-Convex Optimization

On the importance of network architecture in training very deep neural networks

Towards Understanding the Importance of Shortcut Connections in Residual Networks

Rethinking skip connection model as a learnable Markov chain

Theoretical analysis of skip connections and batch normalization from generalization and optimization perspectives

A Quantitative Insight Into the Role of Skip Connections in Deep Neural Networks of Low Complexity: A Case Study Directed at Fluid Flow Modeling

Hyper-Connections

Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes

Peeking Behind the Curtains of Residual Learning

Why ResNet Works? Residuals Generalize

Normalized Activation Function: Toward Better Convergence

ZNorm: Z-Score Gradient Normalization Accelerating Skip-Connected Network Training without Architectural Modification

Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs

Difference Residual Graph Neural Networks

Evolving Normalization-Activation Layers

Residual Connections Harm Generative Representation Learning