Abstract: We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address issues with residual connections in deep learning models, particularly the trade-off between gradient vanishing and representation collapse. The authors propose a new method, hyper-connections, to replace traditional residual connections. Specifically:

1. **Addressing the Limitations of Residual Connections**:
   - While residual connections effectively mitigate the gradient vanishing problem, their variants (such as Pre-Norm and Post-Norm) still have some limitations.
   - Pre-Norm solves the gradient vanishing problem but may lead to deep representation collapse.
   - Post-Norm alleviates the representation collapse issue but reintroduces gradient vanishing.
2. **Dynamically Adjusting Connection Strength**:
   - Hyper-connections allow the network to dynamically adjust the connection strength between features at different depths and can dynamically reorder layers.
   - This method not only automatically adjusts connection weights during training but also enables sequential or parallel configurations between layers.
3. **Experimental Validation**:
   - The authors validate the effectiveness of hyper-connections in pre-training large-scale language models (including dense and sparse models) through extensive experiments.
   - Experimental results show that in models with 1B and 7B parameters, hyper-connections significantly outperform traditional residual connections, particularly in terms of convergence speed and performance improvement.
   - Experiments on visual tasks also demonstrate similar advantages.

### Summary

The core issue addressed by the paper is exploring a method that can autonomously learn the optimal connection strength to improve neural network performance. By proposing the new concept of hyper-connections, the authors demonstrate the advantages of this method across various tasks, especially in the significant effects observed in large-scale language model pre-training.
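To make the Pre-Norm / Post-Norm distinction and the idea of learnable connection strengths more concrete, here is a minimal pure-Python sketch. The two residual variants follow their standard definitions. The `hyper_connection_block` function is only an illustration of the idea as summarized above, not the paper's exact parameterization: the several parallel hidden streams and the weight names `in_w`, `out_w`, and `mix_w` are assumptions made for this example, and the weights are plain numbers here rather than trained parameters.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, f):
    """Pre-Norm residual: x + f(norm(x)). The identity path is left untouched."""
    return [a + b for a, b in zip(x, f(layer_norm(x)))]


def post_norm_block(x, f):
    """Post-Norm residual: norm(x + f(x)). The sum is re-normalized at every layer."""
    return layer_norm([a + b for a, b in zip(x, f(x))])


def hyper_connection_block(streams, f, in_w, out_w, mix_w):
    """Hypothetical learnable-connection block over n parallel hidden streams.

    streams: list of n vectors, each of dimension d
    in_w:    n weights; the layer input is sum_i in_w[i] * streams[i]
    out_w:   n weights; the layer output is added to stream j scaled by out_w[j]
    mix_w:   n x n weights; new stream j also receives sum_i mix_w[i][j] * streams[i]
    """
    n, d = len(streams), len(streams[0])
    # Weighted combination of the streams forms the layer input.
    layer_in = [sum(in_w[i] * streams[i][k] for i in range(n)) for k in range(d)]
    layer_out = f(layer_in)
    # Each new stream mixes the old streams and adds a scaled copy of the layer output.
    return [
        [
            out_w[j] * layer_out[k]
            + sum(mix_w[i][j] * streams[i][k] for i in range(n))
            for k in range(d)
        ]
        for j in range(n)
    ]


if __name__ == "__main__":
    double = lambda v: [2.0 * t for t in v]  # stand-in for an attention/FFN layer
    x = [1.0, 2.0, 3.0]
    print(pre_norm_block(x, double))
    print(post_norm_block(x, double))
    # With one stream and all weights set to 1, the block is a plain residual: x + f(x).
    print(hyper_connection_block([x], double, [1.0], [1.0], [[1.0]]))
```

In this sketch, setting every weight to 1 with a single stream recovers an ordinary residual connection, while setting `out_w` to 0 skips the layer entirely. That is one way to picture how adjustable connection weights could change the effective arrangement of layers.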

Hyper-Connections

Rethinking Residual Connection with Layer Normalization

Residual Enhanced Multi-Hypergraph Neural Network

Residual Connections Harm Generative Representation Learning

ResiDual: Transformer with Dual Residual Connections

Make Deep Networks Shallow Again

Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

Dynamic Hypergraph Neural Networks

Universal Deep GNNs: Rethinking Residual Connection in GNNs from a Path Decomposition Perspective for Preventing the Over-smoothing

Is the Skip Connection Provable to Reform the Neural Network Loss Landscape?

Gluing Neural Networks Symbolically Through Hyperdimensional Computing

Hypergraph Convolutional Networks via Equivalency between Hypergraphs and Undirected Graphs

A Quantitative Insight Into the Role of Skip Connections in Deep Neural Networks of Low Complexity: A Case Study Directed at Fluid Flow Modeling

Select, Attend, and Transfer: Light, Learnable Skip Connections

Why ResNet Works? Residuals Generalize

Towards Understanding the Importance of Shortcut Connections in Residual Networks

ResidualDroppath: Enhancing Feature Reuse over Residual Connections

Rewiring the Transformer with Depth-Wise LSTMs

Convolutional Networks with Dense Connectivity

Residual Hyperbolic Graph Convolution Networks

Deep Hypergraph Structure Learning