Hyper-Connections

Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou
2024-09-29
Abstract: We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.
Machine Learning, Computation and Language, Computer Vision and Pattern Recognition, Neural and Evolutionary Computing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses a weakness of residual connections in deep learning models: the trade-off between gradient vanishing and representation collapse. The authors propose a new method, hyper-connections, as a replacement for traditional residual connections. Specifically:

1. **Addressing the Limitations of Residual Connections**:
   - Residual connections effectively mitigate gradient vanishing, but their common variants (Pre-Norm and Post-Norm) each have limitations.
   - Pre-Norm solves the gradient vanishing problem but may lead to representation collapse in deep layers.
   - Post-Norm alleviates representation collapse but reintroduces gradient vanishing.
2. **Dynamically Adjusting Connection Strength**:
   - Hyper-connections allow the network to adjust the connection strength between features at different depths and to dynamically reorder layers.
   - The connection weights are learned during training, and the learned weights can arrange layers in sequential or parallel configurations.
3. **Experimental Validation**:
   - The authors validate hyper-connections through extensive experiments on pre-training large language models, including dense and sparse models.
   - In models with 1B and 7B parameters, hyper-connections significantly outperform traditional residual connections, both in convergence speed and in final performance.
   - Experiments on vision tasks show similar advantages.

### Summary

The core question the paper addresses is how a network can learn its own connection strengths rather than rely on a fixed residual scheme. With hyper-connections, the authors demonstrate improvements across several tasks, most notably in large-scale language model pre-training.
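
The contrast between Pre-Norm, Post-Norm, and learnable connection weights can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the summary above gives no formulas, so the parameterization below (several parallel hidden streams, with hypothetical weights `in_w`, `res_w`, and `out_w`) is one plausible reading of "adjusting connection strength". The weights here are static; a variant in which they depend on the input is omitted. `toy_layer` is a hypothetical stand-in for an attention or feed-forward layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_block(h, layer):
    """Pre-Norm residual: h + layer(norm(h)). The identity path is left untouched."""
    out = layer(layer_norm(h))
    return [a + b for a, b in zip(h, out)]

def post_norm_block(h, layer):
    """Post-Norm residual: norm(h + layer(h)). The identity path is renormalized."""
    out = layer(h)
    return layer_norm([a + b for a, b in zip(h, out)])

def hyper_connection_block(streams, layer, in_w, res_w, out_w):
    """Assumed hyper-connection-style block over n parallel hidden streams.

    in_w  (n weights):    how strongly each stream feeds the layer input
    res_w (n x n matrix): how streams are mixed on the residual path
    out_w (n weights):    how strongly the layer output is written to each stream
    All three are hypothetical names; in a real model they would be learned.
    """
    n, d = len(streams), len(streams[0])
    # Weighted mix of the streams forms the single input to the layer.
    layer_in = [sum(in_w[i] * streams[i][k] for i in range(n)) for k in range(d)]
    layer_out = layer(layer_norm(layer_in))
    new_streams = []
    for j in range(n):
        # Residual path: stream j receives a weighted mix of all old streams.
        carried = [sum(res_w[i][j] * streams[i][k] for i in range(n)) for k in range(d)]
        new_streams.append([carried[k] + out_w[j] * layer_out[k] for k in range(d)])
    return new_streams

def toy_layer(x):
    """Hypothetical stand-in for a real sublayer: doubles its input."""
    return [2.0 * v for v in x]

h = [1.0, 2.0, 3.0]
pre = pre_norm_block(h, toy_layer)
post = post_norm_block(h, toy_layer)
# One stream with unit weights: the block behaves like a Pre-Norm residual.
hc = hyper_connection_block([h], toy_layer, in_w=[1.0], res_w=[[1.0]], out_w=[1.0])
# Output weight 0: the layer's contribution is switched off entirely.
skipped = hyper_connection_block([h], toy_layer, in_w=[1.0], res_w=[[1.0]], out_w=[0.0])
```

In this sketch, a single stream with all weights set to 1 gives the same result as the Pre-Norm block, and setting `out_w` to 0 removes the layer's contribution. Intermediate values, if learned, would let the network tune how much each layer contributes at each depth, which is the behavior the summary attributes to hyper-connections.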