Homophily modulates double descent generalization in graph convolution networks

Cheng Shi, Liming Pan, Hong Hu, Ivan Dokmanić
2024-01-23
Abstract: Graph neural networks (GNNs) excel in modeling relational data such as biological, social, and transportation networks, but the underpinnings of their success are not well understood. Traditional complexity measures from statistical learning theory fail to account for observed phenomena like the double descent or the impact of relational semantics on generalization error. Motivated by experimental observations of "transductive" double descent in key networks and datasets, we use analytical tools from statistical physics and random matrix theory to precisely characterize generalization in simple graph convolution networks on the contextual stochastic block model. Our results illuminate the nuances of learning on homophilic versus heterophilic data and predict double descent whose existence in GNNs has been questioned by recent work. We show how risk is shaped by the interplay between the graph noise, feature noise, and the number of training labels. Our findings apply beyond stylized models, capturing qualitative trends in real-world GNNs and datasets. As a case in point, we use our analytic insights to improve performance of state-of-the-art graph convolution networks on heterophilic datasets.
Machine Learning, Disordered Systems and Neural Networks
What problem does this paper attempt to address?
This paper addresses the generalization of graph neural networks (GNNs) in semi-supervised node classification, and specifically how homophily affects double descent in GNNs. Complexity measures from traditional statistical learning theory do not explain observed phenomena such as double descent or the impact of relational semantics on generalization error. To study these issues, the authors use analytical tools from statistical physics and random matrix theory to precisely characterize the generalization of simple graph convolution networks (GCNs) on the contextual stochastic block model (CSBM).

### Main contributions of the paper:

1. **Precise characterization of generalization**: Using these analytical tools, the authors describe the generalization behavior of simple GCNs on the CSBM, in particular how homophilic and heterophilic data affect the generalization error.
2. **Explanation of double descent**: The paper discusses in detail the existence of double descent in GNNs and explains why, in some cases, more training labels can actually harm generalization.
3. **The influence of self-loops**: The study finds that self-loops can improve performance on homophilic data but may hurt performance on heterophilic data. Introducing negative self-loops can improve performance on heterophilic datasets.
4. **Combination of theory and experiment**: The authors provide theoretical analysis and also check the theoretical results experimentally, demonstrating that they carry over to real datasets.

### Key questions:

- **Double descent**: Why is double descent observed in GNNs? How does it behave on different types of graph data (homophilic and heterophilic)?
- **Homophily and heterophily**: How do homophilic and heterophilic data affect the generalization of GNNs? How can model design be adapted to each type of data?
- **The role of self-loops**: What role do self-loops play in GNNs? Why are they effective on homophilic data but potentially ineffective on heterophilic data?

### Theoretical analysis:

- **CSBM model**: The authors study the contextual stochastic block model (CSBM), which can generate graph data with a specified degree of homophily or heterophily.
- **Risk expression**: By analyzing simple GCNs on the CSBM, the authors derive precise expressions for the test risk. These expressions account for the existence of double descent and for how it behaves under different conditions.
- **The influence of self-loops**: By introducing self-loops, the authors show how to adapt the model to different types of graph data. In particular, negative self-loops can significantly improve performance on heterophilic datasets.

### Experimental verification:

- **Experimental setup**: The authors run experiments on multiple datasets, including homophilic datasets (such as Cora) and heterophilic datasets (such as Chameleon and Texas).
- **Experimental results**: The experiments agree with the theoretical analysis. In particular, the double-descent behavior observed under different noise levels and self-loop settings is consistent with the theoretical predictions.

In conclusion, through theoretical analysis and experimental verification, the paper examines the generalization of GNNs in semi-supervised node classification, especially how homophilic and heterophilic data shape double descent. It provides a theoretical basis and practical guidance for understanding and improving GNNs in applications.
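
The setup discussed under "Theoretical analysis" can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a two-class CSBM with edge probabilities `p_in` and `p_out`, Gaussian features with signal strength `mu`, a one-layer linear graph convolution `(A + c·I) X w` with self-loop strength `c`, and ridge regression on the labeled nodes as a simple stand-in for training. All function names, parameter names, and scalings here are hypothetical choices made for illustration.

```python
import numpy as np


def sample_csbm(n, d, p_in, p_out, mu, rng):
    """Sample a two-class CSBM: adjacency matrix, node features, labels."""
    y = rng.choice([-1.0, 1.0], size=n)           # community labels
    same = np.equal.outer(y, y)                   # True where two nodes share a label
    probs = np.where(same, p_in, p_out)           # edge probability for each pair
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(float)           # symmetric, no self-loops
    u = rng.standard_normal(d) / np.sqrt(d)       # hidden feature direction (assumed scaling)
    noise = rng.standard_normal((n, d)) / np.sqrt(d)
    X = np.sqrt(mu) * np.outer(y, u) + noise      # label signal plus feature noise
    return A, X, y


def fit_predict(A, X, y, train_mask, self_loop, ridge):
    """Ridge regression on one-layer graph-convolved features (A + c*I) X."""
    P = A + self_loop * np.eye(A.shape[0])        # self_loop may be negative
    H = P @ X                                     # aggregate features over the whole graph
    H_tr, y_tr = H[train_mask], y[train_mask]     # only labeled nodes enter the loss
    gram = H_tr.T @ H_tr + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(gram, H_tr.T @ y_tr)
    return H @ w                                  # scores for every node


def held_out_accuracy(A, X, y, train_mask, self_loop, ridge):
    """Fraction of unlabeled nodes whose label sign is predicted correctly."""
    scores = fit_predict(A, X, y, train_mask, self_loop, ridge)
    return float(np.mean(np.sign(scores[~train_mask]) == y[~train_mask]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 400, 50
    train_mask = np.zeros(n, dtype=bool)
    train_mask[: n // 2] = True                   # half of the nodes are labeled
    settings = [("homophilic", 0.10, 0.02), ("heterophilic", 0.02, 0.10)]
    for name, p_in, p_out in settings:
        A, X, y = sample_csbm(n, d, p_in, p_out, mu=1.0, rng=rng)
        for c in (-2.0, 0.0, 2.0):
            acc = held_out_accuracy(A, X, y, train_mask, self_loop=c, ridge=0.1)
            print(f"{name:12s} self_loop={c:+.1f} accuracy={acc:.3f}")
```

Sweeping the fraction of labeled nodes in `train_mask` and plotting the held-out error is one way to look for a double-descent shape in this toy setting. Whether it appears depends on the chosen noise levels and regularization.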