Graph as a feature: improving node classification with non-neural graph-aware logistic regression

Simon Delarue, Thomas Bonald, Tiphaine Viard
2024-11-19
Abstract: Graph Neural Networks (GNNs), with their message passing framework that leverages both structural and feature information, have become a standard method for solving graph-based machine learning problems. However, these approaches still struggle to generalise well beyond datasets that exhibit strong homophily, where nodes of the same class tend to connect. This limitation has led to the development of complex neural architectures that pose challenges in terms of efficiency and scalability. In response to these limitations, we focus on simpler and more scalable approaches and introduce Graph-aware Logistic Regression (GLR), a non-neural model designed for node classification tasks. Unlike traditional graph algorithms that use only a fraction of the information accessible to GNNs, our proposed model simultaneously leverages both node features and the relationships between entities. However, instead of relying on message passing, our approach encodes each node's relationships as an additional feature vector, which is then combined with the node's own attributes. Extensive experimental results, conducted within a rigorous evaluation framework, show that our proposed GLR approach outperforms both foundational and sophisticated state-of-the-art GNN models in node classification tasks. Going beyond the traditional limited benchmarks, our experiments indicate that GLR increases generalisation ability while achieving computation-time gains of up to two orders of magnitude compared to its best neural competitor.
Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is the limited generalization ability of Graph Neural Networks (GNNs) across datasets with diverse characteristics. Specifically, existing GNNs perform well on networks that exhibit strong homophily (i.e., networks where nodes of the same class tend to be connected), but struggle on networks that exhibit heterophily (i.e., networks where nodes of different classes tend to be connected). This has led to complex neural architectures which, although they improve performance, pose challenges in terms of efficiency and scalability. Moreover, traditional graph algorithms, although simple and efficient, usually exploit only part of the available information (either the graph structure or the node attributes) and cannot use everything that is accessible to GNNs.

In response to these problems, the paper proposes a non-neural model named **Graph-aware Logistic Regression (GLR)**, which combines graph structure and node feature information through a simple and efficient mechanism to solve the node classification task. GLR encodes the neighborhood of each node as an additional feature vector and combines it with the node's own attributes, thereby using both node features and the relationships between entities. Experimental results show that GLR not only outperforms basic and advanced GNN models in node classification tasks but also reduces computation time by up to two orders of magnitude compared to its best neural competitor.

### Main contributions of the paper:

1. **Introduction of Graph-aware Logistic Regression (GLR)**: A simple non-neural model that uses both the topological structure of the graph and the node attributes as features for node classification.
2. **Superior performance**: Under a rigorous evaluation framework, GLR outperforms GNNs and also exhibits better generalization ability.
3. **High computational efficiency**: Compared with the best-performing GNNs, GLR significantly reduces computation time while maintaining high performance.
4. **Extension of the homophily analysis**: In addition to the common label homophily, the paper introduces feature homophily to explain more comprehensively the factors that influence model performance.

### Solutions:

- **GLR model**: GLR concatenates the neighborhood representation of a node with its original feature vector and feeds the result into a logistic regression model, so that it uses graph structure and node feature information at the same time.
- **Experimental design**: To evaluate model performance fairly, the paper adopts k-fold cross-validation and tests on a variety of real-world networks with different characteristics (such as size, density, and homophily).
- **Baseline models**: The paper compares GLR with multiple GNN models (such as GCN, GraphSAGE, and GAT) and non-neural models (such as diffusion, KNN, and logistic regression).

### Experimental results:

- **Performance of traditional non-neural models**: On certain graphs (such as Actor, Cornell, Wisconsin, and Wikivitals+), simple non-neural methods even outperform recent GNNs.
- **Performance of GLR**: GLR performs well on multiple datasets, outperforming GNNs in accuracy while also achieving a significant reduction in computation time.

Through these contributions, the paper shows that simple and efficient non-neural models can also achieve strong performance on complex graph data.
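
The concatenation mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a node's neighborhood is encoded as its raw row of the adjacency matrix, that there are only two classes, and that the logistic regression is fitted with a hand-written gradient descent so the example needs only the Python standard library. All function names, hyperparameters, and the toy graph (two 4-cliques joined by one edge) are hypothetical.

```python
import math

def adjacency_rows(n_nodes, edges):
    """Dense 0/1 adjacency matrix of an undirected graph, as a list of rows."""
    rows = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        rows[u][v] = 1.0
        rows[v][u] = 1.0
    return rows

def graph_aware_inputs(adjacency, features):
    """Concatenate each node's adjacency row (assumed encoding) with its attributes."""
    return [adj + feat for adj, feat in zip(adjacency, features)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(X, y, lr=0.5, epochs=500, l2=0.01):
    """Binary logistic regression fitted by batch gradient descent (L2 on weights)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Predict class 1 when the linear score is positive, else class 0."""
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in X]

# Toy graph: nodes 0-3 form one clique (class 0), nodes 4-7 another (class 1).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Two attributes per node; nodes 2 and 5 carry deliberately misleading attributes.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
train, test = [0, 1, 6, 7], [2, 3, 4, 5]

adjacency = adjacency_rows(len(labels), edges)
glr_X = graph_aware_inputs(adjacency, features)
y_train = [labels[i] for i in train]

# Graph-aware model: adjacency row + attributes.
w, b = fit_logistic_regression([glr_X[i] for i in train], y_train)
glr_pred = predict([glr_X[i] for i in test], w, b)

# Baseline: attributes only.
w_f, b_f = fit_logistic_regression([features[i] for i in train], y_train)
feat_pred = predict([features[i] for i in test], w_f, b_f)

print("true labels   :", [labels[i] for i in test])
print("features only :", feat_pred)
print("graph-aware   :", glr_pred)
```

On this toy graph the attribute-only baseline misclassifies nodes 2 and 5, whose attributes point to the wrong class, while the graph-aware inputs recover all four test labels. For real datasets a sparse adjacency matrix and an off-the-shelf logistic regression solver would be the more practical choice; the dense lists here only keep the sketch self-contained.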