Graph Knowledge Distillation to Mixture of Experts

Pavel Rumiantsev, Mark Coates
2024-06-17
Abstract: In terms of accuracy, Graph Neural Networks (GNNs) are the best architectural choice for the node classification task. Their drawback in real-world deployment is the latency that emerges from the neighbourhood processing operation. One solution to the latency issue is to perform knowledge distillation from a trained GNN to a Multi-Layer Perceptron (MLP), where the MLP processes only the features of the node being classified (and possibly some pre-computed structural information). However, the performance of such MLPs in both transductive and inductive settings remains inconsistent for existing knowledge distillation techniques. We propose to address the performance concerns by using a specially-designed student model instead of an MLP. Our model, named Routing-by-Memory (RbM), is a form of Mixture-of-Experts (MoE), with a design that enforces expert specialization. By encouraging each expert to specialize in a certain region of the hidden representation space, we demonstrate experimentally that it is possible to derive considerably more consistent performance across multiple datasets.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The main aim of this paper is to address the latency that Graph Neural Networks (GNNs) incur in real-world deployments, especially on large-scale graph data. Its objectives can be summarized as follows:

1. **Addressing the latency issue of GNNs**: Although GNNs perform very well on node classification, every layer must process the neighbors of a node to compute its prediction. This leads to high computational cost and long inference times, and on large graphs the neighborhood processing becomes resource-intensive.
2. **Improving knowledge distillation techniques**: Previous works have tried to transfer the knowledge of a trained GNN to a Multi-Layer Perceptron (MLP) through knowledge distillation, which exploits the efficiency and scalability of MLPs. However, existing distillation techniques perform inconsistently across settings (inductive and transductive), especially on large graph datasets.
3. **Proposing a new student model**: The paper proposes a new architecture, Routing-by-Memory (RbM), as the student model. RbM is a variant of the Mixture-of-Experts (MoE) model that encourages each expert to specialize in a specific region of the network's hidden representation space, with the goal of more consistent performance.
4. **Enhancing the performance consistency of the student model**: By using RbM instead of a traditional MLP, the paper shows that the student's performance can be made considerably more consistent even with a fixed number of parameters. Its experiments report that RbM improves performance across datasets of different sizes and outperforms existing baseline models.
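To make the idea of expert specialization in item 3 more concrete, here is a minimal sketch of hard routing in a Mixture-of-Experts layer. It is not the paper's RbM mechanism, whose details are not given in this summary. It assumes a hypothetical nearest-key rule: each expert owns a key vector in the hidden space, and a hidden representation is sent to the expert whose key is closest. The names `nearest_expert`, `linear`, `moe_forward`, `expert_keys`, and `experts` are all illustrative.

```python
def nearest_expert(hidden, expert_keys):
    """Return the index of the expert whose key is closest to `hidden`.

    Assumption: closeness is squared Euclidean distance. A real router
    could use a different similarity or a learned gating network.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(expert_keys)),
               key=lambda i: sq_dist(hidden, expert_keys[i]))


def linear(weights, bias, x):
    """Apply one linear layer: weights is a list of rows, bias a list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def moe_forward(hidden, expert_keys, experts):
    """Route `hidden` to a single expert and return (expert_index, output).

    Each expert is a (weights, bias) pair, i.e. one linear layer here.
    Only the selected expert is evaluated, so the cost per node does not
    grow with the number of experts.
    """
    idx = nearest_expert(hidden, expert_keys)
    weights, bias = experts[idx]
    return idx, linear(weights, bias, hidden)


# Two hypothetical experts covering different regions of a 2-D hidden space.
expert_keys = [[0.0, 0.0], [10.0, 10.0]]
experts = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # expert 0: identity map
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),  # expert 1: doubles its input
]

near_origin = moe_forward([1.0, 1.0], expert_keys, experts)
far_from_origin = moe_forward([9.0, 9.0], expert_keys, experts)
```

Because each input is handled by exactly one expert, every expert only ever sees inputs from its own region of the hidden space, which is one simple way to read "specialization" in this context.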
In summary, the paper improves on existing knowledge distillation techniques by introducing a new student model, RbM, in place of an MLP. The goal is to transfer knowledge from a trained GNN to the student more effectively, which addresses the latency of GNNs in practical applications and makes the student's performance more consistent across settings.
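For readers unfamiliar with knowledge distillation, the following sketch shows one common form of the objective: a weighted sum of the cross-entropy on the true label and the KL divergence between temperature-softened teacher and student distributions. This is a generic soft-label formulation, not necessarily the loss used in the paper. The function names and the `temperature` and `alpha` defaults are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic soft-label distillation loss for a single node.

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL(teacher || student) term computed at the given temperature. The
    temperature**2 factor is the usual rescaling that keeps the soft
    term's gradients comparable across temperatures.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    ce = -log_softmax(student_logits)[label]

    # Soft-label term: KL divergence between softened distributions.
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))

    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# Hypothetical logits for a 3-class node; the teacher would be the trained GNN.
teacher = [2.0, 0.0, -1.0]
matching_student = [2.0, 0.0, -1.0]
mismatched_student = [-1.0, 0.0, 2.0]

loss_match = distillation_loss(matching_student, teacher, label=0)
loss_mismatch = distillation_loss(mismatched_student, teacher, label=0)
```

At inference time the student only needs the features of the node being classified, so the teacher GNN and its neighborhood aggregation are used during training only.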