Abstract: Model pre-training on large text corpora has proven effective for various downstream applications in the NLP domain. In the graph mining domain, an analogy can be drawn: pre-training graph models on large graphs in the hope of benefiting downstream graph applications, which several recent studies have also explored. However, no existing study has investigated pre-training text-plus-graph models on large heterogeneous graphs with abundant textual information (a.k.a. large graph corpora) and then fine-tuning the model on different related downstream applications with different graph schemas. To address this problem, we propose a framework of graph-aware language model pre-training (GALM) on a large graph corpus, which incorporates large language models and graph neural networks, together with a variety of fine-tuning methods for downstream applications. We conduct extensive experiments on Amazon's real internal datasets and large public datasets. Comprehensive empirical results and in-depth analysis demonstrate the effectiveness of our proposed methods, along with lessons learned.

What problem does this paper attempt to address?

The paper addresses graph-aware language model pre-training on a large graph corpus and the transfer of the resulting model to multiple downstream graph applications. Specifically, it asks how to pre-train a model on a large heterogeneous graph (a large graph corpus) that contains multiple entity types with rich textual information, and then how to transfer that pre-trained model to downstream applications whose graph schemas differ from the pre-training graph, so that those applications perform better.

The motivation is that most existing work studies pre-training and transfer learning either on pure text or on graphs with a single schema. Pre-training on large graph corpora that carry both text and graph structure has not been explored, particularly for the case where the downstream applications have edge types different from those of the pre-training graph.

The proposed framework, graph-aware language model pre-training (GALM), combines large language models (LMs) with graph neural networks (GNNs) and uses the graph information in the corpus to improve the LM's representations. The paper also explores several fine-tuning strategies for applying the pre-trained model to downstream tasks such as link prediction, query-product matching, and multi-label node classification. Through extensive experiments on Amazon's internal datasets and on public datasets, the authors verify the effectiveness of the proposed pre-training and fine-tuning strategies and report an in-depth empirical analysis along with lessons learned.
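The general idea of making text representations graph-aware can be illustrated with a toy sketch. This is not the GALM implementation: the summary above does not give architectural details, so a normalized bag-of-words vector stands in for a language model encoder, a single untrained mean-aggregation step stands in for a GNN layer, and no pre-training or fine-tuning is performed. All node names, texts, edges, and function names are hypothetical.

```python
import math

def build_vocab(texts):
    """Map every distinct lowercase token to a fixed index (sorted for determinism)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def embed_text(text, vocab):
    """Stand-in for an LM encoder (assumption): L2-normalized bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def aggregate(node, embeddings, neighbors):
    """Stand-in for one GNN layer (assumption): mean of a node's vector and its neighbors'."""
    group = [embeddings[node]] + [embeddings[n] for n in neighbors.get(node, [])]
    return [sum(column) / len(group) for column in zip(*group)]

def link_score(u_vec, v_vec):
    """Dot-product score for a candidate edge between two nodes."""
    return sum(a * b for a, b in zip(u_vec, v_vec))

# Hypothetical heterogeneous graph: one query node and three product nodes with text.
node_text = {
    "query": "hiking boots",
    "trail_shoes": "trail running shoes",
    "hiking_boots": "waterproof hiking boots",
    "espresso": "espresso maker",
}
# Hypothetical edges, e.g. two products that are frequently bought together.
neighbors = {"trail_shoes": ["hiking_boots"], "hiking_boots": ["trail_shoes"]}

vocab = build_vocab(node_text.values())
text_emb = {node: embed_text(text, vocab) for node, text in node_text.items()}
graph_emb = {node: aggregate(node, text_emb, neighbors) for node in node_text}

# Text alone: the query shares no words with "trail running shoes", so the score is 0.0.
text_only = link_score(text_emb["query"], text_emb["trail_shoes"])
# Graph-aware: the product's vector now carries information from its neighbor.
graph_aware = link_score(graph_emb["query"], graph_emb["trail_shoes"])
print(text_only, graph_aware)
```

In this toy setting the text-only score between the query and the trail shoes is zero, while the graph-aware score is positive because the product inherits signal from its neighbor. A real system would replace both stand-ins with trained models and learn them from the graph rather than relying on fixed averaging.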

Graph-Aware Language Model Pre-Training on a Large Graph Corpus Can Help Multiple Graph Applications
