A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs

Yun Wang, Chrysanthi Kosyfaki, Sihem Amer-Yahia, Reynold Cheng
2024-03-20
Abstract: Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m-dimensional random walk that accounts for the paths specified in a hypothesis. We further optimize its time efficiency and propose PHASEopt. Experiments on real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling in terms of accuracy and time efficiency.
Machine Learning, Databases
What problem does this paper attempt to address?
This paper addresses hypothesis testing on large attributed graphs. Hypothesis testing is a statistical method for drawing conclusions about a population from sample data, and that data is usually represented in tabular form. As graph representations become common in real-world applications, hypothesis testing on graphs is gaining importance. The authors formalize node, edge, and path hypotheses and develop a sampling-based hypothesis testing framework that accommodates existing hypothesis-agnostic graph sampling methods. They also propose the Path-Hypothesis-Aware SamplEr (PHASE), which takes the hypothesis into account and preserves the nodes, edges, or paths relevant to it in order to improve sampling accuracy and efficiency.
The main contributions of the paper are:
1. For attributed graphs, hypotheses are divided into node hypotheses, edge hypotheses, and path hypotheses.
2. A general sampling framework is designed that can adopt common hypothesis-agnostic sampling methods such as random node sampling, simple random walk, and non-backtracking random walk.
3. A new sampler, PHASE, is proposed. It is aware of path hypotheses and preserves the paths relevant to a hypothesis through an m-dimensional random walk.
4. PHASE is further optimized into PHASEopt, which samples neighbors and uses non-backtracking walks to improve time efficiency.
5. Experiments on three real-world datasets demonstrate the advantages of PHASEopt over other methods in terms of accuracy, time, and test significance.
The experimental results show that PHASEopt is at least 20 times faster than PHASE and achieves the best accuracy across hypothesis types and datasets, especially for long paths or when few nodes, edges, or paths are relevant to the hypothesis. PHASEopt also converges faster and runs more efficiently while maintaining accuracy.
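The sample-then-test idea behind the framework can be illustrated with a minimal Python sketch. It is not the authors' implementation and it is not PHASE: it uses a hypothesis-agnostic non-backtracking random walk as the sampler, followed by a one-sample z-test on a node attribute. The toy graph, the attribute values, the hypothesized mean `mu0`, the step cap, and the function names `non_backtracking_walk` and `one_sample_z_test` are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def non_backtracking_walk(adj, start, budget, rng):
    """Collect up to `budget` distinct nodes with a non-backtracking random walk."""
    sampled, seen = [start], {start}
    prev, cur = None, start
    for _ in range(50 * budget):  # step cap (an arbitrary choice) so the walk always ends
        if len(sampled) >= budget or not adj[cur]:
            break
        # Avoid returning to the previous node unless it is the only way out.
        candidates = [n for n in adj[cur] if n != prev] or adj[cur]
        prev, cur = cur, rng.choice(candidates)
        if cur not in seen:
            seen.add(cur)
            sampled.append(cur)
    return sampled


def one_sample_z_test(values, mu0):
    """Two-sided test of H0: mean == mu0, using a normal approximation."""
    se = stdev(values) / math.sqrt(len(values))
    z = (mean(values) - mu0) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Toy attributed graph (hypothetical): 200 nodes on a ring with extra chords,
# each carrying one numeric attribute.
rng = random.Random(0)
N = 200
adj = {i: [(i - 1) % N, (i + 1) % N, (i - 7) % N, (i + 7) % N] for i in range(N)}
attr = {i: rng.gauss(50, 10) for i in range(N)}

# Step 1: sample nodes. Step 2: test the node hypothesis on the sample only.
sample = non_backtracking_walk(adj, start=0, budget=60, rng=rng)
z, p = one_sample_z_test([attr[v] for v in sample], mu0=50)
print(f"sampled {len(sample)} nodes, z = {z:.3f}, p = {p:.3f}")
```

Nodes visited by a random walk are correlated, and on graphs with uneven degrees a walk visits high-degree nodes more often, so the p-value from this sketch is only indicative. A hypothesis-aware sampler such as PHASE would instead steer the walk toward the paths named in the hypothesis; that logic is not reproduced here.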
In conclusion, this paper addresses the problem of efficient and accurate hypothesis testing on large attributed graphs by designing a new path-aware sampling method that improves the performance of testing.