A Latent Logistic Regression Model with Graph Data

Haixiang Zhang, Yingjun Deng, Alan J.X. Guo, Qing-Hu Hou, Ou Wu
DOI: https://doi.org/10.48550/arXiv.2210.05218
2022-10-11
Abstract: Recently, graph (network) data is an emerging research area in artificial intelligence, machine learning and statistics. In this work, we are interested in whether node's labels (people's responses) are affected by their neighbor's features (friends' characteristics). We propose a novel latent logistic regression model to describe the network dependence with binary responses. The key advantage of our proposed model is that a latent binary indicator is introduced to indicate whether a node is susceptible to the influence of its neighbour. A score-type test is proposed to diagnose the existence of network dependence. In addition, an EM-type algorithm is used to estimate the model parameters under network dependence. Extensive simulations are conducted to evaluate the performance of our method. Two public datasets are used to illustrate the effectiveness of the proposed latent logistic regression model.
Methodology
What problem does this paper attempt to address?
The paper asks whether, in graph (network) data, the labels of nodes (for example, people's responses) are affected by the features of their neighbors (for example, friends' characteristics). Specifically, the authors study how to detect and quantify network dependence when the responses are binary. To this end, they propose a logistic regression model with a latent binary indicator that records whether a node is susceptible to its neighbors' features. In simpler terms, the paper asks whether a person's behavior or preference in a social network or similar structure changes because of the characteristics of his or her friends or contacts, and it proposes a new statistical model to detect and quantify that influence.

### Model Features

1. **Introduction of Latent Binary Indicator**: A latent variable \(\zeta_i\) is introduced into the model to represent whether the \(i\)-th node is sensitive to the characteristics of its neighbors.
2. **Network Dependence Detection**: A score-type test is proposed to diagnose whether network dependence is present in the logistic regression model.
3. **Parameter Estimation**: An EM-type algorithm is used to estimate the model parameters under network dependence, with the aim of obtaining consistent, well-behaved estimators.
### Mathematical Expression

The specific form of the model is:

\[
P(Y_i = 1 | X_i, \zeta_i) = \frac{\exp\left(\beta_0 + X_i'\beta + \delta \zeta_i \sum_{j = 1}^n a_{ij} X_j' \beta\right)}{1 + \exp\left(\beta_0 + X_i'\beta + \delta \zeta_i \sum_{j = 1}^n a_{ij} X_j' \beta\right)},
\]

where:

- \(Y_i\) is the binary label of the \(i\)-th node,
- \(X_i\) is the feature vector of the \(i\)-th node,
- \(\zeta_i\) is the latent binary indicator,
- \(a_{ij}\) is an element of the adjacency matrix \(A\), indicating whether there is an edge between node \(i\) and node \(j\),
- \(\beta_0\) is the intercept term,
- \(\beta\) is the regression coefficient vector,
- \(\delta\) represents the strength of a node's dependence on its neighbors.

In addition, the probability distribution of the latent variable \(\zeta_i\) is:

\[
P(\zeta_i = 1 | X_i) = \frac{\exp(\gamma_0 + X_i' \gamma)}{1 + \exp(\gamma_0 + X_i' \gamma)}.
\]

With this model, the authors can study how network structure influences node labels and can detect and quantify that influence.
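
The two formulas above can be evaluated directly. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the toy graph, and the parameter values are all illustrative assumptions, and the adjacency matrix is used exactly as given (no row normalization is applied). The sketch also shows two quantities that follow from the model by elementary probability, under the assumption that the \(\zeta_i\) are independent across nodes and that each \(Y_i\) depends on the latent indicators only through its own \(\zeta_i\): the marginal probability \(P(Y_i = 1 | X)\), obtained by summing over \(\zeta_i \in \{0, 1\}\), and the posterior probability \(P(\zeta_i = 1 | Y_i, X)\), obtained from Bayes' rule. The latter is the kind of weight an EM-type E-step would typically use; whether the paper's algorithm takes exactly this form is not stated in this summary.

```python
import numpy as np


def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))


def response_prob(X, A, beta0, beta, delta, zeta):
    """P(Y_i = 1 | X, zeta_i) for every node i; zeta is a 0/1 vector."""
    own = beta0 + X @ beta           # beta_0 + X_i' beta
    neighbor = A @ (X @ beta)        # sum_j a_ij X_j' beta
    return sigmoid(own + delta * zeta * neighbor)


def susceptibility_prob(X, gamma0, gamma):
    """P(zeta_i = 1 | X_i) for every node i."""
    return sigmoid(gamma0 + X @ gamma)


def marginal_prob(X, A, beta0, beta, delta, gamma0, gamma):
    """P(Y_i = 1 | X), summing over zeta_i in {0, 1}."""
    n = X.shape[0]
    pi = susceptibility_prob(X, gamma0, gamma)
    p1 = response_prob(X, A, beta0, beta, delta, np.ones(n))
    p0 = response_prob(X, A, beta0, beta, delta, np.zeros(n))
    return pi * p1 + (1.0 - pi) * p0


def posterior_susceptibility(y, X, A, beta0, beta, delta, gamma0, gamma):
    """P(zeta_i = 1 | Y_i = y_i, X) via Bayes' rule (an E-step-style weight)."""
    n = X.shape[0]
    pi = susceptibility_prob(X, gamma0, gamma)
    p1 = response_prob(X, A, beta0, beta, delta, np.ones(n))
    p0 = response_prob(X, A, beta0, beta, delta, np.zeros(n))
    lik1 = np.where(y == 1, p1, 1.0 - p1)   # likelihood of y_i if zeta_i = 1
    lik0 = np.where(y == 1, p0, 1.0 - p0)   # likelihood of y_i if zeta_i = 0
    return pi * lik1 / (pi * lik1 + (1.0 - pi) * lik0)


# Toy data: every value below is made up for illustration.
rng = np.random.default_rng(0)
n, d = 6, 2
X = rng.normal(size=(n, d))
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = upper + upper.T                  # symmetric 0/1 adjacency, zero diagonal
beta0, beta, delta = -0.5, np.array([1.0, -1.0]), 0.8
gamma0, gamma = 0.2, np.array([0.5, 0.5])

# Simulate from the model: first the latent indicators, then the labels.
zeta = rng.binomial(1, susceptibility_prob(X, gamma0, gamma))
y = rng.binomial(1, response_prob(X, A, beta0, beta, delta, zeta))

weights = posterior_susceptibility(y, X, A, beta0, beta, delta, gamma0, gamma)
```

Here `weights` holds one posterior probability per node. When \(\delta = 0\) the neighbor term vanishes, the model reduces to ordinary logistic regression, and the posterior weights equal the prior probabilities \(P(\zeta_i = 1 | X_i)\), because the label then carries no information about \(\zeta_i\).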