Abstract: For the degree-corrected stochastic block model in the presence of arbitrary or even adversarial outliers, we develop a convex-optimization-based clustering algorithm that includes a penalization term depending on the positive deviation of a node from the expected number of edges to other inliers. We prove that under mild conditions, this method achieves exact recovery of the underlying clusters. Our synthetic experiments show that our algorithm performs well on heterogeneous networks, and in particular those with Pareto degree distributions, for which outliers have a broad range of possible degrees that may enhance their adversarial power. We also demonstrate that our method allows for recovery with significantly lower error rates compared to existing algorithms.
What problem does this paper attempt to address?
This paper addresses clustering under the degree-corrected stochastic block model (DCSBM) with heterogeneous degree distributions in the presence of outliers. Specifically, it focuses on how to identify and group nodes in complex networks so that the clustering stays accurate even when the network contains arbitrary or adversarial outliers.
### Main problem description
1. **Heterogeneous degree distribution**: In real-world networks, node degrees are usually heterogeneous, that is, the number of connections varies greatly from node to node. For example, in social networks, some users may have thousands of followers, while others have only a few. This heterogeneity poses challenges to clustering algorithms.
2. **Existence of outliers**: Some nodes in the network may not belong to any cluster (outliers), and their connection patterns may be arbitrary or even deliberately designed to confuse clustering algorithms. Such outliers can significantly degrade the quality of the clustering.
3. **Accurate recovery of clustering structure**: The goal of the paper is to develop an algorithm that recovers the true cluster structure of the network despite both challenges above, and to provide theoretical guarantees for it.
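To make the setting concrete, the following is a minimal sketch of how a small synthetic network of this kind could be sampled. It is not the paper's experimental setup: the function name, the parameters (`p_in`, `p_out`, `pareto_shape`, `outlier_p`) and the choice to connect outliers uniformly at random are all illustrative assumptions. Real adversarial outliers could connect in any pattern.

```python
import random


def sample_dcsbm_with_outliers(sizes, p_in, p_out, n_outliers,
                               pareto_shape=2.5, outlier_p=0.3, seed=0):
    """Sample a toy DCSBM adjacency matrix with extra outlier nodes.

    Hypothetical helper for illustration only. Inliers get Pareto
    degree weights; outliers (label -1) connect to every other node
    with a fixed probability as a stand-in for arbitrary behaviour.
    """
    rng = random.Random(seed)
    labels = [k for k, size in enumerate(sizes) for _ in range(size)]
    n_inliers = len(labels)
    labels += [-1] * n_outliers
    n = len(labels)

    # Heterogeneous degree weights, rescaled to have mean 1.
    theta = [rng.paretovariate(pareto_shape) for _ in range(n_inliers)]
    mean_theta = sum(theta) / n_inliers
    theta = [t / mean_theta for t in theta]

    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == -1 or labels[j] == -1:
                prob = outlier_p  # assumption: non-adversarial outliers
            else:
                base = p_in if labels[i] == labels[j] else p_out
                prob = min(1.0, theta[i] * theta[j] * base)
            if rng.random() < prob:
                adj[i][j] = adj[j][i] = 1
    return adj, labels
```

Calling `sample_dcsbm_with_outliers([50, 50], 0.3, 0.05, 10)` would give a 110-node graph with two clusters and ten outliers. The outlier rule is the part most worth replacing when probing robustness.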
### Solution
To address these problems, the paper proposes a clustering algorithm based on convex optimization. The algorithm effectively deals with outliers by introducing a regularization term to penalize nodes that deviate from the expected connection patterns. Specifically:
- **Convex optimization framework**: The algorithm is based on a semidefinite programming (SDP) relaxation of modularity maximization.
- **Regularization term**: A degree-dependent regularization term \(\alpha\cdot\text{diag}(d^*)\) is introduced, where \(d^*_i=\max(d_i, H^+)\) and \(H^+\) is the maximum expected number of connections of a node. This term penalizes nodes that exhibit abnormal connection patterns.
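The regularization term above can be sketched in a few lines. This is only an illustration of the formula \(d^*_i=\max(d_i, H^+)\) as stated here, assuming \(d_i\) is the observed degree of node \(i\). The function name `degree_penalty` is hypothetical, and how the term enters the SDP objective is not specified in this summary, so the sketch stops at the diagonal entries.

```python
def degree_penalty(adj, h_plus, alpha):
    """Return (d_star, penalty_diagonal) for an adjacency matrix.

    Assumptions: `adj` is a symmetric 0/1 list-of-lists, `h_plus`
    stands for H^+ and `alpha` for the regularization parameter.
    """
    # Observed degree of each node (assumed meaning of d_i).
    degrees = [sum(row) for row in adj]
    # d*_i = max(d_i, H^+): nodes below H^+ are all clipped up to H^+.
    d_star = [max(d, h_plus) for d in degrees]
    # Diagonal entries of alpha * diag(d*).
    penalty_diagonal = [alpha * d for d in d_star]
    return d_star, penalty_diagonal
```

For a star graph on four nodes with `h_plus=2` and `alpha=0.5`, the hub (degree 3) gets a diagonal entry of 1.5 while the three leaves share the floor value 1.0, so only nodes whose degree exceeds \(H^+\) receive a larger penalty.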
### Theoretical guarantees
The paper provides a rigorous theoretical analysis and proves that, under certain conditions, the algorithm exactly recovers the true clusters of the network with high probability. The key conditions include:
- **Density gap**: The gap between the density of intra-cluster edges and the density of inter-cluster edges must be large enough.
- **Parameter selection**: The regularization parameter \(\alpha\) needs to be large enough. Specifically, \(\alpha\geq c_1\frac{m}{H^-}\), where \(H^-\) is the minimum expected number of connections of intra-cluster nodes.
### Experimental verification
Through experiments on synthetic data, the paper shows that the algorithm outperforms existing algorithms on networks with heterogeneous degree distributions and a large number of outliers. The results indicate that the algorithm maintains high clustering accuracy even when the network is very sparse or highly heterogeneous.
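Comparisons like these need an error measure that ignores how clusters are named. The summary does not say which metric the paper uses, so the following is only one common choice, sketched under that assumption: the fraction of misclassified inliers, minimized over all relabelings of the clusters. The function name `clustering_error` and the convention that outliers carry the true label `-1` are illustrative.

```python
from itertools import permutations


def clustering_error(true_labels, pred_labels, n_clusters):
    """Fraction of inliers misclassified under the best label matching.

    Assumption: inliers have true labels 0..n_clusters-1 and outliers
    have true label -1; outliers are excluded from the error count.
    """
    pairs = [(t, p) for t, p in zip(true_labels, pred_labels) if t >= 0]
    best = len(pairs)
    # Try every renaming of the true clusters; feasible for small n_clusters.
    for perm in permutations(range(n_clusters)):
        mistakes = sum(1 for t, p in pairs if perm[t] != p)
        best = min(best, mistakes)
    return best / len(pairs)
```

The brute-force search over permutations is fine for a handful of clusters. For many clusters an assignment solver would be the usual replacement.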
In summary, this paper aims to accurately cluster the stochastic block model with heterogeneous degree distributions in the presence of outliers, and supports the proposed algorithm with both theoretical analysis and synthetic experiments.