Resampled Mutual Information for Clustering and Community Detection

Cheaheon Lim
2024-11-21
Abstract: We introduce resampled mutual information (ResMI), a novel measure of clustering similarity that combines insights from information theoretic and pair counting approaches to clustering and community detection. Similar to chance-corrected measures, ResMI satisfies the constant baseline property, but it has the advantages of not requiring adjustment terms and being fully interpretable in the language of information theory. Experiments on synthetic datasets demonstrate that ResMI is robust to common biases exhibited by existing measures, particularly in settings with high cluster counts and asymmetric cluster distributions. Additionally, we show that ResMI identifies meaningful community structures in two real contact tracing networks.
Social and Information Networks, Machine Learning
What problem does this paper attempt to address?
The paper addresses the biases and limitations of existing clustering similarity measures. Specifically, the author points out four issues:

1. **Constant Baseline Property**: This property requires that the expected similarity between two independently generated clusterings be zero. Existing measures such as mutual information (MI), normalized mutual information (NMI), and the Rand index (RI) fail to satisfy it.
2. **Cluster Count Bias**: Measures such as NMI tend to overestimate similarity when the clusterings being compared contain many clusters, which leads to inaccurate results.
3. **Symmetry Bias**: Some measures, such as the Rand index and its adjusted version (ARI), perform poorly on asymmetrically distributed clusters.
4. **Model Dependence**: Adjusted mutual information (AMI) and other chance-corrected measures depend on a specific randomization model, which limits their generality and interpretability.

To solve these problems, the author introduces a new clustering similarity measure, Resampled Mutual Information (ResMI). ResMI combines the advantages of information-theoretic and pair-counting methods, and it:

- Satisfies the constant baseline property.
- Does not require adjustment terms, and so does not depend on a randomization model.
- Is defined entirely in the language of information theory, which keeps it interpretable.

Through experiments on synthetic datasets and two real contact-tracing networks, the author shows that ResMI handles the above problems better than existing measures, especially with high cluster counts and asymmetric cluster distributions.
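
The baseline and cluster-count issues can be seen in a small simulation. The sketch below is not taken from the paper: it draws pairs of independent, uniformly random labelings (an assumed null model) and averages their Rand index. The function names `rand_index` and `mean_rand_index_independent`, and the chosen values of `n`, `k`, and `trials`, are illustrative assumptions.

```python
import random
from itertools import combinations


def rand_index(f, g):
    """Fraction of object pairs that both labelings treat alike
    (together in both, or apart in both)."""
    n = len(f)
    agree = 0
    for i, j in combinations(range(n), 2):
        same_f = f[i] == f[j]
        same_g = g[i] == g[j]
        if same_f == same_g:
            agree += 1
    return agree / (n * (n - 1) / 2)


def mean_rand_index_independent(n, k, trials, rng):
    """Average Rand index over pairs of independent labelings.

    Assumption: each object gets one of k labels uniformly at random;
    this null model is chosen for illustration only.
    """
    total = 0.0
    for _ in range(trials):
        f = [rng.randrange(k) for _ in range(n)]
        g = [rng.randrange(k) for _ in range(n)]
        total += rand_index(f, g)
    return total / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
baseline = {
    k: mean_rand_index_independent(n=60, k=k, trials=200, rng=rng)
    for k in (2, 5, 10)
}
for k, value in baseline.items():
    print(f"k={k:2d}  mean Rand index of unrelated labelings: {value:.3f}")
```

Under this null model, a pair of objects is treated alike by both labelings with probability 1/k^2 + (1 - 1/k)^2, so the average should sit near 0.5 for k = 2 and climb toward 1 as k grows, instead of staying at zero. That is the kind of baseline violation and cluster-count dependence the summary describes.
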

### Formula Summary

- **Mutual Information (MI)**:
  \[ I(P_f; P_g) = \sum_{m, m'} P_{f,g}(m, m') \log \frac{P_{f,g}(m, m')}{P_f(m)\, P_g(m')} \]
  where
  \(P_{f,g}(m, m') = \frac{|\{ i \in [n] : f(i) = m,\ g(i) = m' \}|}{n}\),
  \(P_f(m) = \frac{|\{ i \in [n] : f(i) = m \}|}{n}\),
  \(P_g(m') = \frac{|\{ i \in [n] : g(i) = m' \}|}{n}\).
- **Normalized Mutual Information (NMI)**:
  \[ \text{NMI}(f; g) = \frac{I(P_f; P_g)}{\frac{1}{2}\left(H(P_f) + H(P_g)\right)} \]
  where \(H(P_f) = -\sum_m P_f(m) \log P_f(m)\).
- **Adjusted Mutual Information (AMI)**:
  \[ \text{AMI}(f; g) = \frac{\text{NMI}(f; g) - E[\text{NMI}(f; g)]}{1 - E[\text{NMI}(f; g)]} \]
- **Reduced Mutual Information (RMI)**:
  \[ \text{RMI}(f; g) = I(P_f; P_g) - \frac{1}{n} \log \Omega(f; g) \]
  where \(\Omega(f; g)\) is the number of non-negative integer matrices with the same marginals as Table I.
- **Resampled Mutual Information (ResMI)**:
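
As a rough illustration of the MI, NMI, and AMI formulas in this summary, the sketch below computes them from two label lists using only the Python standard library. It is not the paper's code. The expectation in AMI is estimated here by Monte Carlo shuffling of one labeling, which assumes a permutation null model; the function names, the example labelings, and the trial count are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """H(P_f) = -sum_m P_f(m) log P_f(m), natural logarithm."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(f, g):
    """I(P_f; P_g) from the empirical joint and marginal distributions."""
    n = len(f)
    joint = Counter(zip(f, g))
    count_f, count_g = Counter(f), Counter(g)
    mi = 0.0
    for (m, m2), c in joint.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((count_f[m] / n) * (count_g[m2] / n)))
    return mi


def nmi(f, g):
    """MI divided by the mean of the two entropies."""
    denom = 0.5 * (entropy(f) + entropy(g))
    # Undefined when both labelings have a single cluster; returning 0.0
    # is an arbitrary convention for this sketch.
    return mutual_information(f, g) / denom if denom > 0 else 0.0


def ami_monte_carlo(f, g, trials=2000, seed=0):
    """(NMI - E[NMI]) / (1 - E[NMI]) with E[NMI] estimated by shuffling g.

    Assumption: a permutation null model, approximated by sampling.
    """
    rng = random.Random(seed)
    shuffled = list(g)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)
        total += nmi(f, shuffled)
    expected = total / trials
    return (nmi(f, g) - expected) / (1.0 - expected)


f = [0, 0, 0, 1, 1, 1]
g = [0, 0, 1, 1, 2, 2]
mi_fg = mutual_information(f, g)
nmi_fg = nmi(f, g)
ami_fg = ami_monte_carlo(f, g)
print(f"MI={mi_fg:.4f}  NMI={nmi_fg:.4f}  AMI (Monte Carlo)={ami_fg:.4f}")
```

For these two labelings the MI works out by hand to (2/3) log 2. Because random shuffles still produce a positive NMI on average, the adjusted score comes out below the unadjusted one, which is the effect the chance correction is meant to have.
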