Abstract: We introduce resampled mutual information (ResMI), a novel measure of clustering similarity that combines insights from information-theoretic and pair-counting approaches to clustering and community detection. Similar to chance-corrected measures, ResMI satisfies the constant baseline property, but it has the advantages of not requiring adjustment terms and being fully interpretable in the language of information theory. Experiments on synthetic datasets demonstrate that ResMI is robust to common biases exhibited by existing measures, particularly in settings with high cluster counts and asymmetric cluster distributions. Additionally, we show that ResMI identifies meaningful community structures in two real contact tracing networks.

What problem does this paper attempt to address?

This paper addresses biases and limitations in existing measures of clustering similarity. Specifically, the authors point out:

1. **Constant Baseline Property**: Existing measures such as mutual information (MI), normalized mutual information (NMI), and the Rand index (RI) fail to satisfy this property, which requires that the expected similarity between two independently generated clusterings be zero.
2. **Cluster Count Bias**: Measures such as NMI tend to overestimate similarity when the clusterings being compared contain many clusters, which leads to inaccurate results.
3. **Symmetry Bias**: Some measures, such as the Rand index and its adjusted version (ARI), perform poorly when clusters are asymmetrically distributed.
4. **Model Dependence**: Adjusted mutual information (AMI) and other chance-corrected measures depend on a specific randomization model, which limits their generality and interpretability.

To solve these problems, the authors introduce a new clustering similarity measure, Resampled Mutual Information (ResMI). ResMI combines the advantages of information-theoretic and pair-counting methods, and:

- Satisfies the constant baseline property.
- Does not require adjustment terms, and so does not depend on a randomization model.
- Is defined entirely in the language of information theory, which keeps it interpretable.

Through experiments on synthetic datasets and real contact tracing networks, the authors show that ResMI is robust to the biases listed above, especially in settings with high cluster counts and asymmetric cluster distributions.

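The constant baseline issue can be seen numerically with the Rand index. The following is a minimal sketch, not taken from the paper: it assumes the standard pair-counting definition of the Rand index (the fraction of object pairs on which two clusterings agree) and draws independent, uniformly random labelings. The names `rand_index` and `mean_baseline`, and the parameters `n`, `trials`, and `seed`, are illustrative choices.

```python
import random
from itertools import combinations


def rand_index(f, g):
    """Fraction of object pairs on which labelings f and g agree.

    A pair agrees if it is grouped together in both labelings
    or separated in both labelings.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(f)), 2):
        same_f = f[i] == f[j]
        same_g = g[i] == g[j]
        agree += same_f == same_g
        total += 1
    return agree / total


def mean_baseline(k, n=100, trials=20, seed=0):
    """Average Rand index between independent random labelings with k labels.

    Assumption: each object gets a label drawn uniformly from range(k),
    independently in f and in g, so the two labelings share no structure.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        f = [rng.randrange(k) for _ in range(n)]
        g = [rng.randrange(k) for _ in range(n)]
        scores.append(rand_index(f, g))
    return sum(scores) / trials


# Print the average score of unrelated labelings for several cluster counts.
for k in (2, 5, 20):
    print(k, round(mean_baseline(k), 3))
```

Under these assumptions the average score of unrelated labelings is well above zero and moves with `k`, which is the kind of non-constant baseline the list above describes.
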
### Formula Summary

- **Mutual Information (MI)**:

  \[ I(P_f; P_g) = \sum_{m,m'} P_{f,g}(m, m') \log \frac{P_{f,g}(m, m')}{P_f(m)\, P_g(m')} \]

  where \(P_{f,g}(m, m') = \frac{|\{ i \in [n] : f(i) = m,\ g(i) = m' \}|}{n}\), \(P_f(m) = \frac{|\{ i \in [n] : f(i) = m \}|}{n}\), and \(P_g(m') = \frac{|\{ i \in [n] : g(i) = m' \}|}{n}\).

- **Normalized Mutual Information (NMI)**:

  \[ \text{NMI}(f; g) = \frac{I(P_f; P_g)}{\frac{1}{2}\left(H(P_f) + H(P_g)\right)} \]

  where \(H(P_f) = -\sum_m P_f(m) \log P_f(m)\).

- **Adjusted Mutual Information (AMI)**:

  \[ \text{AMI}(f, g) = \frac{\text{NMI}(f, g) - E[\text{NMI}(f, g)]}{1 - E[\text{NMI}(f, g)]} \]

- **Reduced Mutual Information (RMI)**:

  \[ \text{RMI} = I(f; g) - \frac{1}{n} \log \Omega(f; g) \]

  where \(\Omega(f; g)\) is the number of non-negative integer matrices with the same marginals as Table I.

- **Resampled Mutual Information (ResMI)**:
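The MI and NMI definitions above translate directly into code. The sketch below covers only those two formulas and assumes each clustering is given as a list of labels, one per object. The function names are illustrative, natural logarithms are used, and returning 0.0 when both entropies are zero is an arbitrary convention chosen here, not something the summary specifies.

```python
import math
from collections import Counter


def entropy(labels):
    """H(P_f) = -sum_m P_f(m) log P_f(m), with P_f(m) the fraction of objects labeled m."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(f, g):
    """I(P_f; P_g) computed from the empirical joint and marginal label counts."""
    n = len(f)
    joint = Counter(zip(f, g))  # counts of (m, m') label pairs
    count_f = Counter(f)        # counts behind P_f
    count_g = Counter(g)        # counts behind P_g
    mi = 0.0
    for (m, m2), c in joint.items():
        # P_fg / (P_f * P_g) simplifies to c * n / (count_f[m] * count_g[m2]).
        mi += (c / n) * math.log(c * n / (count_f[m] * count_g[m2]))
    return mi


def nmi(f, g):
    """MI divided by the mean of the two entropies."""
    denom = 0.5 * (entropy(f) + entropy(g))
    # Assumption: define NMI as 0.0 when both labelings have a single cluster.
    return mutual_information(f, g) / denom if denom > 0 else 0.0


# Usage: compare two labelings of six objects that differ on one object.
f = [0, 0, 0, 1, 1, 1]
g = [0, 0, 1, 1, 1, 1]
print(mutual_information(f, g), nmi(f, g))
```

Working from label counts rather than probabilities keeps the logarithm's argument exact for small inputs, so two labelings whose joint counts factorize give an MI of exactly zero.
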

Resampled Mutual Information for Clustering and Community Detection
