Abstract: Using information-theoretic quantities in practical applications with continuous data is often hindered by the fact that probability density functions need to be estimated in higher dimensions, which can become unreliable or even computationally infeasible. To make these useful quantities more accessible, alternative approaches such as binned frequencies using histograms and k-nearest neighbors (k-NN) have been proposed. However, a systematic comparison of the applicability of these methods has been lacking. We wish to fill this gap by comparing kernel-density-based estimation (KDE) with these two alternatives in carefully designed synthetic test cases. Specifically, we wish to estimate three information-theoretic quantities (entropy, Kullback-Leibler divergence, and mutual information) from sample data. As a reference, the results are compared to closed-form solutions or numerical integrals. We generate samples from distributions of various shapes in dimensions ranging from one to ten. We evaluate the estimators' performance as a function of sample size, distribution characteristics, and chosen hyperparameters. We further compare the required computation time and specific implementation challenges. Notably, k-NN estimation tends to outperform the other methods in terms of algorithmic implementation, computational efficiency, and estimation accuracy, especially with sufficient data. This study provides valuable insights into the strengths and limitations of the different estimation methods for information-theoretic quantities. It also highlights the significance of considering the characteristics of the data, as well as the targeted information-theoretic quantity, when selecting an appropriate estimation technique. These findings will assist scientists and practitioners in choosing the most suitable method, considering their specific application and available data.
We have collected the compared estimation methods in a ready-to-use open-source Python 3 toolbox and, thereby, hope to promote the use of information-theoretic quantities by researchers and practitioners to evaluate the information in data and models in various disciplines.
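
As an illustration of the kind of comparison described above, the following minimal sketch estimates the entropy of a one-dimensional Gaussian sample with a histogram estimator and with a k-NN estimator, and compares both to the closed-form value. It is not taken from the toolbox: the function names (`digamma_int`, `knn_entropy_1d`, `histogram_entropy_1d`), the choice of the Kozachenko-Leonenko form for the k-NN estimator, and the values of `k`, `n_bins`, the sample size, and the seed are all assumptions made for this example. Entropies are in nats.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{i=1}^{n-1} 1/i."""
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, n))


def knn_entropy_1d(samples, k=4):
    """Kozachenko-Leonenko k-NN entropy estimate (nats) for 1-D data.

    Assumed form: H = psi(N) - psi(k) + log(2) + mean(log eps_i),
    where eps_i is the distance from sample i to its k-th nearest neighbor
    and 2 is the volume of the 1-D unit ball.
    """
    xs = sorted(samples)
    n = len(xs)
    log_eps_sum = 0.0
    for i, x in enumerate(xs):
        # In sorted 1-D data the k nearest neighbors lie within k positions.
        lo, hi = max(0, i - k), min(n, i + k + 1)
        dists = sorted(abs(x - xs[j]) for j in range(lo, hi) if j != i)
        eps = max(dists[k - 1], 1e-300)  # guard against log(0) for duplicates
        log_eps_sum += math.log(eps)
    return digamma_int(n) - digamma_int(k) + math.log(2.0) + log_eps_sum / n


def histogram_entropy_1d(samples, n_bins=30):
    """Binned entropy estimate (nats): discrete entropy plus log(bin width)."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        idx = min(int((x - lo) / width), n_bins - 1)  # put the maximum in the last bin
        counts[idx] += 1
    n = len(samples)
    discrete = -sum(c / n * math.log(c / n) for c in counts if c > 0)
    return discrete + math.log(width)


random.seed(0)  # illustrative seed
sigma = 2.0
samples = [random.gauss(0.0, sigma) for _ in range(5000)]

# Closed-form differential entropy of N(0, sigma^2).
h_true = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
h_knn = knn_entropy_1d(samples)
h_hist = histogram_entropy_1d(samples)

print(f"closed form: {h_true:.4f}")
print(f"k-NN:        {h_knn:.4f}")
print(f"histogram:   {h_hist:.4f}")
```

Both estimators depend on a hyperparameter (`k` and `n_bins`, respectively), which is one of the factors whose influence the study evaluates. The sketch covers only the one-dimensional case; in higher dimensions the neighbor search and the unit-ball volume term would need to change.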

Ranking Biomarkers Through Mutual Information

Identifying High-Dimensional Biomarkers for Personalized Medicine via Variable Importance Ranking

Discovering Cooperative Biomarkers for Heterogeneous Complex Disease Diagnoses

A Flexible Approach for Predictive Biomarker Discovery

On the Effect of Suboptimal Estimation of Mutual Information in Feature Selection and Classification

A distribution-free smoothed combination method of biomarkers to improve diagnostic accuracy in multi-category classification

Mutual Information Multinomial Estimation

An efficient approach for identifying important biomarkers for biomedical diagnosis

Beyond Normal: On the Evaluation of Mutual Information Estimators

Mutual information for detecting multi-class biomarkers when integrating multiple bulk or single-cell transcriptomic studies

Estimating Heterogeneous Treatment Effects: Mutual Information Bounds and Learning Algorithms.

Interactive visual formula composition of multidimensional data classifiers

On the Accurate Estimation of Information-Theoretic Quantities from Multi-Dimensional Sample Data

Rank-one matrix estimation: analysis of algorithmic and information theoretic limits by the spatial coupling method

Efficient screening of predictive biomarkers for individual treatment selection

Nonparametric empirical Bayes biomarker imputation and estimation

Estimating Conditional Mutual Information for Dynamic Feature Selection

Approximating mutual information of high-dimensional variables using learned representations

Bayesian Solutions for Assessing Differential Effects in Biomarker Positive and Negative Subgroups

Latest CHORUS and NOMAD results

Mutual Information Assisted Ensemble Recommender System for Identifying Critical Risk Factors in Healthcare Prognosis