Abstract

Motivation: In pathway analysis, we aim to establish a connection between the activity of a particular biological pathway and a difference in phenotype. Many methods are available for pathway analysis. Many of them rely on an upstream differential expression analysis, and many model the relations between the abundances of the analytes in a pathway as linear relationships.

Results: Here, we propose a new method for pathway analysis, MIPath, that relies on information-theoretic principles and, therefore, does not model the association between pathway activity and phenotype, resulting in relatively few assumptions. For this, we construct a graph of the data points for each pathway using a nearest-neighbor approach and score the association between the structure of this graph and the phenotype of these same samples using Mutual Information, while adjusting each score for the effects of random chance. The initial nearest-neighbor approach avoids individual gene-level comparisons, making the method scalable and less vulnerable to missing values. These properties make our method particularly useful for single-cell data. We benchmarked our method on several single-cell datasets, comparing it to established and new methods, and found that it produces robust, reproducible, and meaningful scores.

Availability and implementation: Source code is available at https://github.com/statisticalbiotechnology/mipath, or through the Python Package Index as “mipathway.”
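
The abstract describes scoring with Mutual Information while adjusting for random chance, but does not write the score out. As a hedged illustration, the block below gives the commonly used definition of adjusted mutual information. The symbols are assumptions introduced here, not notation from the paper: U is the grouping of samples derived from the pathway graph, V is the phenotype labelling, n_{ij} counts samples in group i with phenotype j, a_i and b_j are the row and column totals, and N is the number of samples. The arithmetic-mean normalizer is one of several conventions, and the paper may use a different one.

```latex
\begin{align}
\mathrm{MI}(U,V) &= \sum_{i}\sum_{j} \frac{n_{ij}}{N}\,
    \log\frac{N\, n_{ij}}{a_i\, b_j}, \\
H(U) &= -\sum_{i} \frac{a_i}{N}\,\log\frac{a_i}{N}, \qquad
H(V) = -\sum_{j} \frac{b_j}{N}\,\log\frac{b_j}{N}, \\
\mathrm{AMI}(U,V) &=
    \frac{\mathrm{MI}(U,V) - \mathbb{E}\left[\mathrm{MI}(U,V)\right]}
         {\tfrac{1}{2}\left(H(U) + H(V)\right) - \mathbb{E}\left[\mathrm{MI}(U,V)\right]}
\end{align}
```

Here the expectation is taken over random labellings with the same group sizes, so a score near 0 means the association is no stronger than chance, and a score of 1 means the two labellings agree exactly.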

What problem does this paper attempt to address?

The paper aims to address a problem in biological pathway analysis, namely how to better establish the connection between biological pathway activity and phenotypic differences. The authors propose a new method called MIPath (Mutual Information Pathway Analysis), which is based on information theory principles. This method constructs data point graphs for each pathway and uses Mutual Information (MI) to assess the association between these graph structures and sample phenotypes, while adjusting for the influence of random chance.

The main features of MIPath include:

1. **No reliance on linear relationship assumptions**: Unlike many existing methods, MIPath does not assume that the relationships between pathway components or between pathway activity and phenotypes are linear.
2. **Suitable for single-cell data**: This method is particularly well-suited for handling single-cell data because it employs a nearest-neighbor approach to avoid comparisons at the individual gene level, making the method more robust and less susceptible to missing values.
3. **Few assumptions**: By using mutual information as the fundamental metric, this method makes fewer assumptions about how gene products interact within pathways and how these interactions lead to phenotypic changes.
4. **Fast computation**: Even for large-scale datasets, MIPath can complete the analysis in a relatively short time.

The main steps of the method mentioned in the paper are as follows:

- Construct data point graphs for each pathway using a nearest-neighbor algorithm.
- Use the Leiden algorithm to detect modules, identifying groups of data points with similar pathway activity.
- Calculate adjusted mutual information scores to quantify the degree of association between pathway states and sample-specific variables (such as phenotypic annotations).

In benchmarks on multiple single-cell datasets, MIPath performed well and compared favorably with other existing pathway analysis methods in identifying target pathways. The results were also shown to be reproducible and sensitive.
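
The three method steps listed above can be illustrated with a minimal, standard-library-only Python sketch. It is not the `mipathway` implementation, and every function name in it (`knn_graph`, `connected_components`, `pathway_score`, and so on) is hypothetical. Two simplifications are assumed: connected components of the nearest-neighbor graph stand in for Leiden module detection, and the adjusted score uses the arithmetic-mean normalizer.

```python
import math
from collections import Counter


def knn_graph(points, k):
    """Link each point to its k nearest neighbors (Euclidean), symmetrized."""
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Stand-in for Leiden (assumption): label nodes by connected component."""
    labels = [-1] * len(adj)
    current = 0
    for start in range(len(adj)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_info(u, v):
    n = len(u)
    a, b, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(nij / n * math.log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())


def expected_mutual_info(u, v):
    """Expected MI over random labellings with the same group sizes."""
    n = len(u)
    a, b = list(Counter(u).values()), list(Counter(v).values())
    lg = math.lgamma
    emi = 0.0
    for ai in a:
        for bj in b:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log of the hypergeometric probability of observing nij.
                log_p = (lg(ai + 1) + lg(bj + 1) + lg(n - ai + 1) + lg(n - bj + 1)
                         - lg(n + 1) - lg(nij + 1) - lg(ai - nij + 1)
                         - lg(bj - nij + 1) - lg(n - ai - bj + nij + 1))
                emi += nij / n * math.log(n * nij / (ai * bj)) * math.exp(log_p)
    return emi


def adjusted_mutual_info(u, v):
    mi, emi = mutual_info(u, v), expected_mutual_info(u, v)
    denom = 0.5 * (entropy(u) + entropy(v)) - emi
    # Sketch choice: treat degenerate (uninformative) labellings as score 0.
    return (mi - emi) / denom if abs(denom) > 1e-12 else 0.0


def pathway_score(points, phenotype, k):
    """Steps 1-3: kNN graph, module detection, chance-adjusted MI."""
    modules = connected_components(knn_graph(points, k))
    return adjusted_mutual_info(modules, phenotype)


# Toy data: each point holds the abundances of two genes in one pathway.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
phenotype = ["control"] * 4 + ["case"] * 4
unrelated = ["a", "b"] * 4

print(round(pathway_score(points, phenotype, k=2), 3))
print(round(pathway_score(points, unrelated, k=2), 3))
```

On this toy data the first score should be close to 1, because the two modules coincide with the phenotype groups, while the second should fall at or below 0, because the labels carry no information about the modules. A real analysis would replace the connected-components stand-in with Leiden module detection, as the paper describes.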

Pathway analysis through mutual information

PathAligner Pathway Retrieval and Alignment

Identifying significantly impacted pathways: a comprehensive review and assessment

Path mutual information for a class of biochemical reaction networks

Post-transcriptional knowledge in pathway analysis increases the accuracy of phenotypes classification

A novel signaling pathway impact analysis

Identifying statistical dependence in genomic sequences via mutual information estimates

Mutual information for detecting multi-class biomarkers when integrating multiple bulk or single-cell transcriptomic studies

Network-based pathway enrichment analysis with incomplete network information

Network Methods for Pathway Analysis of Genomic Data

Analysis of Protein Pathway Networks Using Hybrid Properties

Predicting the Pathway Involvement of All Pathway and Associated Compound Entries Defined in the Kyoto Encyclopedia of Genes and Genomes

A Novel Approach for Pathway Inference Based on Network Flow

MPAC: a computational framework for inferring cancer pathway activities from multi-omic data

Pathway level analysis of gene expression using singular value decomposition

An approach of gene regulatory network construction using mixed entropy optimizing context-related likelihood mutual information

Predicting the Pathway Involvement of Metabolites Based on Combined Metabolite and Pathway Features

Semiparametric Mixed Model for Evaluating Pathway-Environment Interaction

Inferring the functional effect of gene expression changes in signaling pathways

CPMI: comprehensive neighborhood-based perturbed mutual information for identifying critical states of complex biological processes

A robust statistical approach for finding informative spatially associated pathways