Abstract: Single-cell omics has proven to be a powerful instrument for exploring cellular diversity. With advances in sequencing protocols, single-cell data are now routinely collected from large-scale cohorts comprising samples from hundreds of donors, with the goal of uncovering the molecular bases of higher-level donor phenotypes of interest. For example, to better understand the mechanisms behind Alzheimer's disease (AD), recent studies with up to hundreds of samples have investigated the relationships between single-cell omics measurements and donors' neuropathological phenotypes (e.g. Braak staging). To ensure the robustness of such findings, it may be desirable to aggregate data from multiple distinct donor cohorts. Unfortunately, doing so is not always straightforward, as different cohorts may be equipped with different sets of phenotype labels. Continuing the AD example, recent AD study cohorts have provided various subsets of neuropathological phenotypes, cognitive testing results, and APOE genotype. Thus, it is desirable to be able to infer any missing phenotype labels so that all available cell-level data can be used in the study of a given phenotype of interest. Moreover, beyond simply imputing missing phenotype information, it is often of interest to understand which groups of cells and/or molecular features are most predictive of a given phenotype. As such, there is a pressing need for computational methods that can connect cell-level measurements with donor-level labels. However, accomplishing this task is not straightforward. While a rich literature exists on learning meaningful low-dimensional representations of cells and on inferring corresponding cell-level labels (e.g. cell type), the donor-level prediction task introduces substantial additional complexity. 
For example, different numbers of cells may be recovered from each donor, and thus our prediction model must be able to handle an arbitrary number of cells as input. Moreover, ideally our model would not require any prior knowledge beyond our cell-level measurements, such as the importance of different cell types for a given prediction task. To resolve these issues, here we propose milVI (multiple instance learning variational inference), a deep generative modeling framework that explicitly accounts for donor-level phenotypes and enables inference of missing phenotype labels post-training. To handle varying numbers of cells per donor when inferring phenotype labels, milVI leverages recent advances in multiple instance learning. We validated milVI by applying it to impute held-out Braak staging information from an Alzheimer's disease study cohort from Mathys et al., and we found that our method achieved lower error on this task than naive imputation methods.
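
To illustrate how a multiple instance learning aggregator can accept a variable number of cells per donor, the following is a minimal sketch of attention-based pooling. The abstract does not specify which aggregator milVI uses, so the pooling form, the function name `attention_pool`, the parameters `V` and `w`, and the random toy data are all assumptions made for illustration only.

```python
import numpy as np


def attention_pool(cell_embeddings, V, w):
    """Pool an (n_cells, d) matrix of cell embeddings into one (d,) donor embedding.

    Hypothetical attention pooling: each cell receives a scalar score, the
    scores are softmax-normalized over the donor's cells, and the donor
    embedding is the weighted average of its cell embeddings.
    """
    scores = np.tanh(cell_embeddings @ V) @ w       # shape (n_cells,)
    scores = scores - scores.max()                  # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ cell_embeddings, weights


rng = np.random.default_rng(0)
d, h = 8, 4                        # embedding and attention dimensions (arbitrary)
V = rng.normal(size=(d, h))        # stand-ins for learned parameters
w = rng.normal(size=h)

# Two toy donors with very different numbers of recovered cells.
donor_a = rng.normal(size=(50, d))
donor_b = rng.normal(size=(1200, d))

emb_a, weights_a = attention_pool(donor_a, V, w)
emb_b, weights_b = attention_pool(donor_b, V, w)
```

Both donors map to an embedding of the same size regardless of cell count, and the result does not depend on the order of the cells, so a single downstream predictor could in principle be applied to every donor. In a pooling scheme of this kind, the per-cell weights also offer one possible handle on which cells contribute most to a donor-level prediction.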
