Abstract: The application of synthetic datasets in training and validating analysis tools has led to improvements in many decision-making tasks across a range of domains, from computer vision to digital pathology. Synthetic datasets overcome the constraints of real-world datasets, namely difficulties in collection and labeling, expense, time, and privacy concerns. In flow cytometry, real cell-based datasets are limited by properties such as size, number of parameters, distance between cell populations, and distributions, and are often focused on a narrow range of diseases or cell types. In some cases researchers have designed these desired properties into synthetic datasets; however, they have been implemented inconsistently, and publicly available, high-quality synthetic datasets remain scarce. In this research, we propose a method to systematically design and generate synthetic flow cytometry datasets with highly controlled characteristics. We demonstrate the generation of two-cluster synthetic datasets with specific degrees of separation between cell populations, and of non-normal distributions with increasing levels of skewness and different orientations of skew pairs. We apply our synthetic datasets to test the performance of SPADE3, a popular software tool for automated cell population identification, and define the region where its performance decreases as the clusters get closer together. Application of the synthetic skewed dataset suggests the software is capable of processing non-normal data. We calculate the classification accuracy of SPADE3 with a robustness not achievable using real-world datasets. Our approach aims to advance research toward the generation of high-quality synthetic flow cytometry datasets and to raise awareness of them in the community. The synthetic datasets can be used in benchmarking studies that critically evaluate cell population identification tools and help illustrate potential inconsistencies between digital platforms. These datasets have the potential to improve cell characterization workflows that integrate automated analysis in clinical diagnostics and cell therapy manufacturing.
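To make the two kinds of dataset described above concrete, the following is a minimal sketch and not the authors' generation method. It assumes Gaussian clusters whose centers are offset along one parameter by a chosen number of standard deviations, and it draws skewed values from a skew-normal distribution using its standard stochastic representation. The function names and arguments (`two_cluster_dataset`, `skew_normal_sample`, `separation`, `alpha`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def two_cluster_dataset(n_per_cluster, separation, n_params=2, sd=1.0, seed=0):
    """Two Gaussian clusters whose centers differ by `separation` standard
    deviations along the first parameter (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    center_a = np.zeros(n_params)
    center_b = np.zeros(n_params)
    center_b[0] = separation * sd  # shift the second population only
    cluster_a = rng.normal(center_a, sd, size=(n_per_cluster, n_params))
    cluster_b = rng.normal(center_b, sd, size=(n_per_cluster, n_params))
    events = np.vstack([cluster_a, cluster_b])
    labels = np.repeat([0, 1], n_per_cluster)  # ground-truth population label
    return events, labels


def skew_normal_sample(n, alpha, seed=0):
    """Draw n values from a standard skew-normal distribution with shape
    parameter `alpha`, using Z = delta*|U0| + sqrt(1 - delta^2)*U1."""
    rng = np.random.default_rng(seed)
    delta = alpha / np.sqrt(1.0 + alpha**2)
    u0 = rng.standard_normal(n)
    u1 = rng.standard_normal(n)
    return delta * np.abs(u0) + np.sqrt(1.0 - delta**2) * u1


def sample_skewness(x):
    """Third standardized moment, used here to check the direction of skew."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))


if __name__ == "__main__":
    # Decreasing separation brings the two populations closer together.
    for separation in (6.0, 3.0, 1.0):
        events, labels = two_cluster_dataset(5000, separation)
        gap = events[labels == 1, 0].mean() - events[labels == 0, 0].mean()
        print(f"separation={separation}: observed gap={gap:.2f}")
    # Larger |alpha| gives stronger skew; the sign sets its orientation.
    for alpha in (0.0, 2.0, 5.0, -5.0):
        values = skew_normal_sample(20000, alpha)
        print(f"alpha={alpha}: skewness={sample_skewness(values):.2f}")
```

Because the ground-truth label of every event is known, the output of a clustering tool can be scored directly against `labels`, which is the property that makes synthetic data useful for measuring classification accuracy.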

Systematic design, generation, and application of synthetic datasets for flow cytometry
