Abstract:

Background: Machine learning methods have recently proven powerful for discovering knowledge from scientific data, offering promising prospects for discovery learning. Meanwhile, deep generative models such as Generative Adversarial Networks (GANs) have excelled at generating synthetic data that closely resemble real data. GANs have been employed extensively, motivated primarily by uses of synthetic data such as privacy preservation and data augmentation. However, certain aspects of GANs have received limited attention in the current literature. Existing studies predominantly use very large datasets, which leaves limited, complex datasets as an open challenge. Researchers have highlighted the ineffectiveness of conventional scores for selecting optimal GANs on limited datasets that exhibit complex high-order relationships. Furthermore, current methods evaluate a GAN's performance by comparing synthetic data to real data without assessing whether high-order relationships are preserved. Researchers have advocated for more objective GAN evaluation techniques and emphasized the importance of establishing interpretable connections between GAN latent space variables and meaningful data semantics.

Results: In this study, we used a custom GAN model to generate high-quality synthetic data for a very limited, complex biological dataset. We successfully recovered the cell-lineage developmental story from the synthetic data using the ab initio knowledge discovery method we previously developed. Our custom GAN model performed better than the state-of-the-art cscGAN model when evaluated on recovering hidden knowledge from a limited, complex dataset. We then devised a quantitative scoring mechanism specific to temporal datasets to successfully reproduce the GAN results on human and mouse embryonic datasets. Our Latent Space Interpretation (LSI) scheme was able to identify anomalies.
We also found that the GAN latent space effectively captured semantic information and may be used to interpolate data when real data are sparsely sampled.

Conclusion: In summary, we used a customized GAN model to generate synthetic data for a limited, complex dataset and compared the results with those of the state-of-the-art cscGAN model. We recovered the cell-lineage developmental story as hidden knowledge to evaluate whether the GAN preserves complex high-order relationships. We formulated a quantitative score to successfully reproduce the results on human and mouse embryonic datasets. We designed an LSI scheme to identify anomalies and to understand the mechanism by which the GAN captures important data semantics in its latent space.

### Competing Interest Statement

The authors have declared no competing interest.

* scRNA-seq: Single Cell RNA Sequencing
* ML: Machine Learning
* GAN: Generative Adversarial Networks
* DGM: Deep Generative Models
* SVM: Support Vector Machines
* RBF: Radial Basis Function
* RDF: Random Decision Forests
* cscGAN: Conditional Single Cell Generative Adversarial Networks
* cwGAN: Conditional Wasserstein Generative Adversarial Networks with Gradient Penalty using Label Smoothing
* T-PCAVR: Time-point Principal Component Analysis Variance Ratio
* LSI: Latent Space Interpretation
* TE: Trophectoderm
* PL: Pre Lineage
* ICM: Inner Cell Mass
* RMSProp: Root Mean Square Propagation
* ARS: Adjusted Reliability Score
* OVR: One vs Rest
* PCA: Principal Component Analysis
* E3: Embryonic Day 3
* E4: Embryonic Day 4
* E5: Embryonic Day 5
* E6: Embryonic Day 6
* E7: Embryonic Day 7
* E5.25: Embryonic Day 5.25
* E5.5: Embryonic Day 5.5
* E6.25: Embryonic Day 6.25
* E6.5: Embryonic Day 6.5
