Towards AI-designed genomes using a variational autoencoder

Natasha K Dudek, Doina Precup
DOI: https://doi.org/10.1101/2023.10.22.563484
2024-06-19
Abstract: Genomes serve as the blueprints for life, encoding elaborate networks of genes whose products must seamlessly interact to support living organisms. Humans' capacity to understand biological systems from scratch is limited by the sheer size and complexity of those systems. In this work, we develop a proof-of-concept framework for training a machine learning algorithm to learn the basic genetic principles that underlie genome composition. Our variational autoencoder model, DeepGenomeVector, was trained to take as input corrupted bacterial genetic blueprints (i.e. complete gene sets, henceforth "genome vectors") in which most genes had been "removed", and re-create the original. The resulting model effectively captures the complex dependencies in genomic networks, as evaluated by both qualitative and quantitative metrics. An in-depth functional analysis of a generated genome vector shows that its encoded pathways are interconnected and nearly complete. On the test set, where the model's ability to re-generate the original, uncorrupted genome vector was evaluated, an AUC score of 0.98 and an F1 score of 0.83 provide support for the model's ability to generate diverse, high-quality genome vectors. This work showcases the power of machine learning approaches for synthetic biology and highlights the possibility that just as humans can design an AI that animates a robot, AIs may one day be able to design genomic blueprints that animate carbon-based cells.
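The corruption-and-reconstruction setup described in the abstract can be illustrated with a minimal sketch. It assumes a genome vector is a binary list (1 = gene present, 0 = absent). The function names `corrupt` and `f1_score`, the `keep_fraction` parameter, and the toy data are hypothetical and are not taken from the paper. The sketch shows only how an input might be corrupted and how a reconstruction could be scored with F1. It does not implement the DeepGenomeVector model itself, and the paper's actual corruption procedure and prediction threshold may differ.

```python
import random


def corrupt(genome_vector, keep_fraction, rng):
    """Return a copy in which most present genes are 'removed' (set to 0).

    Roughly keep_fraction of the genes that are present are retained.
    (Hypothetical helper; the paper's exact corruption scheme is not given here.)
    """
    present = [i for i, bit in enumerate(genome_vector) if bit == 1]
    if not present:
        return list(genome_vector)
    n_keep = max(1, round(keep_fraction * len(present)))
    kept = set(rng.sample(present, n_keep))
    return [1 if i in kept else 0 for i in range(len(genome_vector))]


def f1_score(true_vector, predicted_vector):
    """F1 over gene presence calls: harmonic mean of precision and recall."""
    pairs = list(zip(true_vector, predicted_vector))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
# Toy genome vector: 1000 candidate genes, about 30% present (assumed values).
genome = [1 if rng.random() < 0.3 else 0 for _ in range(1000)]
corrupted = corrupt(genome, keep_fraction=0.1, rng=rng)

# Baseline: score the corrupted input itself as if it were the reconstruction.
baseline_f1 = f1_score(genome, corrupted)
```

Scoring the corrupted input directly gives a baseline: a trained model would be expected to exceed it by restoring removed genes without adding many spurious ones.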
Synthetic Biology
What problem does this paper attempt to address?