Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation

Andrew G Duncan, Jennifer A Mitchell, Alan M Moses

DOI: https://doi.org/10.1093/bioinformatics/btae190

IF: 5.8

2024-03-29

Bioinformatics

Abstract:

Motivation: Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence to regulatory function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited.

Results: Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics.

Availability and implementation: The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.

biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology

What problem does this paper attempt to address?

The problem this paper addresses is how to increase the diversity of genomic sequence training data using evolutionarily related (homologous) sequences, and thereby improve the performance of supervised deep learning on functional genomics prediction tasks. Specifically, the paper proposes a method called "phylogenetic augmentation", which increases the diversity of the training data by extracting homologous sequences from multi-species genome alignments. The method aims to overcome the limitations of existing data augmentation techniques for genomic sequence data and to improve the data efficiency and performance of deep-learning models on small datasets.

### Main Research Contents

1. **Background and Motivation**:
   - Supervised deep learning is widely used to model the complex relationship between genomic sequence and regulatory function.
   - Understanding how these models make predictions can provide biological insight into regulatory functions.
   - The genome may contain too little sequence variation to train models of suitable complexity.
   - Current data augmentation methods for genomic sequence data are limited.
2. **Methods**:
   - **Phylogenetic Augmentation**: Increase the diversity of the training data by extracting homologous sequences from multi-species genome alignments.
   - **Experimental Design**: Run experiments with multiple convolutional neural network (CNN) architectures on different genomic datasets, including STARR-seq data from Drosophila S2 cells, DNase-seq data from human cell lines, and RNA-binding data for yeast 3′UTRs.
   - **Model Training**: Apply phylogenetic augmentation during training, then fine-tune on the original data to further improve performance.
3. **Results**:
   - **Performance Improvement**: Phylogenetic augmentation significantly improves the test-set prediction performance of multiple CNN models, especially on small datasets.
   - **Data Efficiency**: Phylogenetic augmentation improves data efficiency on small datasets and can rescue model performance when the training set is down-sampled.
   - **Practical Application**: On the yeast 3′UTR dataset, phylogenetic augmentation enables a deep-learning model to predict binding of the RNA-binding protein PUF3, with significantly improved performance.

### Conclusion

Phylogenetic augmentation is an effective data augmentation method that can improve the performance of supervised deep-learning models on functional genomics prediction tasks, especially on small datasets. It increases the diversity of the training data by introducing evolutionarily related sequences, thereby improving the model's generalization and prediction accuracy.
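The training procedure summarized above (augment with homologs during training, then fine-tune on the original data) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `homologs` dictionary, and the toy sequences are hypothetical, and it assumes that each homolog inherits the label of the sequence it replaces and that homologs have already been extracted from a multi-species alignment.

```python
import random


def phylo_augment(seq, homologs, rng):
    """Return a randomly chosen homolog of `seq`, or `seq` itself if none exist.

    `homologs` is a hypothetical lookup from an original sequence to the
    homologous sequences found for it in other species.
    """
    candidates = homologs.get(seq, [])
    return rng.choice(candidates) if candidates else seq


def make_batch(examples, homologs, rng, augment=True):
    """Build (sequence, label) pairs, optionally swapping in homologs.

    Assumption: a homolog keeps the label measured for the original sequence.
    """
    return [
        (phylo_augment(seq, homologs, rng) if augment else seq, label)
        for seq, label in examples
    ]


def training_schedule(examples, homologs, n_aug_epochs, n_finetune_epochs, seed=0):
    """Yield (phase, batch) pairs: augmented epochs first, then fine-tuning
    epochs that use only the original sequences."""
    rng = random.Random(seed)
    for _ in range(n_aug_epochs):
        yield "augmented", make_batch(examples, homologs, rng, augment=True)
    for _ in range(n_finetune_epochs):
        yield "finetune", make_batch(examples, homologs, rng, augment=False)


if __name__ == "__main__":
    # Toy data for illustration only.
    toy_examples = [("ACGT", 1.0), ("TTGA", 0.0)]
    toy_homologs = {"ACGT": ["ACGA", "ACCT"]}
    for phase, batch in training_schedule(toy_examples, toy_homologs, 2, 1):
        print(phase, batch)
```

In a real pipeline, each batch would be one-hot encoded and passed to the model's training step. Sampling a fresh homolog every epoch, rather than enlarging the dataset once up front, keeps the number of examples per epoch fixed while still exposing the model to the extra sequence variation.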

Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences

Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types

Assessing the reliability of point mutation as data augmentation for deep learning with genomic data

Learning from Data-Rich Problems: A Case Study on Genetic Variant Calling

EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow

AIKYATAN: mapping distal regulatory elements using convolutional learning on GPU

Pre-training with pseudo-labeling compares favorably with large language models for regulatory sequence prediction

Graphylo: A deep learning approach for predicting regulatory DNA and RNA sites from whole-genome multiple alignments

In silico generation and augmentation of regulatory variants from massively parallel reporter assay using conditional variational autoencoder

Deep learning approaches for non-coding genetic variant effect prediction: current progress and future prospects

Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models

Advancing regulatory genomics with machine learning

Machine Learning for Large-Scale Genomics: Algorithms, Models and Applications

Training deep learning models on personalized genomic sequences improves variant effect prediction

Semi-supervised deep learning with graph neural network for cross-species regulatory sequence prediction

Predictive analyses of regulatory sequences with EUGENe

A Deep Learning-Based Sequence Analyzer Incorporating the Transcription Factor Binding Affinity to Dissect the Effects of Non-Coding Genetic Variants

Supervised learning on phylogenetically distributed data

Data Augmentation Enhances Plant-Genomic-Enabled Predictions