Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation

Andrew G Duncan, Jennifer A Mitchell, Alan M Moses
DOI: https://doi.org/10.1093/bioinformatics/btae190
IF: 5.8
2024-03-29
Bioinformatics
Abstract:

Motivation: Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence to regulatory function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training, however current data augmentation methods for genomic sequence data are limited.

Results: Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics.

Availability and implementation: The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?
The problem this paper addresses is how to increase the diversity of genomic sequence data using evolutionarily related (homologous) sequences, and thereby improve the performance of supervised deep learning on functional genomics prediction tasks. Specifically, the paper proposes a method called "phylogenetic augmentation", which increases the diversity of the training data by extracting homologous sequences from multi-species genome alignments. The method aims to overcome the limitations of existing data augmentation techniques for genomic sequence data and to improve the data efficiency and performance of deep-learning models on small datasets.

### Main Research Contents

1. **Background and Motivation**:
   - Supervised deep learning is widely used to model the complex relationship between genomic sequence and regulatory function.
   - Understanding how these models make predictions can provide biological insight into regulatory functions.
   - A single genome may not contain enough sequence variation to train models of suitable complexity.
   - Current data augmentation methods for genomic sequence data are limited.
2. **Methods**:
   - **Phylogenetic Augmentation**: Increase the diversity of the training data by extracting homologous sequences from multi-species genome alignments.
   - **Experimental Design**: Run experiments with multiple convolutional neural network (CNN) architectures on different genomic datasets, including STARR-seq data from Drosophila S2 cells, DNase-seq data from human cell lines, and RNA-binding data for yeast 3′UTRs.
   - **Model Training**: Apply phylogenetic augmentation during training, then fine-tune on the original data to further improve performance.
3. **Results**:
   - **Performance Improvement**: Phylogenetic augmentation improves the test-set prediction performance of multiple CNN models, especially on small datasets.
   - **Data Efficiency**: Phylogenetic augmentation improves data efficiency and can rescue model performance when the training set is down-sampled.
   - **Practical Application**: On the yeast 3′UTR dataset, phylogenetic augmentation enables a deep-learning model to predict binding of the RNA-binding protein PUF3 with improved performance.

### Conclusion

Phylogenetic augmentation is an effective data augmentation method that improves the performance of supervised deep-learning models on functional genomics prediction tasks, especially on small datasets. It increases the diversity of the training data by introducing evolutionarily related sequences, which improves the model's ability to generalize and its prediction accuracy.
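
The phylogenetic augmentation step summarized under Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is in the GitHub repository named in the abstract). It assumes homologs have already been extracted from a multi-species alignment, that each training sequence may be swapped for a randomly chosen homolog, and that the homolog inherits the label measured for the reference sequence. The function name `phylo_augment_batch`, the `p_augment` parameter, the dictionary layout, and the toy sequences are all hypothetical.

```python
import random


def phylo_augment_batch(batch_ids, homologs, labels, rng, p_augment=1.0):
    """Build one training batch, swapping sequences for homologs.

    batch_ids: region identifiers in this batch
    homologs:  region id -> list of sequences; index 0 is the reference
               species, the rest are homologs (hypothetical layout)
    labels:    region id -> measurement for the reference sequence
    p_augment: probability of swapping a sequence for one of its homologs
    """
    seqs, ys = [], []
    for region in batch_ids:
        candidates = homologs[region]
        others = candidates[1:]
        if others and rng.random() < p_augment:
            seq = rng.choice(others)   # use a sequence from another species
        else:
            seq = candidates[0]        # fall back to the reference sequence
        seqs.append(seq)
        ys.append(labels[region])      # assumption: label is inherited unchanged
    return seqs, ys


# Toy data, made up for illustration only.
homologs = {
    "enhancer_1": ["ACGTAC", "ACGTTC", "ACCTAC"],
    "enhancer_2": ["TTGACA"],  # no homologs available for this region
}
labels = {"enhancer_1": 2.5, "enhancer_2": -0.3}

rng = random.Random(0)
batch = ["enhancer_1", "enhancer_2"]
aug_seqs, aug_ys = phylo_augment_batch(batch, homologs, labels, rng)
ref_seqs, ref_ys = phylo_augment_batch(batch, homologs, labels, rng, p_augment=0.0)
```

In this sketch, calling the function with `p_augment=0.0` returns only the reference sequences, which is one way a fine-tuning pass on the original data could reuse the same batching code.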