Abstract:

Background: Although quality controls are applied at multiple stages of sample preparation and data analysis to ensure the reproducibility and reliability of RNA-seq results, limitations and biases remain in the detectability of certain differentially expressed genes (DEGs). Whether the transcriptional dynamics of a gene are captured accurately depends on the experimental design and operation and on the subsequent data analysis. Each step of the downstream workflow, including read alignment, transcript quantification, normalization, and the statistical method used to call DEGs, can influence the accuracy and sensitivity of DEG analysis and produce false positives or false negatives. Machine learning (ML) is a multidisciplinary field that draws on computer science, artificial intelligence, computational statistics, and information theory to construct algorithms that learn from existing data sets and make predictions on new ones. ML-based differential network analysis has been applied to predict stress-responsive genes by learning the patterns of 32 expression characteristics of known stress-related genes. In addition, because epigenetic regulation plays critical roles in gene expression, DNA and histone methylation data have been shown to be powerful inputs for ML-based models that predict gene expression in many systems, including lung cancer cells. ML-based methods are therefore promising for identifying DEGs that traditional RNA-seq analysis misses.

Results: We identified the 23 most informative features by assessing the performance of three feature selection algorithms combined with five classification methods on training and testing data sets. Through comprehensive comparison, we found that the model based on InfoGain feature selection and Logistic Regression classification is powerful for DEG prediction. Moreover, the power and performance of the ML-based prediction were validated by predicting ethylene-regulated gene expression and following up with qRT-PCR.

Conclusions: Our study shows that combining an ML-based method with RNA-seq greatly improves the sensitivity of DEG identification.
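
To make the InfoGain feature ranking named in the Results more concrete, the following is a minimal sketch of information-gain scoring in plain Python. It is not the authors' implementation: the feature names, the toy values, and the choice to binarize each numeric feature at its median are all assumptions made for illustration, and the study's actual discretization and tooling are not stated in the abstract.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of one numeric feature, binarized at its median.

    The median split is an illustrative simplification, not the paper's method.
    """
    cut = median(values)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v > cut, []).append(y)
    # Entropy of the labels that remains after splitting on the feature.
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels, top_k):
    """Return the top_k feature names sorted by decreasing information gain."""
    scores = {name: info_gain(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: 1 = differentially expressed, 0 = not.
labels = [1, 1, 1, 0, 0, 0]
table = {
    "histone_mark_signal": [9.1, 8.7, 7.9, 1.2, 0.8, 1.5],  # separates the classes
    "gene_length_kb": [2.0, 5.1, 3.3, 2.1, 5.0, 3.2],       # nearly uninformative
}
top = rank_features(table, labels, top_k=1)
print(top)
```

In a workflow like the one described, the highest-ranked features (23 in the study) would then be passed to a classifier such as logistic regression. That step is omitted here to keep the sketch short.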

Identifying transcription factor-DNA interactions using machine learning

Modelling the transcription factor DNA-binding affinity using genome-wide ChIP-based data

GenomicLinks: deep learning predictions of 3D chromatin interactions in the maize genome

The DNA binding landscape of the maize AUXIN RESPONSE FACTOR family

Using DNase digestion data to accurately identify transcription factor binding sites.

Ensemble Machine Methods for Analysis of Transcription Factor and DNA Interactions

Genome-wide mapping of transcriptional enhancer candidates using DNA and chromatin features in maize

Transcription factor binding site divergence across maize inbred lines drives transcriptional and phenotypic variation

Using sequence-specific chemical and structural properties of DNA to predict transcription factor binding sites

GenomicLinks: Deep learning predictions of 3D chromatin loops in the maize genome

Measuring Specific Interaction of Transcription Factor ZmDREB1A with Its DNA Responsive Element at the Molecular Level

Understanding Variation in Transcription Factor Binding by Modeling Transcription Factor Genome-Epigenome Interactions.

Identification of Plant Transcription Factor DNA-Binding Sites Using seq-DAP-seq

Using Deep Learning to Predict Transcription Factor Binding Sites Based on Multiple-omics Data

Chromatin Signature and Transcription Factor Binding Provide a Predictive Basis for Understanding Plant Gene Expression.

A Deep Learning-Based Sequence Analyzer Incorporating the Transcription Factor Binding Affinity to Dissect the Effects of Non-Coding Genetic Variants

Predicting the DNA binding specificity of mutated transcription factors using family-level biophysically interpretable machine learning

RNA-seq assistant: machine learning based methods to identify more transcriptional regulated genes

Comparative analysis of models in predicting the effects of SNPs on TF-DNA binding using large-scale in vitro and in vivo data

Transfer learning and DNA language models enhance transcription factor binding predictions

Prediction of condition-specific regulatory genes using machine learning