Models trained to predict differential expression across plant organs identify distal and proximal regulatory regions

Michael C Tross, Gavin Duggan, Nikee Shrestha, James C Schnable
DOI: https://doi.org/10.1101/2024.06.04.597477
2024-06-06
Abstract: A large proportion of standing phenotypic variation is explained by genetic variation in noncoding regulatory regions. However, tools for the automated identification and characterization of noncoding regulatory sequences in genomes have lagged far behind those employed to annotate and predict the functions of protein-coding sequences. We developed a modified transformer model and trained it to predict relative patterns of expression across a diverse set of tissues given a large sequence window for each gene of interest in the maize (Zea mays) genome. Nucleotides in the input DNA sequence with high saliency in gene expression pattern prediction overlapped with regions identified via comparative genomic or chromatin-based approaches as potential regulatory sequences. High-saliency regions identified in a second species, sorghum (Sorghum bicolor), without species-specific training were also associated with potential regulatory sequences in noncoding regions upstream and downstream of each gene of interest. The potential impact of a scalable and transferable approach to identifying regulatory sequences using saliency calculated from large-context-window models spans multiple applications. Specific use cases could include genome annotation, interpretation of natural genetic variation, and targeted editing in noncoding regions to alter patterns or levels of gene expression.
Plant Biology