Abstract: The spatial organization of the genome within the nucleus is partially determined by its interactions with distinct nuclear subcompartments, such as the nuclear lamina and nuclear speckles, which play key roles in gene regulation during development. However, whether these genome-nuclear subcompartment interactions are encoded in the underlying DNA sequence remains poorly understood. The mechanisms for gene regulation are primarily encoded in noncoding DNA sequences, but deciphering how these sequence features control gene expression remains a significant challenge in genomics. Here, we present Nucleotide GPT, a transformer-based model that predicts genomic associations with spatially distinct, physical nuclear subcompartments from DNA sequence alone. Nucleotide GPT is pre-trained on a diverse set of multi-species genomes, and we demonstrate its genomic understanding through evaluation on diverse prediction tasks, including histone modifications, promoter detection, and transcription factor binding sites. When fine-tuned to predict genome interactions with two separate nuclear subcompartments (the lamina of the inner nuclear membrane and the more interior nuclear speckles), Nucleotide GPT achieves accuracies of 73.6% for lamina-associated domains (LADs) and 79.4% for speckle-associated domains (SPADs), each averaged across three cortical development cell types. Analysis of the model's learned representations through Uniform Manifold Approximation and Projection (UMAP) reveals that Nucleotide GPT develops internal embeddings that effectively distinguish LADs from inter-LADs, with predicted probabilities closely corresponding to experimentally determined LAD classifications.
When these representations are examined for cell type-invariant constitutive LADs (cLADs) versus cell type-specific LADs, the model assigns lower confidence scores to cell type-specific LADs than to cLADs, which are conserved across neuronal differentiation, suggesting that sequence features may play a stronger role in maintaining cLAD associations. Examination of the model's attention patterns at correctly classified regions suggests that specific sequence elements govern model decision making about nuclear subcompartment associations. Our results demonstrate the utility of transformer architectures for studying three-dimensional (3D) genome organization and substantiate a role for DNA sequence in determining nuclear subcompartment associations.
What problem does this paper attempt to address?
The core question of this paper is whether DNA sequence contains enough information to predict the spatial associations between the genome and specific nuclear subcompartments, such as the nuclear lamina and nuclear speckles. To test this, the researchers developed Nucleotide GPT, a deep-learning model based on the Transformer architecture that predicts these associations from DNA sequence alone. Answering this question matters for understanding how gene expression is regulated, especially the role of noncoding DNA sequences in the three-dimensional structure of the genome.
### Background of the Paper
- **Genome Organization**: The three-dimensional organization of chromatin in the cell nucleus is crucial for gene expression regulation. Different nuclear substructures, such as the nuclear lamina and nuclear speckles, play key roles in genome organization and gene regulation.
- **Lamina-Associated Domains (LADs)**: Large chromatin regions associated with the nuclear lamina, usually characterized by low gene density and transcriptional repression.
- **Speckle-Associated Domains (SPADs)**: Chromatin regions associated with nuclear speckles, which are membrane-less structures located in the nuclear interior and rich in pre-mRNA splicing factors and other proteins involved in gene expression. Genes within SPADs are highly expressed and show more splicing isoform diversity.
- **The Role of DNA Sequences**: Although the importance of nuclear subcompartments in genome organization is widely recognized, it remains unclear whether the associations between the genome and these subcompartments are encoded in the DNA sequence.
### Research Methods
- **Model Architecture**: Nucleotide GPT is a decoder model trained in two stages: pre-training and fine-tuning. In the pre-training stage, reference genomes from multiple species are used for self-supervised language modeling, so that the model develops a broad understanding of genome sequences. In the fine-tuning stage, the model is trained for specific tasks, such as the classification of LADs and SPADs.
- **Dataset**: The pre-training dataset includes the reference genomes of humans, mice, macaques, zebrafish, and fruit flies. The fine-tuning dataset comes from three cell types in the developing human cerebral cortex: radial glial cells (RG), intermediate progenitor cells (IPC), and excitatory neurons (eN).
- **Evaluation**: The model was evaluated on multiple genome prediction tasks, including core promoter detection, standard promoter detection, splice site prediction, transcription factor binding site prediction, and histone modification prediction.
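
The two training stages described above can be illustrated with a minimal sketch of how a DNA sequence might be turned into training examples for each stage. This is only an illustration under stated assumptions: the single-nucleotide vocabulary `VOCAB`, the helper names, and the label convention (1 = LAD, 0 = inter-LAD) are hypothetical and are not taken from the paper, whose actual tokenizer and data pipeline may differ.

```python
# Hypothetical single-nucleotide vocabulary (an assumption for illustration;
# the paper's actual tokenization scheme may differ).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize(seq):
    """Map a DNA string to integer token ids; unknown symbols map to N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def causal_lm_pairs(token_ids):
    """Pre-training view: at each position, the target is the next token."""
    return token_ids[:-1], token_ids[1:]


def classification_example(seq, label):
    """Fine-tuning view: a whole sequence window paired with one label.

    The label convention (1 = LAD, 0 = inter-LAD) is an assumption.
    """
    return {"input_ids": tokenize(seq), "label": label}


if __name__ == "__main__":
    inputs, targets = causal_lm_pairs(tokenize("ACGTN"))
    print("pre-training inputs: ", inputs)
    print("pre-training targets:", targets)
    print("fine-tuning example: ", classification_example("acgt", 1))
```

The same tokenized sequence serves both stages: pre-training needs no labels because the sequence supplies its own targets, while fine-tuning attaches one experimentally derived label to each window.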
### Main Results
- **Performance Comparison**: On multiple genome prediction tasks, Nucleotide GPT performs comparably to existing state-of-the-art genomic models, especially in non-TATA promoter detection, splice site prediction, and transcription factor binding site prediction.
- **Nuclear Lamina Association Patterns**: The fine-tuned Nucleotide GPT can effectively identify sequences belonging to LADs, inter-LADs, and LAD boundaries. Performance varies by cell type, with the best performance in radial glial cells (RG) and relatively poor performance in excitatory neurons (eN).
- **Model Explanation**: Through UMAP visualization and attention score analysis, the study found that Nucleotide GPT learns sequence features that distinguish LADs from inter-LADs, and that its classification confidence is higher on constitutive LADs (cLADs) than on cell-type-specific LADs.
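
The confidence comparison between cLADs and cell-type-specific LADs can be sketched as a simple grouping of predicted probabilities. The sketch below is hypothetical: the function names, the 0.5 decision threshold, and the toy probabilities are placeholders chosen for illustration, not values or code from the paper.

```python
from statistics import mean


def accuracy(probs, labels, threshold=0.5):
    """Fraction of windows whose thresholded prediction matches the label.

    The 0.5 threshold is an assumption, not a value from the paper.
    """
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def mean_confidence_by_group(probs, groups):
    """Average predicted LAD probability within each region category."""
    by_group = {}
    for p, g in zip(probs, groups):
        by_group.setdefault(g, []).append(p)
    return {g: mean(values) for g, values in by_group.items()}


if __name__ == "__main__":
    # Toy placeholder values, NOT results from the paper.
    probs = [0.9, 0.8, 0.6, 0.55, 0.2, 0.4]
    labels = [1, 1, 1, 1, 0, 0]
    groups = ["cLAD", "cLAD", "specific", "specific", "inter", "inter"]
    print("accuracy:", accuracy(probs, labels))
    print("mean confidence:", mean_confidence_by_group(probs, groups))
```

With real model outputs, a lower mean probability for the cell-type-specific group than for the cLAD group would correspond to the pattern the study reports.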
### Conclusion
Using Nucleotide GPT, this study provides evidence that DNA sequences contain information about nuclear subcompartment associations, which offers a new perspective for understanding the three-dimensional structure of the genome and gene expression regulation. The model's results show the potential of the Transformer architecture in genomics and lay a foundation for further exploration of genome function and disease mechanisms.