LA4SR: illuminating the dark proteome with generative AI

David R. Nelson, Ashish Kumar Jaiswal, Noha Ismail, Alexandra Mystikou, Kourosh Salehi-Ashtiani
2024-11-11
Abstract: AI language models (LMs) show promise for biological sequence analysis. We re-engineered open-source LMs (GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba, ranging from 70M to 12B parameters) for microbial sequence classification. The models achieved F1 scores up to 95 and operated 16,580x faster and at 2.9x the recall of BLASTP. They effectively classified the algal dark proteome (uncharacterized proteins comprising about 65% of total proteins), as validated on new data including a new, complete Hi-C/PacBio Chlamydomonas genome. Larger (>1B parameters) LA4SR models reached high accuracy (F1 > 86) when trained on less than 2% of available data, rapidly achieving strong generalization capacity. High accuracy was achieved whether training data had intact or scrambled terminal information, demonstrating robust generalization to incomplete sequences. Finally, we provide custom AI explainability software tools for attributing amino acid patterns to AI generative processes and interpreting their outputs in evolutionary and biophysical contexts.
Genomics, Artificial Intelligence, Computation and Language, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses three problems in microbial genomics:

1. **Insufficient classification accuracy**: Traditional bioinformatics tools such as BLAST and Kraken often fail to classify proteins in microalgal genomes accurately, particularly when the sequences are novel or highly variable. These tools rely on sequence homology and k-mer frequencies, which perform poorly on complex and diverse microalgal genomes.
2. **The "dark proteome"**: About 65% of proteins in microalgal genomes are uncharacterized (the "dark proteome") and have no matches in traditional alignment-based methods. This makes their functions and evolutionary history difficult to study.
3. **Low computational efficiency**: Traditional methods are slow on large-scale datasets and cannot meet the demands of high-throughput analysis.

To address these problems, the paper proposes the LA4SR framework, which uses generative AI language models to classify and analyze microbial sequences. The framework contributes the following:

- **High-accuracy classification**: By re-engineering open-source language models (GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba), the LA4SR models reached F1 scores of up to 95 in classification tasks.
- **Fast processing**: The LA4SR models ran 16,580 times faster than BLASTP and at 2.9 times its recall, enabling efficient large-scale analysis.
- **Robustness**: Larger models (more than 1B parameters) reached F1 > 86 when trained on less than 2% of the available data, and accuracy stayed high whether the training data had intact or scrambled terminal information, which indicates strong generalization to incomplete sequences.
- **Interpretability**: The paper provides custom AI explainability tools that attribute amino acid patterns to the model's generative process and interpret model outputs in evolutionary and biophysical contexts.

In summary, the LA4SR framework improves both the accuracy and the speed of microbial sequence classification relative to alignment-based methods, and it provides new tools for studying the "dark proteome."
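
The F1 and recall figures quoted above can be made concrete with a small sketch. The Python below computes precision, recall, and F1 for one class from paired true and predicted labels. It is only an illustration of the standard definitions, not the paper's evaluation code: the function name `precision_recall_f1`, the labels `"algal"` and `"bacterial"`, and the toy data are all hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the `positive` class.

    Standard definitions, with 0.0 returned wherever a denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels (not data from the paper).
y_true = ["algal", "algal", "algal", "algal", "bacterial", "bacterial"]
y_pred = ["algal", "algal", "algal", "bacterial", "bacterial", "algal"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="algal")
f1_percent = 100 * f1  # F1 on a 0-100 scale, as in "F1 scores up to 95"
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1_percent:.1f}")
```

Under these definitions, a statement such as "2.9x the recall of BLASTP" would correspond to the ratio of two recall values, presumably computed for the two methods on the same test set.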