Autoregressive Enzyme Function Prediction with Multi-scale Multi-modality Fusion

Dingyi Rong, Wenzhuo Zheng, Bozitao Zhong, Zhouhan Lin, Liang Hong, Ning Liu
2024-08-11
Abstract: Accurate prediction of enzyme function is crucial for elucidating biological mechanisms and driving innovation across various sectors. Existing deep learning methods tend to rely solely on either sequence data or structural data and predict the EC number as a whole, neglecting the intrinsic hierarchical structure of EC numbers. To address these limitations, we introduce MAPred, a novel multi-modality and multi-scale model designed to autoregressively predict the EC number of proteins. MAPred integrates both the primary amino acid sequence and the 3D tokens of proteins, employing a dual-pathway approach to capture comprehensive protein characteristics and essential local functional sites. Additionally, MAPred utilizes an autoregressive prediction network to sequentially predict the digits of the EC number, leveraging the hierarchical organization of EC classifications. Evaluations on benchmark datasets, including New-392, Price, and New-815, demonstrate that our method outperforms existing models, marking a significant advance in the reliability and granularity of protein function prediction within bioinformatics.
Quantitative Methods, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses several issues in enzyme function prediction, specifically the accurate prediction of Enzyme Commission (EC) numbers:

1. **Combining Sequence and Structure Information**: Existing deep learning methods often rely on either protein sequence data or structure data alone, missing the fuller picture that the two provide together. The proposed method (MAPred) integrates the primary amino acid sequence with the three-dimensional structure of the protein (represented by 3Di tokens) to obtain a more complete protein characterization.
2. **Multimodal and Multiscale Fusion**: To capture both the global properties of a protein and the features of its local functional sites, the paper proposes a dual-path feature extraction network with a global feature extraction path and a local feature extraction path. This lets the model represent the protein at more than one scale.
3. **Utilizing Autoregressive Prediction**: Existing methods typically predict the EC number as a whole, ignoring its inherent hierarchical structure. MAPred adopts an autoregressive strategy that predicts the digits of the EC number one at a time, each conditioned on the digits already predicted, so that the hierarchical organization of the EC classification is used directly.
4. **Improving Prediction Accuracy**: Through these design choices, the paper aims to improve the reliability and granularity of protein function prediction, particularly for EC numbers that are rare or have few training samples.

In summary, this research aims to improve enzyme function prediction, especially the accuracy of EC number prediction, by introducing a multimodal, multiscale autoregressive prediction framework. This approach is expected to advance the understanding of enzyme catalytic mechanisms, substrate specificity, and potential industrial applications in the field of bioinformatics.
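
To make the autoregressive idea concrete, here is a minimal Python sketch of predicting an EC number one digit at a time. It is not MAPred's network: `build_prefix_tree`, `greedy_decode`, `toy_score`, and the hand-written score table are hypothetical names and values chosen for illustration, and restricting each step to digits seen under the current prefix is an assumption of this sketch rather than something the summary states. In a real model, the scoring function would be a learned network conditioned on the protein features and on the digits chosen so far.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Prefix = Tuple[str, ...]


def build_prefix_tree(ec_numbers: Iterable[str]) -> Dict[Prefix, Set[str]]:
    """Map each EC prefix to the set of digits that may follow it."""
    tree: Dict[Prefix, Set[str]] = {}
    for ec in ec_numbers:
        digits = tuple(ec.split("."))
        for i, digit in enumerate(digits):
            tree.setdefault(digits[:i], set()).add(digit)
    return tree


def greedy_decode(
    score_fn: Callable[[Prefix, str], float],
    tree: Dict[Prefix, Set[str]],
    depth: int = 4,
) -> str:
    """Pick the best-scoring digit at each level, conditioned on the prefix."""
    prefix: Prefix = ()
    for _ in range(depth):
        candidates = sorted(tree.get(prefix, ()))
        if not candidates:
            break  # no known continuation under this prefix
        best = max(candidates, key=lambda d: score_fn(prefix, d))
        prefix = prefix + (best,)
    return ".".join(prefix)


# Toy label set (illustrative only).
KNOWN_ECS = ["1.1.1.1", "1.1.1.2", "2.7.1.1", "2.7.11.1", "3.4.21.4"]

# Hypothetical stand-in for a learned model: score of the next digit given the prefix.
TOY_SCORES: Dict[Tuple[Prefix, str], float] = {
    ((), "2"): 0.9,
    (("2",), "7"): 1.0,
    (("2", "7"), "1"): 0.2,
    (("2", "7"), "11"): 0.8,
    (("2", "7", "11"), "1"): 1.0,
}


def toy_score(prefix: Prefix, digit: str) -> float:
    return TOY_SCORES.get((prefix, digit), 0.0)


tree = build_prefix_tree(KNOWN_ECS)
print(greedy_decode(toy_score, tree))  # → 2.7.11.1
```

Because each step only considers digits that occur under the already chosen prefix, the decoder in this sketch cannot emit a combination that is absent from the label set, which is one way a digit-by-digit scheme can exploit the EC hierarchy. Greedy selection is the simplest choice; beam search over prefixes would be a natural alternative.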