Gene Ontology GAN (GOGAN): a novel architecture for protein function prediction

Musadaq Mansoor, Mohammad Nauman, Hafeez Ur Rehman, Alfredo Benso
DOI: https://doi.org/10.1007/s00500-021-06707-z
IF: 3.732
2022-01-10
Soft Computing
Abstract: Precise annotation of protein functions is one of the most important prerequisites for a deep interpretation of molecular biology. An overwhelming majority of proteins, across species, lack sufficient supplementary information and therefore remain uncharacterized. In contrast, all known proteins have one key piece of information available: their amino acid sequence. To make algorithms applicable to proteins from many different species, researchers are therefore motivated to develop computational techniques that characterize proteins using their amino acid sequence. However, computational techniques such as deep learning algorithms require a huge amount of labeled data to produce good results, and labeling is time- and resource-consuming, which makes labeled data scarce. To address both issues, uncharacterized proteins and the labeled-data demands of traditional deep learning algorithms, we propose a model called GOGAN that operates on the amino acid sequence of a protein to predict its functions. GOGAN does not require any handcrafted features; instead, it automatically extracts all the required information from the input sequence. The model extracts features from massive unlabeled protein datasets, where “unlabeled data” refers to data that have not been assigned labels identifying their characteristics or properties. The features extracted by GOGAN can also be used in other applications such as gene variation analysis, gene expression analysis, and gene regulation network detection. The proposed model is benchmarked on the Homo sapiens protein dataset extracted from the UniProt database. Experimental results show clear improvements in different evaluation metrics when compared with other methods. Overall, GOGAN achieves an F1 score of 72.1% with a Hamming loss of 9.5%, using only the amino acid sequences of proteins.
computer science, artificial intelligence, interdisciplinary applications
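Note on the reported metrics: protein function prediction is a multi-label task, since one protein can carry several Gene Ontology terms at once. The sketch below shows how an F1 score and a Hamming loss could be computed for such predictions. It is an illustration only: the toy label matrices and the helper names `hamming_loss` and `micro_f1` are hypothetical, and micro-averaging is an assumption, because the abstract does not state which F1 averaging the authors used.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual (protein, label) slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (not from the paper): 3 proteins x 4 hypothetical GO terms.
# A 1 means the protein is annotated with (or predicted to have) that term.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 1]]

print(f"Hamming loss: {hamming_loss(y_true, y_pred):.3f}")
print(f"Micro F1:     {micro_f1(y_true, y_pred):.3f}")
```

A lower Hamming loss and a higher F1 are both better. The two are reported together because Hamming loss counts every label slot equally, including the many correctly predicted absent terms, while F1 focuses on the terms that are actually present or predicted.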