Feature extraction using Spectral Clustering for Gene Function Prediction using Hierarchical Multi-label Classification

Miguel Romero, Oscar Ramírez, Jorge Finke, Camilo Rocha
DOI: https://doi.org/10.48550/arXiv.2203.13551
2022-04-29
Abstract: Gene annotation addresses the problem of predicting unknown associations between genes and functions (e.g., biological processes) of a specific organism. Despite recent advances, the cost and time demanded by annotation procedures that rely largely on in vivo biological experiments remain prohibitively high. This paper presents a novel in silico approach to the annotation problem that combines cluster analysis and hierarchical multi-label classification (HMC). The approach uses spectral clustering to extract new features from the gene co-expression network (GCN) and enrich the prediction task. HMC is used to build multiple estimators that consider the hierarchical structure of gene functions. The proposed approach is applied to a case study on Zea mays, one of the most dominant and productive crops in the world. The results illustrate how in silico approaches are key to reducing the time and costs of gene annotation. More specifically, they highlight the importance of: (i) building new features that represent the structure of gene relationships in GCNs to annotate genes; and (ii) taking into account the structure of biological processes to obtain consistent predictions.
Machine Learning
### What problems does this paper attempt to solve?

This paper addresses gene function prediction. Specifically, it combines spectral clustering and hierarchical multi-label classification (HMC) to improve the efficiency and accuracy of gene annotation. The main problems it raises are:

1. **High cost and long time requirements**: Traditional gene annotation methods based on in vivo biological experiments are expensive and slow. To overcome this limitation, researchers have proposed hybrid methods that combine existing knowledge with in silico methods.
2. **Ignoring the hierarchical relationships between gene functions**: Existing gene annotation methods usually ignore the hierarchical structure of gene functions, which may lead to inconsistent predictions. For example, if a gene is predicted to have a function a but is not predicted to have all of a's ancestor functions, the prediction is inconsistent. Satisfying this ancestral constraint (i.e., the true-path rule) is important for prediction accuracy and consistency.
3. **Extracting new features from gene co-expression networks**: Gene co-expression networks (GCNs) carry rich information, but extracting features from them that are useful for gene function prediction remains a challenge. The paper proposes a method based on spectral clustering for extracting new features from GCNs that better represent the structure of relationships between genes.
4. **Improving prediction performance**: By combining the new features extracted from GCNs with a classification method that accounts for the hierarchical relationships between gene functions, the paper aims to improve the overall performance of gene function prediction.
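
The true-path rule from item 2 can be made concrete with a small check. The sketch below is only an illustration of the consistency constraint, not the paper's HMC method: the toy hierarchy `PARENTS`, its labels, and the helper names `ancestors`, `violations`, and `close_under_ancestors` are all hypothetical. It assumes the hierarchy is a directed acyclic graph in which a function may have more than one parent.

```python
from typing import Dict, Set

# Hypothetical toy hierarchy: each function maps to its direct parents.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a", "b"},  # a function may have several parents (DAG)
    "d": {"c"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every function reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def violations(predicted: Set[str], parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each predicted function to its ancestors missing from the prediction."""
    report: Dict[str, Set[str]] = {}
    for term in predicted:
        missing = ancestors(term, parents) - predicted
        if missing:
            report[term] = missing
    return report


def close_under_ancestors(predicted: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """One simple repair: add every missing ancestor to the prediction."""
    closed = set(predicted)
    for term in predicted:
        closed |= ancestors(term, parents)
    return closed


inconsistent = {"d"}  # predicts "d" without any of its ancestors
print(violations(inconsistent, PARENTS))
print(close_under_ancestors(inconsistent, PARENTS))
```

Adding the missing ancestors is only one way to restore consistency; another is to drop any predicted function whose ancestors are not all predicted. Which is preferable depends on how much the individual predictions are trusted.
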
### Specific solutions in the paper

To solve the above problems, the paper proposes the following:

- **Spectral clustering**: Use the spectral clustering algorithm to extract new features from GCNs. These features capture the co-expression relationships between genes and enrich the information available to the prediction task.
- **Hierarchical multi-label classification (HMC)**: Build multiple estimators that take the hierarchical structure of gene functions into account, so that predictions satisfy the true-path rule and are therefore more consistent and accurate.
- **Application case study**: Apply the method to gene function prediction in Zea mays, demonstrating how computational methods can reduce the time and cost of gene annotation.

Through these methods, the paper not only improves the performance of gene function prediction but also provides a framework that can be applied to gene annotation tasks in other species.
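
The idea of deriving features from the structure of a GCN with spectral methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the six-gene adjacency matrix `TOY_GCN` is invented, the function names `spectral_embedding` and `two_way_split` are hypothetical, and NumPy is assumed to be available. A full spectral clustering would typically run k-means on the embedding; the sign-based split used here only covers the two-cluster case.

```python
import numpy as np

# Hypothetical toy co-expression network: two triangles of genes
# (0, 1, 2) and (3, 4, 5) joined by a single edge between genes 2 and 3.
TOY_GCN = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])


def spectral_embedding(adjacency, n_components=1):
    """Embed each gene using eigenvectors of the normalized graph Laplacian."""
    a = np.asarray(adjacency, dtype=float)
    degree = a.sum(axis=1)
    # Isolated genes (degree 0) get a zero scaling to avoid division by zero.
    inv_sqrt = np.zeros_like(degree)
    connected = degree > 0
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    # Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
    laplacian = np.eye(len(a)) - inv_sqrt[:, None] * a * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; skip the trivial first one.
    _, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvectors[:, 1:1 + n_components]


def two_way_split(embedding):
    """Assign each gene to one of two clusters by the sign of the first coordinate."""
    return (embedding[:, 0] > 0).astype(int)


embedding = spectral_embedding(TOY_GCN, n_components=1)
labels = two_way_split(embedding)
# One feature row per gene: spectral coordinate plus cluster membership.
features = np.column_stack([embedding, labels])
print(labels)
print(features.shape)
```

On this toy network the two triangles fall into different clusters. Which triangle receives label 0 or 1 is arbitrary, because the sign of an eigenvector is not fixed. The resulting rows of `features` could then be appended to whatever other per-gene features the classifiers use.
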