DeepProphet2 -- A Deep Learning Gene Recommendation Engine

Daniele Brambilla, Davide Maria Giacomini, Luca Muscarnera, Andrea Mazzoleni
DOI: https://doi.org/10.48550/arXiv.2208.01918
2023-03-22
Abstract: New powerful tools for tackling life science problems have been created by recent advances in machine learning. The purpose of the paper is to discuss the potential advantages of gene recommendation performed by artificial intelligence (AI). Indeed, gene recommendation engines try to solve this problem: if the user is interested in a set of genes, which other genes are likely to be related to the starting set and should be investigated? This task was solved with a custom deep learning recommendation engine, DeepProphet2 (DP2), which is freely available to researchers worldwide at https://www.generecommender.com?utm_source=DeepProphet2_paper&utm_medium=pdf. Hereafter, insights behind the algorithm and its practical applications are illustrated. The gene recommendation problem can be addressed by mapping the genes to a metric space where a distance can be defined to represent the real semantic distance between them. To achieve this objective, a transformer-based model has been trained on a well-curated, freely available paper corpus, PubMed. The paper describes multiple optimization procedures that were employed to obtain the best bias-variance trade-off, focusing on embedding size and network depth. In this context, the model's ability to discover sets of genes implicated in diseases and pathways was assessed through cross-validation. A simple assumption guided the procedure: the network had no direct knowledge of pathways and diseases but learned genes' similarities and the interactions among them. Moreover, to further investigate the space where the neural network represents genes, the dimensionality of the embedding was reduced, and the results were projected onto a human-comprehensible space. In conclusion, a set of use cases illustrates the algorithm's potential applications in a real-world setting.
Quantitative Methods, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
This paper aims to solve the gene recommendation problem: if a user is interested in a specific set of genes, which other genes may be related to this set and are worthy of further study? Specifically, the paper proposes a deep-learning-based gene recommendation engine, DeepProphet2 (DP2), which can recommend related genes by analyzing the semantic distance between genes.

### Problems the paper attempts to solve

1. **How to effectively encode knowledge**: The paper explores how to effectively encode the knowledge of genes and their inter-relationships in the model, so that the model can understand the complex relationships between genes and make reasonable recommendations.
2. **How to understand the system reasoning process**: The paper also focuses on how to explain the reasoning carried out by the system during the recommendation process to ensure the rationality and reliability of the recommendation results.
3. **How to formalize human reasoning**: The paper proposes how to formalize the reasoning process of humans in gene research into a language that machines can understand, thereby achieving automated gene recommendation.

### Main methods and techniques

- **Transformer architecture**: The paper uses a deep-learning model based on Transformer, which captures the relationships between genes through the self-attention mechanism.
- **Embedding space**: The model maps genes to a high-dimensional embedding space, in which the semantic distance between genes can be measured by the inner product.
- **Data augmentation**: To improve the robustness and generalization ability of the model, the paper adopts data augmentation techniques to increase the diversity of training data by generating partial sequences.
- **Validation methods**: The paper evaluates the performance of the model through cross-validation and the Receiver Operating Characteristic (ROC) curve to ensure the performance of the model on different gene sets.
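
The embedding-space, data-augmentation, and validation ideas listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than DP2's actual implementation: the gene labels `G1` to `G5` and their 2-D vectors are toy values, scoring candidates against the centroid of the query set is just one simple way to use an inner product, random subsets stand in for "partial sequences", and the function names `recommend`, `roc_auc`, and `partial_sequences` are hypothetical.

```python
import random


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(embeddings, query_genes, top_k=3):
    """Rank genes outside the query set by inner product with the query centroid.

    Assumption: centroid scoring is an illustrative choice, not DP2's method.
    Returns (top_k ranked gene names, dict of all candidate scores).
    """
    dim = len(next(iter(embeddings.values())))
    centroid = [
        sum(embeddings[g][i] for g in query_genes) / len(query_genes)
        for i in range(dim)
    ]
    scores = {
        g: inner(vec, centroid)
        for g, vec in embeddings.items()
        if g not in query_genes
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


def roc_auc(scores, positives):
    """Area under the ROC curve, computed as the probability that a random
    positive candidate outscores a random negative one (ties count 0.5)."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def partial_sequences(genes, n_samples, rng):
    """Toy augmentation: draw random non-empty subsets of a gene list."""
    return [rng.sample(genes, rng.randint(1, len(genes))) for _ in range(n_samples)]


# Toy embeddings (hypothetical): G1-G3 point one way, G4-G5 another.
embeddings = {
    "G1": [1.0, 0.0],
    "G2": [0.9, 0.1],
    "G3": [0.8, 0.2],
    "G4": [0.0, 1.0],
    "G5": [0.1, 0.9],
}

# Hold out G3 from a "known" set {G1, G2, G3} and check whether it is recovered.
ranked, scores = recommend(embeddings, ["G1", "G2"])
print("ranking:", ranked)
print("AUC for held-out G3:", roc_auc(scores, {"G3"}))
print("augmented inputs:", partial_sequences(["G1", "G2", "G3"], 3, random.Random(0)))
```

Holding out a known member of a gene set and checking how highly it is ranked mirrors, in miniature, the cross-validation idea described above: the score is computed only from the embeddings, with no direct knowledge of which set the held-out gene belongs to.
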

### Application scenarios

- **Disease-related gene recommendation**: By inputting genes known to be related to a certain disease, the model can recommend other genes that may be related to this disease.
- **Pathway-related gene recommendation**: By inputting genes known to participate in a certain biological pathway, the model can recommend other genes that may participate in the same pathway.
- **Research target identification**: Researchers can use this model to discover new research targets and accelerate the progress of life science research.

### Conclusion

DeepProphet2 (DP2) successfully solves the gene recommendation problem by combining deep-learning and natural-language-processing techniques, providing a powerful tool for life-science research. This model has been verified on multiple benchmark data sets and has shown good performance and reliability. By visiting the GeneRecommender website, researchers around the world can use this tool for free.