UNNT: A novel Utility for comparing Neural Net and Tree-based models

Vineeth Gutta, Satish Ranganathan Ganakammal, Sara Jones, Matthew Beyers, Sunita Chandrasekaran
DOI: https://doi.org/10.1371/journal.pcbi.1011504
2024-04-30
PLoS Computational Biology
Abstract: The use of deep learning (DL) is steadily gaining traction in scientific challenges such as cancer research. Advances in data generation, machine learning algorithms, and compute infrastructure have accelerated the use of deep learning in various domains of cancer research, such as drug response problems. In our study, we explored tree-based models to improve the accuracy of a single drug response model and demonstrate that tree-based models such as XGBoost (eXtreme Gradient Boosting) have advantages over deep learning models, such as a convolutional neural network (CNN), for single drug response problems. However, comparing models is not a trivial task. To make training and comparing CNNs and XGBoost more accessible to users, we developed an open-source library called UNNT (A novel Utility for comparing Neural Net and Tree-based models). The case studies in this manuscript focus on cancer drug response datasets; however, the application can be used on datasets from other domains, such as chemistry.
Advances in data science, machine learning (ML), and artificial intelligence (AI) methods have enabled the extraction of meaningful information from large and complex datasets, which has assisted in better understanding, diagnosing, and treating cancer. Understanding of the drug response domain in cancer research has been accelerated by the development of ML models that help predict the effectiveness of drugs based on a specific genomic molecular feature. In this study, we developed a novel, robust framework called UNNT (A novel Utility for comparing Neural Net and Tree-based models) that trains and compares a deep learning method such as a CNN and a tree-based method such as XGBoost on a user-supplied dataset. We applied this software to the single drug response problem in cancer to identify the best-performing ML method based on the National Cancer Institute 60 (NCI60) dataset.
In addition, we studied the computational aspects of training each of these models; our results show that neither model is clearly superior on both CPUs and GPUs during training. This shows that when both models have similar error rates for a dataset, the available hardware determines the model choice for training.
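The kind of comparison described above (train each model on the same data, record its error and its training time, then let training cost break the tie when errors are similar) can be sketched as a small harness. This is an illustrative sketch only, not UNNT's actual API: `compare_models`, `pick_model`, the 5% RMSE tolerance, and the two stand-in models are all hypothetical names and choices introduced here. The harness assumes each model exposes `fit(X, y)` and `predict(X)`, so a real tree-based or neural network regressor with that interface could be passed in instead of the stand-ins.

```python
import math
import random
import time


class MeanBaseline:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class OneFeatureLinear:
    """Hypothetical stand-in model: least-squares line on the first feature."""

    def fit(self, X, y):
        xs = [row[0] for row in X]
        mx, my = sum(xs) / len(xs), sum(y) / len(y)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (t - my) for x, t in zip(xs, y))
        self.slope_ = sxy / sxx if sxx else 0.0
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def compare_models(models, X_train, y_train, X_test, y_test):
    """Train every model on the same split; record test error and training time."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        train_seconds = time.perf_counter() - start
        pred = model.predict(X_test)
        results[name] = {
            "rmse": rmse(y_test, pred),
            "r2": r2(y_test, pred),
            "train_seconds": train_seconds,
        }
    return results


def pick_model(results, rmse_tolerance=0.05):
    """Among models within the (assumed) tolerance of the best RMSE, pick the fastest to train."""
    best = min(r["rmse"] for r in results.values())
    close = {n: r for n, r in results.items() if r["rmse"] <= best * (1 + rmse_tolerance)}
    return min(close, key=lambda n: close[n]["train_seconds"])


# Synthetic regression data (an assumption for illustration; not NCI60).
rng = random.Random(0)
X = [[rng.uniform(-1.0, 1.0)] for _ in range(200)]
y = [3.0 * row[0] + rng.gauss(0.0, 0.1) for row in X]
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

results = compare_models(
    {"mean": MeanBaseline(), "linear": OneFeatureLinear()},
    X_train, y_train, X_test, y_test,
)
for name, metrics in results.items():
    print(name, metrics)
print("selected:", pick_model(results))
```

Timing only the `fit` call keeps the error and cost measurements separate, so the same harness can be rerun on different hardware and the `train_seconds` values compared directly.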
biochemical research methods, mathematical & computational biology