A comprehensive benchmarking of machine learning algorithms and dimensionality reduction methods for drug sensitivity prediction

Lea Eckhart, Kerstin Lenhof, Lisa-Marie Rolli, Hans-Peter Lenhof
DOI: https://doi.org/10.1093/bib/bbae242
IF: 9.5
2024-05-29
Briefings in Bioinformatics
Abstract: A major challenge of precision oncology is the identification and prioritization of suitable treatment options based on molecular biomarkers of the considered tumor. In pursuit of this goal, large cancer cell line panels have successfully been studied to elucidate the relationship between cellular features and treatment response. Due to the high dimensionality of these datasets, machine learning (ML) is commonly used for their analysis. However, choosing a suitable algorithm and set of input features can be challenging. We performed a comprehensive benchmarking of ML methods and dimension reduction (DR) techniques for predicting drug response metrics. Using the Genomics of Drug Sensitivity in Cancer cell line panel, we trained random forests, neural networks, boosting trees and elastic nets for 179 anti-cancer compounds with feature sets derived from nine DR approaches. We compare the results regarding statistical performance, runtime and interpretability. Additionally, we provide strategies for assessing model performance compared with a simple baseline model and measuring the trade-off between models of different complexity. Lastly, we show that complex ML models benefit from using an optimized DR strategy, and that standard models, even when using considerably fewer features, can still be superior in performance.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
The main problem this paper addresses is how to predict drug sensitivity with machine-learning methods in precision oncology. Specifically, the researchers ask how to select suitable algorithms and feature sets from high-dimensional datasets in order to predict the response of cancer cell lines to drugs. Because these datasets usually contain large amounts of gene expression data and other multi-omics data, machine-learning techniques are needed to analyze them, yet choosing a suitable algorithm and dimension-reduction method is itself difficult. Through a comprehensive benchmark, the paper evaluates how different machine-learning algorithms and dimension-reduction techniques perform in drug-sensitivity prediction, with the aim of identifying the best combination.

### Research Background

1. **Challenges in Precision Oncology**: One of the main goals of precision oncology is to identify and prioritize suitable treatment options based on the molecular biomarkers of tumors.
2. **Large-Scale Datasets**: Large cancer cell-line panels (such as GDSC and CCLE) provide multi-omics measurements and drug-response metrics for many cancer types, which can be used to study the relationship between cellular features and treatment response.
3. **High-Dimensional Data**: Because these datasets are high-dimensional, machine-learning methods are usually required for their analysis, but selecting a suitable algorithm and set of input features is a challenge.

### Research Objectives

- **Evaluate Different Machine-Learning Algorithms and Dimension-Reduction Techniques**: Benchmark machine-learning algorithms (random forests, neural networks, boosting trees, and elastic nets) for drug-sensitivity prediction.
- **Selection of Dimension-Reduction Techniques**: Evaluate the effect of dimension-reduction techniques such as principal component analysis (PCA) and autoencoders.
- **Performance Comparison**: Compare the methods in terms of statistical performance, runtime, and interpretability.
- **Evaluation Strategies**: Provide strategies for assessing model performance against a simple baseline model and for measuring the trade-off between models of different complexity.

### Methods

1. **Datasets**: Use the gene expression values and drug-response metrics (IC50 values) from the GDSC database.
2. **Model Training**: For 179 anti-cancer compounds, train more than 16 million models by combining four machine-learning algorithms with feature sets from nine dimension-reduction techniques.
3. **Performance Evaluation**: Determine the best hyperparameters through cross-validation (CV) and evaluate model performance on the test set.

### Results

- **Best Performance**: The elastic net shows the best performance and the lowest runtime for most drugs, while the neural network performs worst.
- **Dimension-Reduction Techniques**: PCA and the heuristic based on minimum redundancy and maximum relevance (MRMR) are the most effective dimension-reduction techniques.
- **Feature Selection**: Feature-selection methods that take drug response into account perform better than methods that use only expression values.

### Conclusions

- **Selecting Appropriate Algorithms and Dimension-Reduction Methods**: The choice of machine-learning algorithm and dimension-reduction technique is crucial for drug-sensitivity prediction.
- **Effectiveness of Simple Models**: Even with considerably fewer features, standard models can still outperform complex models.
- **Optimization of Complex Models**: Complex prediction models benefit from an optimized dimension-reduction strategy.
Through this benchmark, the authors aim to provide more reliable and efficient approaches to drug-sensitivity prediction in precision oncology.
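The evaluation procedure summarized above (reduce dimensionality, tune hyperparameters by cross-validation, then compare test-set error with a simple baseline) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' setup: scikit-learn as the toolkit, the synthetic data standing in for one drug's expression matrix and response values, the hyperparameter grids, and mean squared error as the metric. It covers only one of the benchmarked combinations, PCA followed by an elastic net.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one drug (NOT GDSC data): 200 "cell lines" x 500
# "genes" driven by 5 latent factors; the response depends on those factors.
rng = np.random.default_rng(0)
n_lines, n_genes, n_latent = 200, 500, 5
Z = rng.normal(size=(n_lines, n_latent))
W = rng.normal(size=(n_latent, n_genes))
X = Z @ W + 0.5 * rng.normal(size=(n_lines, n_genes))
y = Z @ np.array([2.0, -1.5, 1.0, 0.5, -0.5]) + 0.3 * rng.normal(size=n_lines)

# Hold out a test set that is never seen during hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling and PCA live inside the pipeline, so they are refit on the
# training part of every CV fold and no test information leaks into them.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("enet", ElasticNet(max_iter=10_000)),
])

# Hypothetical grid, chosen for illustration only.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "enet__alpha": [0.01, 0.1, 1.0],
    "enet__l1_ratio": [0.2, 0.8],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# Simple baseline: always predict the mean response of the training set.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

model_mse = mean_squared_error(y_test, search.predict(X_test))
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
improvement = 1.0 - model_mse / baseline_mse  # 0 means no better than the mean

print("best hyperparameters:", search.best_params_)
print(f"test MSE (PCA + elastic net): {model_mse:.3f}")
print(f"test MSE (mean baseline):     {baseline_mse:.3f}")
print(f"relative improvement:         {improvement:.3f}")
```

Reporting the error relative to the mean predictor, rather than the raw error alone, makes results comparable across drugs whose response values have different variance. In a full benchmark the same loop would be repeated per drug and per combination of algorithm and dimension-reduction method.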