Data Imbalance in Drug Response Prediction – Multi-Objective Optimization Approach in Deep Learning Setting

Oleksandr Narykov, Yitan Zhu, Thomas Brettin, Yvonne A. Evrard, Alexander Partin, Fangfang Xia, Maulik Shukla, Priyanka Vasanthakumari, James H. Doroshow, Rick L. Stevens
DOI: https://doi.org/10.1101/2024.03.14.585074
2024-03-15
Abstract: Drug response prediction (DRP) methods tackle the complex task of associating the effectiveness of small molecules with the specific genetic makeup of the patient. Anti-cancer DRP is a particularly challenging task requiring costly experiments, as the underlying pathogenic mechanisms are broad and associated with multiple genomic pathways. The scientific community has made significant efforts to generate public drug screening datasets, opening a path for various machine learning (ML) models that attempt to reason over the complex data space of small compounds and biological characteristics of tumors. However, the data depth is still lacking compared to the computer vision and natural language processing domains, limiting current learning capabilities. To combat this issue and increase the generalizability of DRP models, we explore strategies that explicitly address the imbalance in DRP datasets. We reframe the problem as a multi-objective optimization across multiple drugs to maximize deep learning model performance. We implement this approach by constructing the Multi-Objective Optimization Regularized by Loss Entropy (MOORLE) loss function and plugging it into a deep learning model. We demonstrate the utility of the proposed drug discovery methods and make suggestions for further potential applications of the work to promote equitable outcomes in the healthcare field.
Cancer Biology
What problem does this paper attempt to address?
This paper addresses data imbalance in drug response prediction (DRP), particularly in the deep learning setting. DRP is the task of associating the effectiveness of small-molecule drugs with the specific genetic makeup of a patient. Anti-cancer DRP is especially difficult: the underlying mechanisms span multiple genomic pathways, experiments are costly, and the available data remain limited. The authors propose Multi-Objective Optimization Regularized by Loss Entropy (MOORLE), a loss function that treats prediction across multiple drugs as a multi-objective problem and is plugged into a deep learning model. The approach aims to handle imbalance in DRP datasets and improve model generalizability, and the authors suggest further applications that could promote equitable outcomes in healthcare.

The paper also reviews traditional machine learning and deep learning models used for DRP, including random forests, support vector machines, and various neural network architectures. The authors point out that data imbalance can degrade model performance, especially in virtual drug screening, where the model must generalize to small molecules it has not seen during training. Two families of remedies are discussed: sampling methods (undersampling, oversampling, and weighted sampling) and modifications to the learning algorithm (adjusting the loss function or introducing class weights). The paper then frames DRP as a multi-objective optimization problem in which predictive performance is maximized for each drug, so that performance is balanced across drugs. In the experiments, the authors analyze the effects of a mixed sampling strategy and the entropy-regularized loss function in the deep learning model DeepTTA.
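To make the two ideas above more concrete, here is a minimal Python sketch of weighted sampling and of a loss that is regularized by the entropy of per-drug losses. It is an illustration under stated assumptions, not the paper's MOORLE formulation, which this summary does not spell out. The function names, the use of mean squared error, the weight `lam`, and the specific form (mean per-drug loss minus `lam` times the entropy of each drug's share of the total loss) are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def inverse_frequency_weights(drug_ids):
    """Per-sample sampling weights that give every drug the same total weight."""
    counts = Counter(drug_ids)
    n_drugs = len(counts)
    # A drug with many samples gets a small weight per sample, and vice versa.
    return [1.0 / (n_drugs * counts[d]) for d in drug_ids]


def per_drug_mse(drug_ids, y_true, y_pred):
    """Mean squared error computed separately for each drug."""
    sq_errors = defaultdict(list)
    for d, t, p in zip(drug_ids, y_true, y_pred):
        sq_errors[d].append((t - p) ** 2)
    return {d: sum(v) / len(v) for d, v in sq_errors.items()}


def entropy_regularized_loss(drug_ids, y_true, y_pred, lam=0.1, eps=1e-12):
    """Mean per-drug loss minus lam times the entropy of the loss shares.

    ASSUMPTION: this form is illustrative only and is not taken from the paper.
    Higher entropy means the loss is spread evenly across drugs, so subtracting
    it rewards models that do not neglect any single drug.
    """
    losses = list(per_drug_mse(drug_ids, y_true, y_pred).values())
    total = sum(losses) + eps
    shares = [loss / total for loss in losses]
    entropy = -sum(s * math.log(s + eps) for s in shares)
    return sum(losses) / len(losses) - lam * entropy


if __name__ == "__main__":
    drugs = ["A", "A", "A", "B"]
    print(inverse_frequency_weights(drugs))
    print(entropy_regularized_loss(drugs, [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]))
```

With this form, two models with the same average per-drug error are ranked differently: the one whose error is concentrated on a single drug receives the higher loss. In an actual deep learning pipeline the same computation would be written with tensor operations so that gradients can flow through it.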
The authors evaluate the model with cross-validation under both random splits and drug-blind splits, and use analysis of variance to study the impact of the sampling strategy and the loss function. In summary, the paper attempts to improve deep learning models for drug response prediction by combining a multi-objective optimization strategy with explicit handling of data imbalance, with the goal of making drug discovery and personalized treatment more efficient.
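The difference between a random split and a drug-blind split can be sketched in a few lines of Python. In a drug-blind split, whole drugs rather than individual samples are assigned to folds, so every test fold contains only drugs that are absent from the corresponding training set. The function name `drug_blind_folds` and its parameters are hypothetical, and the paper's actual splitting procedure may differ.

```python
import random


def drug_blind_folds(drug_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no drug appears on both sides."""
    unique_drugs = sorted(set(drug_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_drugs)
    # Assign whole drugs (not individual samples) to folds, round-robin.
    fold_of = {d: i % n_folds for i, d in enumerate(unique_drugs)}
    for k in range(n_folds):
        test_idx = [i for i, d in enumerate(drug_ids) if fold_of[d] == k]
        train_idx = [i for i, d in enumerate(drug_ids) if fold_of[d] != k]
        yield train_idx, test_idx


if __name__ == "__main__":
    drugs = ["A", "A", "B", "C", "C", "C", "D"]
    for train_idx, test_idx in drug_blind_folds(drugs, n_folds=2):
        print(sorted({drugs[i] for i in train_idx}), sorted({drugs[i] for i in test_idx}))
```

A random split would instead shuffle sample indices directly, which lets the same drug appear in both training and test data and therefore does not measure generalization to unseen compounds. scikit-learn's `GroupKFold`, with drug identifiers passed as `groups`, provides a comparable grouped split.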