A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning

Austin Clyde, Tom Brettin, Alexander Partin, Maulik Shaulik, Hyunseung Yoo, Yvonne Evrard, Yitan Zhu, Fangfang Xia, Rick Stevens
DOI: https://doi.org/10.48550/arXiv.2005.00095
2020-05-04
Abstract: By combining various cancer cell line (CCL) drug screening panels, the size of the data has grown significantly to begin understanding how advances in deep learning can advance drug response predictions. In this paper we train >35,000 neural network models, sweeping over common featurization techniques. We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features. We found the inclusion of single nucleotide polymorphisms (SNPs) coded as count matrices improved model performance significantly, and no substantial difference in model performance with respect to molecular featurization between the common open source Mordred descriptors and Dragon7 descriptors. Alongside this analysis, we outline data integration between CCL screening datasets and present evidence that new metrics and imbalanced data techniques, as well as advances in data standardization, need to be developed.
Machine Learning, Genomics, Quantitative Methods
What problem does this paper attempt to address?
This paper aims to solve the problem of how to predict the sensitivity of cancer cells to drugs through deep-learning models. Specifically, the researchers hope to improve the performance of deep-learning models in predicting cancer drug sensitivity by systematically exploring different featurization techniques.

#### Main research questions include:

1. **Selection of feature representation**:
   - The researchers trained more than 35,000 neural network models and tried a variety of common feature representation methods. They paid special attention to the influence of RNA-seq data, single-nucleotide polymorphisms (SNPs), and molecular descriptors (such as Mordred and Dragon7 descriptors).
   - The results show that RNA-seq data is highly redundant and informative even when the subset has more than 128 features. In addition, encoding SNPs as a count matrix significantly improves model performance.
2. **Model architecture and hyperparameter optimization**:
   - The study found that the hyperparameters of deep-learning models (such as optimizer, model architecture, training strategy, and dropout) are more important for model performance than specific feature representations.
   - Nevertheless, feature selection still has its value, especially under certain specific validation methods (for example, the introduction of SNPs significantly improves the RMSE and r² scores).
3. **Data standardization and imbalanced data processing**:
   - The researchers emphasize the need to develop new metrics and imbalanced data processing techniques to deal with the differences between different data sources.
   - For example, using generative adversarial networks (GANs) or batch-effect correction methods to process data from different sources.
4. **Consistency analysis across datasets**:
   - To ensure the generalization ability of the model, the researchers integrated multiple cancer cell line screening datasets (such as GDSC, NCI-60, CCLE, etc.) and carried out performance evaluations across datasets.
   - The results show that there are differences in performance on different datasets, especially in drug validation, where the model's performance is less stable than in cell validation.

#### Formula presentation:

- **Conversion formula from FPKM to log TPM** (the sum runs over all FPKM values):
  \[ \log(\text{TPM}_{g})=\log\left(\frac{\text{FPKM}_{g}\times 10^{6}}{\sum_{j}\text{FPKM}_{j}}\right) \]
- **Root-mean-square error (RMSE)**:
  \[ \text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}} \]
- **Coefficient of determination (r²)**:
  \[ r^{2}=1-\frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}} \]

Through these studies, the paper provides important guidance for future cancer drug sensitivity prediction, especially in terms of feature selection, model architecture design, and data processing.
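
The three formulas above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the logarithm is assumed to be the natural log (the summary does not state a base), and the `pseudocount` parameter is an added assumption so that genes with zero expression do not produce `log(0)`. Setting `pseudocount=0.0` reproduces the conversion formula exactly for strictly positive FPKM values.

```python
import math
from typing import List, Sequence


def fpkm_to_log_tpm(fpkm: Sequence[float], pseudocount: float = 1.0) -> List[float]:
    """Convert one sample's FPKM values to log(TPM).

    TPM_g = FPKM_g * 1e6 / sum_j FPKM_j, then a natural log is taken.
    `pseudocount` is an assumption added here (not in the formula above);
    with pseudocount=0.0, a zero FPKM value raises ValueError from math.log.
    """
    total = sum(fpkm)
    if total <= 0:
        raise ValueError("FPKM values must sum to a positive number")
    return [math.log(v * 1e6 / total + pseudocount) for v in fpkm]


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted responses."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("r^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy values for illustration only; they are not data from the paper.
    print(fpkm_to_log_tpm([1.0, 3.0, 0.0, 6.0]))
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Note that r² defined this way can be negative when a model predicts worse than the mean of the observed values, which is worth keeping in mind when comparing scores across validation schemes.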