CLDR: Contrastive Learning Drug Response Models from Natural Language Supervision

Kun Li, Wenbin Hu
2023-12-17
Abstract: Deep learning-based drug response prediction (DRP) methods can accelerate the drug discovery process and reduce R&D costs. Although mainstream methods achieve high accuracy in predicting response regression values, their regression-aware representations are fragmented and fail to capture the continuity of the sample order. As a result, models are optimized into sub-optimal solution spaces, which reduces generalization ability and may lead to significant wasted costs in the drug discovery phase. In this paper, we propose CLDR, a contrastive learning framework with natural language supervision for DRP. CLDR converts regression labels into text, which is merged with the caption text of the drug response to form a second modality of each sample, alongside the traditional modalities (graph, sequence). In each batch, the two modalities of one sample are treated as a positive pair and all other pairings are treated as negative pairs. In addition, to enhance the continuous representation capability of the numerical text, a common-sense numerical knowledge graph is introduced. We validated the framework on several hundred thousand samples from the Genomics of Drug Sensitivity in Cancer dataset and observed that the average improvement of DRP methods ranges from 7.8% to 31.4% when our framework is applied. The experiments show that CLDR effectively constrains the samples to a continuous distribution in the representation space and achieves impressive prediction performance with only a few epochs of fine-tuning after pre-training. The code is available at: https://gitee.com/xiaoyibang/clipdrug.git.
Biomolecules, Artificial Intelligence, Machine Learning, Molecular Networks
What problem does this paper attempt to address?
The paper addresses a weakness in how existing drug response prediction (DRP) methods learn representations for regression, in particular their poor generalization under zero-shot conditions. Traditional DRP methods predict the response of cell lines to drugs well, but their performance drops significantly on unseen compounds. The reason is that they do not capture the inherent order of continuous numerical values, so the model is optimized into a sub-optimal solution space. This hurts both generalization and cost-effectiveness in practical applications.

To solve these problems, the authors propose CLDR (Contrastive Learning Drug Response Models from Natural Language Supervision), a contrastive learning framework combined with natural language supervision. The main contributions of CLDR include:

1. **Connecting drug response data with annotated text**: Continuous numerical labels are converted into natural language text and combined with the text describing the drug response process to form the second modality of each sample, which strengthens the model's understanding and representation of continuous numerical values.
2. **Introducing a common-sense numerical knowledge graph**: Based on the definition of ordinal numbers, a common-sense numerical knowledge graph (CN-KG) is constructed to enhance the model's perception of numerical continuity.
3. **Improving zero-shot generalization**: Through contrastive learning strategies in the pre-training and fine-tuning stages, CLDR maps samples into a continuously distributed representation space, improving prediction performance under zero-shot conditions.

The experimental results show that CLDR significantly improves prediction performance across multiple DRP methods. Under zero-shot conditions in particular, the average improvement ranges from 7.8% to 31.4%. These results support the effectiveness of CLDR and suggest that it can improve preclinical screening efficiency and success rates in the drug discovery process.
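The in-batch pairing described above (the two modalities of one sample form a positive pair, all other pairings are negatives) can be illustrated with a small sketch. This is not the authors' implementation: the caption template in `label_to_text`, the function names, the temperature value, and the choice of a symmetric CLIP-style cross-entropy loss are all assumptions made for illustration, and real embeddings would come from trained drug/cell-line and text encoders rather than hand-written vectors.

```python
import math


def label_to_text(value, decimals=2):
    """Render a regression label as text (hypothetical template, not from the paper)."""
    return f"The drug response value is {value:.{decimals}f}."


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(sample_emb, text_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss over two modalities.

    sample_emb[i] and text_emb[i] are the two views of sample i (positive pair);
    every other (i, j) combination in the batch acts as a negative pair.
    The temperature of 0.07 is an assumed default, not a value from the paper.
    """
    a = [_normalize(v) for v in sample_emb]
    b = [_normalize(v) for v in text_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-sample direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


if __name__ == "__main__":
    print(label_to_text(0.4567))
    samples = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(samples, aligned_text))
    print(symmetric_contrastive_loss(samples, shuffled_text))
```

In this toy batch the loss is near zero when each sample embedding matches its own text embedding and large when the texts are swapped, which is the pressure that pulls the two modalities of a sample together during pre-training.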