Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models

Ben Fauber
2024-06-27
Abstract: We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI). These affinities are central to molecular screening and optimization in drug discovery, yet existing methods struggle to predict them reliably. The authors take pretrained small language models (SLMs) and instruction fine-tune them on domain-specific data, so that the models predict a range of affinity values from only two inputs: the SMILES string of the ligand and the amino acid sequence of the target protein. Compared with existing machine learning (ML) and free-energy perturbation (FEP+) methods, the approach shows a clear improvement in predicting LPI affinities, which the authors suggest can accelerate drug discovery campaigns against challenging therapeutic targets.
The paper also reviews prior work, including machine learning, deep learning, and physics-based methods such as FEP, and highlights their limitations, especially when the affinity data are continuous rather than binary. It then describes how the dataset was constructed and formatted, and how the underlying pretrained language models (such as the OPT series) were fine-tuned. Increasing the number of training instances improves model performance and prediction accuracy across the different affinity values. Overall, the results suggest that SLMs fine-tuned with domain-specific instructions can effectively predict LPI affinities and can serve as a useful tool for drug development.
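To make the input format concrete, the following is a minimal sketch of how a ligand-protein pair might be turned into an instruction-style training record and how a numeric prediction might be read back from generated text. The prompt wording, the field names, and the helpers `format_lpi_example` and `parse_affinity` are all assumptions made for illustration; the summary above does not give the paper's actual template, and this should not be read as the authors' implementation.

```python
import re
from typing import Optional

# Hypothetical instruction template. The paper's real prompt wording is not
# stated in this summary, so every label below is an illustrative assumption.
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Predict the {affinity_type} of the ligand for the protein.\n"
    "### Ligand SMILES:\n{smiles}\n"
    "### Protein sequence:\n{sequence}\n"
    "### Response:\n"
)


def format_lpi_example(
    smiles: str,
    sequence: str,
    affinity_type: str,
    affinity_value: Optional[float] = None,
) -> dict:
    """Build one prompt/completion record from the two model inputs.

    With affinity_value set, the record can serve as a fine-tuning example.
    With affinity_value left as None, the completion is empty and the prompt
    can be used for inference.
    """
    prompt = PROMPT_TEMPLATE.format(
        affinity_type=affinity_type,
        smiles=smiles.strip(),
        sequence=sequence.strip().upper(),  # one-letter amino acid codes
    )
    completion = "" if affinity_value is None else f"{affinity_value:.2f}"
    return {"prompt": prompt, "completion": completion}


def parse_affinity(generated_text: str) -> Optional[float]:
    """Return the first number found in the generated text, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", generated_text)
    return float(match.group()) if match else None


# Example values are placeholders, not data from the paper.
record = format_lpi_example(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    sequence="mktayiak",
    affinity_type="pKi",
    affinity_value=6.5,
)
print(record["prompt"] + record["completion"])
```

Keeping the affinity as plain text in the completion means a generative model can be fine-tuned on continuous values without adding a regression head, at the cost of needing a parsing step such as `parse_affinity` on the output.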