Protein-Ligand Scoring with Convolutional Neural Networks

Matthew Ragoza, Joshua Hochuli, Elisa Idrobo, Jocelyn Sunseri, David Ryan Koes
DOI: https://doi.org/10.1021/acs.jcim.6b00740
2016-12-09
Abstract: Computational approaches to drug discovery can reduce the time and cost associated with experimental assays and enable the screening of novel chemotypes. Structure-based drug design methods rely on scoring functions to rank and predict binding affinities and poses. The ever-expanding amount of protein-ligand binding and structural data enables the use of deep machine learning techniques for protein-ligand scoring. We describe convolutional neural network (CNN) scoring functions that take as input a comprehensive 3D representation of a protein-ligand interaction. A CNN scoring function automatically learns the key features of protein-ligand interactions that correlate with binding. We train and optimize our CNN scoring functions to discriminate between correct and incorrect binding poses and known binders and non-binders. We find that our CNN scoring function outperforms the AutoDock Vina scoring function when ranking poses both for pose prediction and virtual screening.
Machine Learning, Biomolecules
What problem does this paper attempt to address?
This paper addresses a central problem in structure-based drug design: how to score protein-ligand complexes more accurately so that the correct binding mode can be identified (pose prediction) and binders can be distinguished from non-binders (virtual screening). The authors propose using convolutional neural networks (CNNs) as protein-ligand scoring functions that automatically learn the features that correlate with binding, with the aim of improving accuracy on both tasks.

The paper notes that traditional empirical and knowledge-based scoring functions fit a predefined functional form to data such as binding affinity values, whereas machine-learning scoring functions learn both the parameters and the model structure from data, which gives them greater flexibility and expressiveness. That added expressiveness also raises the risk of overfitting, especially when training data are limited. The study therefore aims to apply deep learning, and CNNs in particular, in a way that improves protein-ligand binding prediction while keeping the risk of overfitting in check.

To this end, the authors develop a CNN-based scoring model that takes a comprehensive 3D representation of the protein-ligand interaction as input and learns the binding-relevant features automatically. In a comparison with the AutoDock Vina scoring function, the CNN scoring function performs better when ranking poses for both pose prediction and virtual screening. The paper also explores how decomposing the score into per-atom contributions can produce informative visualizations that help explain what the model has learned.
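To make the idea of a 3D input representation concrete, the following is a minimal sketch of one way a protein-ligand complex could be turned into a grid suitable for a 3D CNN. It is an illustration under stated assumptions, not the authors' implementation: the function name `voxelize`, the Gaussian atom density, the default box size, resolution, and width `sigma`, and the mapping of atoms to integer channels are all hypothetical choices made for this example.

```python
import numpy as np


def voxelize(coords, channels, n_channels, center,
             box_size=24.0, resolution=0.5, sigma=1.0):
    """Turn atoms into a (n_channels, n, n, n) density grid.

    All defaults are illustrative assumptions, not values from the source.
    coords   : (N, 3) atom coordinates in angstroms
    channels : length-N integer channel index per atom (e.g. an atom type)
    center   : (3,) center of the cubic box, e.g. the binding-site center
    """
    n = int(round(box_size / resolution))
    # Voxel-center positions along one axis, relative to the box center.
    axis = (np.arange(n) + 0.5) * resolution - box_size / 2.0
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")

    grid = np.zeros((n_channels, n, n, n), dtype=np.float32)
    rel = np.asarray(coords, dtype=float) - np.asarray(center, dtype=float)
    for (x, y, z), c in zip(rel, channels):
        # Squared distance from this atom to every voxel center.
        d2 = (gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2
        # Assumed density model: an unnormalized Gaussian around the atom.
        grid[c] += np.exp(-d2 / (2.0 * sigma ** 2))
    return grid


# Toy usage: two atoms in different channels on a small 4x4x4 grid.
toy = voxelize(coords=[[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
               channels=[0, 1], n_channels=2, center=[0.0, 0.0, 0.0],
               box_size=4.0, resolution=1.0)
print(toy.shape)
```

A grid of this kind plays the role that an image plays in a 2D CNN, with atom-type channels in place of color channels; a network can then be trained on such grids to separate correct from incorrect poses or binders from non-binders.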