De Novo Protein Design for Novel Folds Using Guided Conditional Wasserstein Generative Adversarial Networks
Mostafa Karimi, Shaowen Zhu, Yue Cao, Yang Shen
DOI: https://doi.org/10.1021/acs.jcim.0c00593
IF: 6.162
2020-09-18
Journal of Chemical Information and Modeling
Abstract: Although massive data is quickly accumulating on protein sequence and structure, there is only a limited number of protein architectural types (or structural folds). This study addresses the following question: how well could one reveal underlying sequence–structure relationships and design protein sequences for an arbitrary, potentially novel, structural fold? In response to the question, we have developed novel deep generative models, namely, semisupervised gcWGAN (guided, conditional, Wasserstein Generative Adversarial Networks). To overcome training difficulties and improve design qualities, we build our models on conditional Wasserstein GAN (WGAN) that uses Wasserstein distance in the loss function. Our major contributions include (1) constructing a low-dimensional and generalizable representation of the fold space for the <i>conditional</i> input, (2) developing an ultrafast sequence-to-fold predictor (or oracle) and incorporating its feedback into WGAN as a loss to <i>guide</i> model training, and (3) exploiting sequence data with and without paired structures to enable a <i>semisupervised</i> training strategy. Assessed by the oracle over 100 novel folds not in the training set, gcWGAN generates more successful designs and covers 3.5 times more target folds compared to a competing data-driven method (cVAE). Assessed by sequence- and structure-based predictors, gcWGAN designs are physically and biologically sound. Assessed by a structure predictor over representative novel folds, including one not even part of basis folds, gcWGAN designs have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. The ultrafast data-driven model is further shown to boost the success of a principle-driven de novo method (RosettaDesign), through generating design seeds and tailoring design space.
In conclusion, gcWGAN explores uncharted sequence space to design proteins by learning generalizable principles from current sequence–structure data. Data, source code, and trained models are available at https://github.com/Shen-Lab/gcWGAN.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c00593. It contains supplementary methods, tables, and figures for the data set, fold representation as conditional input, fold prediction as oracle, GAN models, hyper-parameter tuning, effects of semisupervision, assessing intermediate designs, pipelines for cVAE, Rosetta and RosettaDesign, physical and biological assessment of final designs, and sequence diversity and novelty for case studies (PDF).
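The abstract describes a conditional Wasserstein GAN whose generator is additionally guided by feedback from a sequence-to-fold oracle, but it does not state the loss formulas. The sketch below is only an illustration of how such losses could be combined, not the authors' implementation. The critic and generator terms follow the standard Wasserstein GAN form; the guidance penalty (one minus the oracle's probability for the target fold), the weight `guide_weight`, and all function names are assumptions made for this example. Regularization that a real WGAN needs, such as a Lipschitz constraint on the critic, is left out.

```python
from statistics import fmean


def critic_loss(real_scores, fake_scores):
    """Standard Wasserstein critic loss.

    The critic is trained to score real sequences (for a given fold)
    higher than generated ones, so it minimizes fake minus real.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores, oracle_fold_probs, guide_weight=1.0):
    """Generator loss with a hypothetical oracle-guidance term.

    fake_scores: critic scores for generated sequences.
    oracle_fold_probs: probability the oracle assigns to the target fold
        for each generated sequence (assumed to lie in [0, 1]).
    guide_weight: assumed weight balancing the two terms.
    """
    # Standard WGAN generator term: raise the critic's score on fakes.
    wasserstein_term = -fmean(fake_scores)
    # Assumed guidance penalty: smaller when the oracle agrees
    # that the generated sequence matches the target fold.
    guidance_term = fmean([1.0 - p for p in oracle_fold_probs])
    return wasserstein_term + guide_weight * guidance_term
```

For example, `critic_loss([1.0, 3.0], [0.0, 1.0])` is negative when the critic already separates real from generated sequences, and `generator_loss` decreases as either the critic scores or the oracle probabilities for generated sequences rise.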
chemistry, multidisciplinary, medicinal, computer science, interdisciplinary applications, information systems