Abstract: Deep learning models adapted from natural language processing offer new opportunities for the prediction of active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, given the versatility of language models, in principle, for translating different types of textual representations, off-the-beaten-path design tasks might be explored. In this work, we have investigated generative design of active compounds with desired potency from target sequence embeddings, representing a rather provocative prediction task. For this purpose, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. The designated "biochemical" language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds as well as more potent compounds than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, rendering the approach attractive for further exploration and practical applications.
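
To make the idea of a potency value-conditioned target sequence embedding more concrete, the following is a minimal sketch of how a conditioning vector might be assembled before it is passed to a conditional transformer. It is not the authors' implementation: the abstract only states that protein sequence and potency value embeddings are combined. The concatenation, the sinusoidal potency encoding, and the names `potency_embedding` and `build_condition` are all assumptions made for illustration, and the toy vector stands in for the output of a protein language model.

```python
import math
from typing import List


def potency_embedding(potency: float, dim: int = 8) -> List[float]:
    """Encode a scalar potency value (e.g. a pKi) as a fixed-length vector.

    Assumption: a sinusoidal encoding is used here purely for illustration;
    the source does not specify how potency values are embedded.
    """
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    features: List[float] = []
    for i in range(dim // 2):
        # Frequencies decrease geometrically across feature pairs.
        freq = 1.0 / (10.0 ** (2 * i / dim))
        features.append(math.sin(potency * freq))
        features.append(math.cos(potency * freq))
    return features


def build_condition(protein_embedding: List[float],
                    potency: float,
                    potency_dim: int = 8) -> List[float]:
    """Combine a target sequence embedding with a potency embedding.

    Assumption: the two embeddings are combined by simple concatenation.
    """
    return list(protein_embedding) + potency_embedding(potency, potency_dim)


# Toy stand-in for a target sequence embedding from a protein language model.
protein_vec = [0.1, -0.3, 0.7, 0.2]

# Same target, two different desired potency values.
condition_high = build_condition(protein_vec, potency=8.5)
condition_low = build_condition(protein_vec, potency=5.0)
```

In this sketch, the same target yields different conditioning vectors for different desired potency values, which is the property a conditional generator would need in order to steer compound generation toward a requested potency.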

Language models can identify enzymatic binding sites in protein sequences

Identification of Enzymatic Active Sites with Unsupervised Language Modeling

Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files

Atom-by-atom protein generation and beyond with language models

Mapping the space of protein binding sites with sequence-based protein language models

BAPULM: Binding Affinity Prediction using Language Models

Language models in molecular discovery

Structure-informed Language Models Are Protein Designers

Language models generalize beyond natural proteins

Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures

Learning the protein language: Evolution, structure, and function

Hybrid protein-ligand binding residue prediction with protein language models: Does the structure matter?

Structure-Informed Protein Language Model

Advances in the Application of Protein Language Modeling for Nucleic Acid Protein Binding Site Prediction

A language model assistant for biocatalysis

Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model

Large-scale chemical language representations capture molecular structure and properties

Exploiting Pretrained Biochemical Language Models for Targeted Drug Design

Protein language models are performant in structure-free virtual screening

Learning the language of proteins and predicting the impact of mutations

Designing proteins with language models