Abstract: Deep learning models adapted from natural language processing offer new opportunities for the prediction of active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, given the principal versatility of language models for translating different types of textual representations, off-the-beaten-path design tasks might be explored. In this work, we have investigated generative design of active compounds with desired potency from target sequence embeddings, representing a rather provoking prediction task. Therefore, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. To this end, the designated "biochemical" language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds as well as more potent compounds than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, rendering the approach attractive for further exploration and practical applications.

What problem does this paper attempt to address?

The main objective of this paper is to propose a method for designing compounds with a desired potency from target protein sequences, using a multimodal biochemical language model. Specifically, the authors address the following issues:

1. **Combining protein and compound information**: By integrating a protein language model (PLM) with a chemical language model (CLM), they built an architecture that predicts compounds with the desired potency from potency-conditioned protein sequence embeddings.
2. **Designing compounds with predefined potency**: They developed a dual-component conditional language model that learns from protein sequence embeddings and generates new active compounds for a desired potency value.
3. **Validating the method's effectiveness**: They evaluated the model across different activity classes, focusing on whether it can reproduce known active compounds, including highly potent ones, from protein sequence information.

To achieve these goals, the authors employed the following methods:

- Using a pre-trained protein language model (e.g., ProtT5XLUniref50) to generate embeddings of protein sequences.
- Developing a conditional transformer that learns to map protein sequence embeddings and desired potency values to the corresponding active compounds.
- Evaluating the model on compound test sets that are structurally distinct from the fine-tuning sets, in order to assess its ability to generalize.

The results reported in the paper indicate that the multimodal biochemical language model reproduced known compounds with different potency for all activity classes. This provides proof of concept for the method and shows that potency conditioning has a substantial effect on compound generation: the conditional model reproduced more known compounds, and more potent ones, than an unconditional model.
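The conditioning and evaluation steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. Mean pooling of per-residue embeddings, the potency scaling range, and the function names `mean_pool`, `build_condition`, and `reproduced` are all assumptions made for illustration. The comparison of sampled and known compounds also assumes that both are already given as canonical strings (e.g., canonical SMILES).

```python
from statistics import fmean


def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length protein vector.

    Assumption: mean pooling is only one possible way to obtain a
    per-protein embedding from a protein language model.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    return [fmean(column) for column in zip(*residue_embeddings)]


def build_condition(residue_embeddings, potency, potency_range=(5.0, 11.0)):
    """Concatenate a pooled protein embedding with a scaled potency value.

    Assumption: potency is a logarithmic value (such as pKi) scaled to
    [0, 1] over a hypothetical range; the paper may encode it differently.
    """
    low, high = potency_range
    scaled = (potency - low) / (high - low)
    return mean_pool(residue_embeddings) + [scaled]


def reproduced(sampled, known):
    """Return sampled compounds that exactly match known test compounds.

    Assumption: all strings are canonical, so equal compounds compare equal.
    """
    return set(sampled) & set(known)
```

In this sketch, the output of `build_condition` plays the role of the input to the conditional transformer, and `reproduced` mirrors the evaluation criterion of recovering known test compounds among the sampled candidates.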

Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model
