nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder

Maksim Kuznetsov, Airat Valiev, Alex Aliper, Daniil Polykovskiy, Elena Tutubalina, Rim Shayakhmetov, Zulfat Miftahutdinov
2024-10-12
Abstract: Recent advancements have integrated Language Models (LMs) into drug discovery pipelines. However, existing models mostly work with SMILES and SELFIES chemical string representations, which lack spatial features vital for drug discovery. Additionally, attempts to translate chemical 3D structures into text format encounter issues such as excessive length and insufficient atom connectivity information. To address these issues, we introduce nach0-pc, a model combining a domain-specific encoder and a textual representation to handle the spatial arrangement of atoms effectively. Our approach utilizes a molecular point cloud encoder for concise and order-invariant structure representation. We introduce a novel pre-training scheme for molecular point clouds to distill the knowledge from spatial molecular structure datasets. After fine-tuning within both single-task and multi-task frameworks, nach0-pc demonstrates performance comparable with that of diffusion models in terms of generated sample quality across several established spatial molecular generation tasks. Notably, our model is a multi-task approach, in contrast to diffusion models, which are limited to single tasks. Additionally, it is capable of processing point cloud-related data, which language models are not capable of handling due to memory limitations. As a result, our model has reduced training and inference time while maintaining on-par performance.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses the challenges that existing language models face when dealing with 3D molecular structures in drug discovery. Specifically, it points out two problems:

1. **Most existing models rely on chemical string representations such as SMILES and SELFIES**: Although these representations capture the structure of the molecular graph, they lack the spatial features that are crucial for drug discovery. For example, they cannot accurately describe the spatial arrangement and interaction of atoms.
2. **Methods that convert chemical 3D structures into text formats have flaws**: The text formats these methods produce (such as PDB, CIF, and XYZ) are usually too long, requiring dozens of tokens to represent each atom, and they lack information on atomic connectivity. In addition, reconstructing chemical bonds from these formats relies on external software tools that are very sensitive to small errors in atomic positions, which can lead to an incorrectly reconstructed chemical graph or a broken molecule.

To solve these problems, the authors introduce nach0-pc, a multi-task language model that combines a domain-specific encoder with a textual representation. The main contributions are:

1. **Proposing the novel nach0-pc model**: A standard encoder-decoder language model is enhanced with a specialized molecular point cloud encoder and dedicated tokens, so that it can effectively handle 3D molecular structures as both input and output.
2. **Introducing a new pre-training method**: Entire sub-fragments are dropped out, and the model is trained to predict the missing parts of the resulting incomplete molecular point clouds. These incomplete point clouds are produced through fragment omission and/or blurring strategies.
3. **Demonstrating the performance of the model through extensive experiments**: On six spatial molecular generation tasks, nach0-pc performs better than or comparably to the baseline language models and the state-of-the-art diffusion models.

In conclusion, this paper aims to improve the efficiency and accuracy of language models in handling 3D molecular structures by introducing the nach0-pc model, thereby better supporting the drug discovery process.
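
To make the fragment-based pre-training idea more concrete, here is a minimal sketch of how an incomplete point cloud and its reconstruction target might be built. It is not the authors' implementation. The function name `corrupt_point_cloud`, the `(element, (x, y, z))` atom layout, the per-atom fragment labels, and the toy coordinates are all hypothetical, and "blurring" is interpreted here as Gaussian jitter on the fragment's coordinates, which is an assumption about what the paper means.

```python
import random


def corrupt_point_cloud(atoms, fragment_ids, hidden_fragment, mode="omit",
                        noise_std=0.5, seed=0):
    """Build an (incomplete input, target) pair from a labelled point cloud.

    atoms:           list of (element, (x, y, z)) tuples (hypothetical layout)
    fragment_ids:    one fragment label per atom
    hidden_fragment: label of the fragment to corrupt
    mode:            "omit" removes the fragment from the input;
                     "blur" keeps it but jitters its coordinates
                     (an assumed reading of "blurring")
    """
    if mode not in ("omit", "blur"):
        raise ValueError("mode must be 'omit' or 'blur'")
    rng = random.Random(seed)
    visible, target = [], []
    for (element, xyz), frag in zip(atoms, fragment_ids):
        if frag != hidden_fragment:
            # Atoms outside the chosen fragment are passed through unchanged.
            visible.append((element, xyz))
            continue
        # The original atom is what the model would be asked to predict.
        target.append((element, xyz))
        if mode == "blur":
            noisy = tuple(c + rng.gauss(0.0, noise_std) for c in xyz)
            visible.append((element, noisy))
    return visible, target


# Toy three-atom cloud with made-up coordinates; fragment 1 is the oxygen.
atoms = [("C", (0.0, 0.0, 0.0)), ("C", (1.5, 0.0, 0.0)), ("O", (2.2, 1.2, 0.0))]
fragment_ids = [0, 0, 1]

omit_input, omit_target = corrupt_point_cloud(atoms, fragment_ids, 1, mode="omit")
blur_input, blur_target = corrupt_point_cloud(atoms, fragment_ids, 1, mode="blur")
```

In this sketch, omission shrinks the input to the two carbon atoms, while blurring keeps all three atoms but perturbs the oxygen position; in both cases the target is the original oxygen atom. How fragments are chosen and how the target is serialized for the decoder are left out, since the summary above does not specify them.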