Beyond Chemical Language: A Multimodal Approach to Enhance Molecular Property Prediction

Eduardo Soares, Emilio Vital Brazil, Karen Fiorela Aquino Gutierrez, Renato Cerqueira, Dan Sanders, Kristin Schmidt, Dmitry Zubarev
2023-06-22
Abstract: We present a novel multimodal language model approach for predicting molecular properties by combining chemical language representation with physicochemical features. Our approach, MULTIMODAL-MOLFORMER, utilizes a causal multistage feature selection method that identifies physicochemical features based on their direct causal effect on a specific target property. These causal features are then integrated with the vector space generated by molecular embeddings from MOLFORMER. In particular, we employ Mordred descriptors as physicochemical features and identify the Markov blanket of the target property, which theoretically contains the most relevant features for accurate prediction. Our results demonstrate a superior performance of our proposed approach compared to existing state-of-the-art algorithms, including the chemical language-based MOLFORMER and graph neural networks, in predicting complex tasks such as biodegradability and PFAS toxicity estimation. Moreover, we demonstrate the effectiveness of our feature selection method in reducing the dimensionality of the Mordred feature space while maintaining or improving the model's performance. Our approach opens up promising avenues for future research in molecular property prediction by harnessing the synergistic potential of both chemical language and physicochemical features, leading to enhanced performance and advancements in the field.
Chemical Physics, Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper aims to address the following main issues:

### Main Issues

1. **Combining Chemical Language Representations with Physicochemical Features to Enhance Molecular Property Prediction:**
   - The paper proposes a new multimodal language model approach (MULTIMODAL-MOLFORMER) that integrates chemical language representations and physicochemical features to improve the accuracy of molecular property predictions.
   - The method uses a causal multistage feature selection procedure to identify physicochemical features that have a direct causal effect on a specific target property, then combines them with the molecular embedding vector space generated by MOLFORMER.
2. **Addressing the Scarcity of Labeled Data:**
   - In molecular property prediction tasks, such as the biodegradability of organic molecules and the toxicity estimation of per- and polyfluoroalkyl substances (PFAS), the scarcity of labeled data is a significant challenge.
   - The study partially addresses this issue by pre-training on unlabeled data and then fine-tuning on the downstream tasks.
3. **Overcoming the Limitations of Existing Chemical Language Models:**
   - Although chemical language models (such as MOLFORMER) perform well in certain respects, they may have limitations on complex tasks, such as a lack of sensitivity to molecular topology.
   - The study compensates for these limitations by introducing physicochemical features, further enhancing model performance.

### Specific Problem Solutions

- **Proposed Method**: MULTIMODAL-MOLFORMER combines chemical language representations (based on SMILES strings) with physicochemical features (Mordred descriptors) and determines the most relevant features through causal analysis.
- **Feature Selection**: A multistage causal feature selection method is employed; it identifies the most relevant physicochemical features by finding the Markov blanket of the target property.
- **Experimental Validation**: The proposed method is evaluated on two specific tasks: PFAS toxicity estimation and prediction of the biodegradability of general compounds.

### Summary

The main contribution of this paper is a new multimodal language model, MULTIMODAL-MOLFORMER, which combines chemical language representations with physicochemical features for molecular property prediction. This is particularly beneficial for tasks with limited labeled data. According to the reported experiments, the method outperforms existing state-of-the-art algorithms, including the chemical language-based MOLFORMER and graph neural networks, on complex prediction tasks.
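
### Illustrative Sketch

The two steps summarized above (select a small set of descriptors, then join them to the learned embedding) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the greedy correlation filter is a simple stand-in for the paper's causal Markov blanket search, the random arrays stand in for real MOLFORMER embeddings and Mordred descriptors, and the names `select_features` and `fuse` are hypothetical.

```python
import numpy as np


def select_features(X, y, k=2, redundancy=0.95):
    """Greedy relevance/redundancy filter.

    A simple stand-in for feature selection, NOT a Markov blanket
    discovery algorithm: rank descriptors by absolute Pearson
    correlation with the target and skip near-duplicates.
    """
    n_features = X.shape[1]
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    selected = []
    for j in np.argsort(-relevance):  # most relevant first
        # Keep j only if it is not nearly collinear with a chosen descriptor.
        if all(
            abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) < redundancy
            for s in selected
        ):
            selected.append(int(j))
        if len(selected) == k:
            break
    return selected


def fuse(embeddings, descriptors, selected):
    """Standardize the chosen descriptors and append them to the embeddings."""
    chosen = descriptors[:, selected]
    chosen = (chosen - chosen.mean(axis=0)) / (chosen.std(axis=0) + 1e-8)
    return np.concatenate([embeddings, chosen], axis=1)


# Synthetic data (assumption): 200 molecules, 5 descriptors, 8-dim embeddings.
rng = np.random.default_rng(0)
n = 200
descriptors = rng.normal(size=(n, 5))
# Column 3 is a near-duplicate of column 0, so only one of them should survive.
descriptors[:, 3] = descriptors[:, 0] + 0.01 * rng.normal(size=n)
target = 2.0 * descriptors[:, 0] - 1.0 * descriptors[:, 1] + 0.1 * rng.normal(size=n)
embeddings = rng.normal(size=(n, 8))  # placeholder for language-model embeddings

selected = select_features(descriptors, target, k=2)
fused = fuse(embeddings, descriptors, selected)
print("selected descriptor columns:", selected)
print("fused feature matrix shape:", fused.shape)
```

The fused matrix would then be passed to whatever downstream predictor is used for the target property. A correlation filter captures only marginal association, whereas a Markov blanket is defined through conditional independence, so the filter here should be read as a placeholder for the selection step rather than an equivalent of it.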