Is BigSMILES the Friend of Polymer Machine Learning?

Haoke Qiu, Zhao-Yan Sun
DOI: https://doi.org/10.26434/chemrxiv-2024-bxxhh-v2
2024-09-05
Abstract: Computational methods, exemplified by machine learning (ML), have provided theoretical guidance and solutions for the development of sustainable polymers, accelerating advancements in materials for societal needs such as equipment, environment, health, and green energy. In previous polymer ML workflows, the Simplified Molecular-Input Line-Entry System (SMILES) notation has consistently served as the primary representation of polymer structures, though the inherent randomness of polymers has long posed challenges for SMILES in polymer representation learning. Recently, BigSMILES and its extensions have paved the way for a more versatile and concise representation of polymer structures. However, whether BigSMILES outperforms SMILES in polymer ML workflows has yet to be systematically explored and demonstrated. To fill this gap, we conducted extensive experiments encompassing a variety of polymer property prediction and inverse design tasks based on both image and text inputs. Our findings reveal that in 11 tasks involving homopolymer systems, BigSMILES-based ML workflows exhibit performance comparable to or even exceeding that of SMILES, underscoring the efficacy of BigSMILES in representing polymer structures. Furthermore, BigSMILES offers a more compact textual representation than SMILES, significantly reducing the computational cost of model training, particularly for large language models. Through these comprehensive experiments, we demonstrate for the first time that BigSMILES can achieve performance on par with SMILES while also facilitating faster model training and reducing energy consumption, which could have a substantial impact on a wide range of future polymer tasks, including property prediction (and classification) and polymer generation across various polymer types.
Chemistry
### What problems does this paper attempt to solve?

This paper examines how the BigSMILES representation performs in polymer machine learning (ML) workflows, and in particular its advantages and limitations relative to the traditional SMILES representation. Specifically, it addresses the following questions:

1. **Is BigSMILES better than SMILES?** The paper systematically compares BigSMILES and SMILES on polymer ML tasks, including polymer property prediction and inverse design. These tasks are based on image and text inputs and are evaluated using convolutional neural networks (CNNs), deep neural networks (DNNs), and large language models (LLMs).
2. **The compactness and computational efficiency of BigSMILES**: The paper investigates whether BigSMILES provides a more compact text representation, thereby reducing the computational cost of model training, especially for large language models. It reports that BigSMILES can significantly shorten training time and reduce energy consumption.
3. **The performance of BigSMILES in generation tasks**: The paper also explores polymer generation with BigSMILES. Although BigSMILES performs well in some respects, generating chemically valid BigSMILES strings remains challenging, which indicates that the BigSMILES syntax rules and pre-trained models need further optimization.

### Main findings

- **Performance comparison**: In 11 tasks involving homopolymer systems, the ML workflow based on BigSMILES performs as well as or better than the one based on SMILES.
- **Compact representation**: BigSMILES provides a more compact text representation than SMILES, reducing the computational resources required for model training.
- **Training speed**: With large language models, BigSMILES trains faster, which helps accelerate model iteration and reduce energy consumption.
- **Challenges in generation tasks**: Although BigSMILES performs well in some tasks, generating chemically valid BigSMILES strings still needs improvement.

Overall, this paper addresses the lack of a systematic evaluation of BigSMILES in polymer ML applications and provides an important reference for future research and applications.