SynAsk: Unleashing the Power of Large Language Models in Organic Synthesis

Chonghuan Zhang, Qianghua Lin, Biwei Zhu, Haopeng Yang, Xiao Lian, Hao Deng, Jiajun Zheng, Kuangbiao Liao
2024-06-14
Abstract: The field of natural language processing (NLP) has witnessed a transformative shift with the emergence of large language models (LLMs), revolutionizing various language tasks and applications, and the integration of LLMs into specialized domains enhances their capabilities for domain-specific applications. Notably, NLP has made significant strides in organic chemistry, particularly in predicting synthetic tasks, paving the way for the development of LLMs tailored to the organic chemistry field. In this work, we introduce SynAsk, a comprehensive organic chemistry domain-specific LLM platform developed by AIChemEco Inc. By fine-tuning an LLM with domain-specific data and integrating it with a chain-of-thought approach, SynAsk seamlessly accesses our knowledge base and advanced chemistry tools in a question-and-answer format. This includes functionalities such as a basic chemistry knowledge base, molecular information retrieval, reaction performance prediction, retrosynthesis prediction, chemical literature acquisition, and more. This novel methodology synergizes fine-tuning techniques with external resource integration, resulting in an organic chemistry-specific model poised to facilitate research and discovery in the field. Accessible via http://synask.aichemeco.com, SynAsk represents a significant advancement in leveraging NLP for synthetic applications.
Chemical Physics, Biomolecules
What problem does this paper attempt to address?
The main goal of this paper is to introduce and develop a large language model (LLM) platform named SynAsk in the field of organic chemistry. This platform aims to address the following key issues:

1. **Enhancing the professional capabilities of language models in the field of organic chemistry**: Existing large language models, while performing well in natural language processing tasks, face challenges in professional tasks that require a deep understanding of molecular structures. To address this issue, researchers have fine-tuned SynAsk with domain-specific data and integrated chain-of-thought methods, enabling it to better understand and execute tasks related to organic chemistry.
2. **Building a comprehensive organic chemistry tool platform**: SynAsk is not just a language model; it also integrates a series of organic chemistry tools, such as molecular information retrieval, reaction performance prediction, and retrosynthesis prediction, to provide a one-stop solution. This resolves the problem of users needing to find tools from different sources when conducting organic synthesis research.
3. **Improving interaction between the model and external tools**: By optimizing prompt strategies and specially designed tool formats, SynAsk enhances the model's ability to recognize required actions and their corresponding inputs. Additionally, by seamlessly connecting local knowledge bases and internal and external open-source tools through the LangChain framework, the interaction efficiency between the model and tools is improved.
4. **Increasing prediction accuracy**: Particularly in reaction yield prediction, SynAsk can provide more accurate prediction results by training on experimental data of common reaction types. Furthermore, there are significant improvements in tasks such as retrosynthesis planning.
5. **Enhancing cross-model multi-task learning capabilities**: Through comprehensive evaluation, SynAsk performs excellently on multiple metrics, especially achieving significant results in chemistry-related tasks, demonstrating its strong ability to solve complex chemical problems.

In summary, SynAsk aims to utilize advanced language model technology and professional knowledge in organic chemistry to create a powerful toolset that promotes the development and innovation of organic chemistry research.
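
The third point above, in which the model recognizes a required action and its input and the platform routes it to a tool, can be illustrated with a minimal sketch. This is not SynAsk's implementation and does not use the LangChain API: the `Action:` / `Action Input:` text format, the tool names `MoleculeInfo` and `YieldPrediction`, and the stub functions behind them are all hypothetical, chosen only to show the general pattern of parsing a model's output and dispatching to a registered tool.

```python
import re
from typing import Callable, Dict, Optional, Tuple


# Hypothetical stand-in tools. A real platform would query a database
# or run a trained predictive model here.
def molecule_info(smiles: str) -> str:
    return f"molecule record for {smiles}"


def yield_prediction(reaction: str) -> str:
    return f"predicted yield for {reaction}"


# Registry mapping the tool names the model may emit to callables.
TOOLS: Dict[str, Callable[[str], str]] = {
    "MoleculeInfo": molecule_info,
    "YieldPrediction": yield_prediction,
}

# Assumed output format: an "Action:" line followed by an "Action Input:" line.
ACTION_RE = re.compile(r"Action:\s*(?P<name>\S+)\s*\nAction Input:\s*(?P<arg>.+)")


def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Return (tool name, tool input) if the output requests a tool, else None."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        return None
    return match.group("name"), match.group("arg").strip()


def dispatch(llm_output: str) -> str:
    """Run the requested tool, or pass the text through as a final answer."""
    parsed = parse_action(llm_output)
    if parsed is None:
        return llm_output
    name, arg = parsed
    tool = TOOLS.get(name)
    if tool is None:
        return f"Unknown tool: {name}"
    return tool(arg)


if __name__ == "__main__":
    print(dispatch("Thought: I need data.\nAction: MoleculeInfo\nAction Input: CCO"))
```

A strict, line-based format like this makes it easy to tell a tool request from a final answer, which is one plausible reason a platform would design its tool formats carefully.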