CataLM: Empowering Catalyst Design Through Large Language Models

Ludi Wang, Xueqing Chen, Yi Du, Yuanchun Zhou, Yang Gao, Wenjuan Cui
2024-05-13
Abstract: The field of catalysis holds paramount importance in shaping the trajectory of sustainable development, prompting intensive research efforts to leverage artificial intelligence (AI) in catalyst design. Presently, the fine-tuning of open-source large language models (LLMs) has yielded significant breakthroughs across various domains such as biology and healthcare. Drawing inspiration from these advancements, we introduce CataLM (Catalytic Language Model), a large language model tailored to the domain of electrocatalytic materials. Our findings demonstrate that CataLM exhibits remarkable potential for facilitating human-AI collaboration in catalyst knowledge exploration and design. To the best of our knowledge, CataLM stands as the pioneering LLM dedicated to the catalyst domain, offering novel avenues for catalyst discovery and development.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper asks how large language models (LLMs) can advance catalyst design, particularly for electrocatalytic materials. The authors develop a large language model named CataLM to overcome the shortcomings of existing models in extracting and understanding catalyst knowledge. Through domain pre-training and instruction fine-tuning on electrocatalytic materials, CataLM can better understand and process catalyst-related text, giving scientists a more effective tool for catalyst knowledge exploration and design.

### Main problems:

1. **Complexity and diversity of catalyst knowledge**: Catalyst design involves multiple variables such as synthesis, composition, structure, and performance. This information is scattered across a large body of scientific literature, which makes useful information difficult to extract.
2. **Limitations of existing large language models**: Although existing large language models perform well in general domains, they lack sufficient expertise in the catalyst field and cannot meet its specific requirements.
3. **Data scarcity and annotation difficulties**: High-quality datasets in the catalyst field are relatively scarce and must be annotated by experts, which makes model training more difficult.

### Solutions:

- **Development of CataLM**: Starting from the Vicuna-13B model, two training stages (domain pre-training and instruction fine-tuning) give the model a deep understanding of electrocatalytic materials.
- **Domain pre-training**: The model is pre-trained on a large amount of literature on electrocatalytic materials so that it learns the technical terms and knowledge related to catalysts.
- **Instruction fine-tuning**: The model is fine-tuned on an expert-annotated dataset to further improve its performance on specific tasks, such as entity recognition and control method recommendation.
- **Evaluation and verification**: Experiments on the entity recognition and control method recommendation tasks verify the effectiveness of CataLM and demonstrate its potential in catalyst design.

### Goals:

- Provide a powerful tool that helps scientists design and study catalysts more efficiently.
- Promote collaboration between humans and AI and accelerate innovation in the catalyst field.

Through these efforts, CataLM is expected to open new possibilities for catalyst design and advance technologies related to sustainable development.
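
To make the instruction fine-tuning step more concrete, here is a minimal sketch of how one expert-annotated sentence might be turned into a training record for the entity recognition task. It does not reproduce the authors' data format. The record keys (`instruction`, `input`, `output`), the function name `build_ner_record`, the prompt wording, and the example sentence are all assumptions made for illustration. Only the four entity categories are taken from the design variables named in this summary.

```python
import json

# Entity categories borrowed from the design variables named in the summary.
# Using them as labels for entity recognition is an assumption of this sketch.
ENTITY_TYPES = ["synthesis", "composition", "structure", "performance"]


def build_ner_record(sentence, entities):
    """Build one hypothetical instruction-tuning record for entity recognition.

    `entities` maps an entity type to the list of text spans of that type.
    """
    unknown = set(entities) - set(ENTITY_TYPES)
    if unknown:
        raise ValueError(f"unknown entity types: {sorted(unknown)}")

    # Every annotated span must literally occur in the sentence.
    missing = [s for spans in entities.values() for s in spans if s not in sentence]
    if missing:
        raise ValueError(f"spans not found in sentence: {missing}")

    instruction = (
        "Extract the catalyst entities from the text and answer in JSON "
        "with the keys: " + ", ".join(ENTITY_TYPES) + "."
    )
    # Emit every key, even when empty, so the target format stays fixed.
    target = {t: list(entities.get(t, [])) for t in ENTITY_TYPES}
    return {
        "instruction": instruction,
        "input": sentence,
        "output": json.dumps(target),
    }


if __name__ == "__main__":
    # Made-up example sentence, not taken from the paper's dataset.
    record = build_ner_record(
        "Cu nanocubes were prepared by electrodeposition.",
        {
            "composition": ["Cu"],
            "structure": ["nanocubes"],
            "synthesis": ["electrodeposition"],
        },
    )
    print(json.dumps(record, indent=2))
```

Keeping the output as a fixed-key JSON string is one possible design choice: it lets predictions be parsed and scored automatically against the expert annotations, though the paper may well use a different target format.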