Quokka: An Open-source Large Language Model ChatBot for Material Science

Xianjun Yang, Stephen D. Wilson, Linda Petzold
2024-01-02
Abstract: This paper presents the development of a specialized chatbot for materials science, leveraging the Llama-2 language model, and continuing pre-training on the expansive research articles in the materials science domain from the S2ORC dataset. The methodology involves an initial pretraining phase on over one million domain-specific papers, followed by an instruction-tuning process to refine the chatbot's capabilities. The chatbot is designed to assist researchers, educators, and students by providing instant, context-aware responses to queries in the field of materials science. We make the four trained checkpoints (7B, 13B, with or without chat ability) freely available to the research community at
Computation and Language; Artificial Intelligence; Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
This paper addresses the lack of a specialized chatbot for materials science. The authors start from the Llama-2 language model and continue pre-training it on over one million materials science research articles drawn from the S2ORC dataset. The approach has two stages: first, continued pre-training on this domain-specific corpus to strengthen the model's grasp of materials science knowledge; second, instruction tuning so that the model can understand and answer materials science questions. The resulting open-source model family, Quokka, is released as four checkpoints: 7B and 13B, each with and without chat ability. These models aim to give researchers, educators, and students instant, context-aware answers to materials science queries. The paper concludes with plans for future work, including finer-grained instruction collection and extending the model to multimodal inputs to improve its visual understanding.
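
The two stages differ mainly in how training examples are built for the language-modeling loss. The sketch below illustrates that difference in plain Python. It is not the authors' code: the function names, the block size, and the choice to mask prompt tokens during instruction tuning are assumptions made for illustration, since the summary does not give these details. The only external convention relied on is that a label of -100 is ignored by PyTorch's cross-entropy loss by default.

```python
from typing import Dict, List

# Label value skipped by PyTorch's CrossEntropyLoss by default (ignore_index=-100).
IGNORE_INDEX = -100


def pack_for_pretraining(token_ids: List[int], block_size: int) -> List[Dict[str, List[int]]]:
    """Stage 1 (hypothetical helper): cut a long token stream into fixed-size blocks.

    Every token in a block is a prediction target. A trailing remainder
    shorter than block_size is dropped.
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        chunk = token_ids[start:start + block_size]
        blocks.append({"input_ids": chunk, "labels": list(chunk)})
    return blocks


def build_instruction_example(prompt_ids: List[int], response_ids: List[int]) -> Dict[str, List[int]]:
    """Stage 2 (hypothetical helper): join prompt and response into one sequence.

    Assumption: only response tokens contribute to the loss, so prompt
    positions receive IGNORE_INDEX as their label.
    """
    return {
        "input_ids": list(prompt_ids) + list(response_ids),
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(response_ids),
    }


if __name__ == "__main__":
    # Toy token ids standing in for a tokenized materials science article.
    print(pack_for_pretraining(list(range(10)), block_size=4))
    # Toy ids standing in for a tokenized question and its answer.
    print(build_instruction_example([1, 2, 3], [7, 8]))
```

In the first stage the model learns from every token of the domain text, while in the second stage (under the masking assumption above) it is only penalized for the answer it produces, not for reproducing the question.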