SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models

Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, Jie Tang
2024-11-18
Abstract: Large Language Models (LLMs) have shown promise in assisting scientific discovery. However, such applications are currently limited by LLMs' deficiencies in understanding intricate scientific concepts, deriving symbolic equations, and solving advanced numerical calculations. To bridge these gaps, we introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework to address the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a process of self-reflective critic-and-revise. Applying this framework, we curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs. We analyze the curated SciInstruct from multiple interesting perspectives (e.g., domain, scale, source, question type, answer length, etc.). To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath, enhancing their scientific and mathematical reasoning capabilities, without sacrificing the language understanding capabilities of the base model. We release all codes and SciInstruct at https://github.com/THUDM/SciGLM.
Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is that current large language models (LLMs) are deficient in understanding complex scientific concepts, deriving symbolic equations, and solving advanced numerical calculations, which leads to poor performance on university-level scientific problems. Specifically, even advanced LLMs such as GPT-3.5 and GPT-4 achieve an accuracy of only 28.52% on some university-level textbook problems. These problems usually require combining physical concepts and axioms, selecting and deriving formal equations, and carrying out rigorous numerical calculations.

To bridge these gaps, the authors introduce **SciInstruct**, an instruction set for training scientific language models, with the aim of giving models university-level scientific reasoning abilities. By constructing SciInstruct, the authors hope to address the following key problems:

1. **Scarcity of data in scientific fields**: High-quality data in scientific fields is relatively scarce, especially data containing detailed reasoning steps.
2. **Improving scientific reasoning abilities**: Improve the scientific reasoning abilities of LLMs so that they can better understand and solve complex scientific problems.
3. **Ensuring the generalization ability of models**: Enhance scientific reasoning abilities without sacrificing model performance on general language understanding tasks.

### Solutions

To solve the above problems, the authors propose the following main methods:

1. **Self-Reflective Instruction Annotation Framework**:
   - Use existing LLMs to generate step-by-step reasoning for unlabeled scientific problems.
   - Through a self-reflective critic-and-revise mechanism, let the LLMs identify their own errors and correct them until the correct answer is obtained.
2. **Constructing diverse and high-quality datasets**:
   - SciInstruct covers multiple scientific fields: physics, chemistry, mathematics, and formal proofs (Lean).
   - Data sources include textbooks, teaching materials, problem sets, etc., ensuring the diversity and coverage of the data.
3. **Model fine-tuning and evaluation**:
   - Use SciInstruct to fine-tune different LLMs (such as ChatGLM3, Llama3-8B-Instruct, and Mistral-7B: MetaMath).
   - Evaluate the fine-tuned models on multiple scientific and mathematical benchmarks to verify their improvements on scientific reasoning tasks.

Through these methods, the authors improved the performance of LLMs on scientific reasoning tasks without harming their abilities on general language understanding tasks. Finally, they released all the code and the SciInstruct dataset for further research and application.
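
The self-reflective critic-and-revise loop described under Solutions can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `annotate`, `Annotation`, `toy_generate`, and `toy_revise` are hypothetical, the two toy functions stand in for real LLM calls, and comparing the final answer against a reference answer with a fixed revision budget is an assumption made here to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# (step-by-step reasoning, final answer); a hypothetical representation
Solution = Tuple[str, str]


@dataclass
class Annotation:
    question: str
    reasoning: str
    answer: str
    revisions: int  # number of revise rounds needed before the answer matched


def annotate(
    question: str,
    reference: str,
    generate: Callable[[str], Solution],
    revise: Callable[[str, Solution], Solution],
    max_revisions: int = 2,
) -> Optional[Annotation]:
    """Generate a solution, then critique and revise it until the final
    answer matches the reference or the revision budget is exhausted.
    Returns None when no attempt produced a matching answer."""
    solution = generate(question)
    for revisions in range(max_revisions + 1):
        reasoning, answer = solution
        if answer.strip() == reference.strip():
            return Annotation(question, reasoning, answer, revisions)
        if revisions < max_revisions:
            # Assumption: the model is shown its previous attempt and asked
            # to find and fix the error.
            solution = revise(question, solution)
    return None


# Toy stand-ins for LLM calls (assumptions, for illustration only).
def toy_generate(question: str) -> Solution:
    return ("2 + 2 * 3 = 4 * 3 = 12", "12")  # deliberately wrong first attempt


def toy_revise(question: str, previous: Solution) -> Solution:
    return ("Multiplication binds tighter: 2 + (2 * 3) = 2 + 6 = 8", "8")


result = annotate("Compute 2 + 2 * 3.", "8", toy_generate, toy_revise)
print(result)
```

In this sketch, questions whose answers never match are dropped (`None`) rather than kept with unverified reasoning. That filtering choice is one plausible way to keep only checked step-by-step solutions in an instruction dataset.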