Abstract: Large Language Models (LLMs) have shown promise in assisting scientific discovery. However, such applications are currently limited by LLMs' deficiencies in understanding intricate scientific concepts, deriving symbolic equations, and solving advanced numerical calculations. To bridge these gaps, we introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework to address the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a process of self-reflective critic-and-revise. Applying this framework, we curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs. We analyze the curated SciInstruct from multiple interesting perspectives (e.g., domain, scale, source, question type, and answer length). To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath, enhancing their scientific and mathematical reasoning capabilities without sacrificing the language understanding capabilities of the base model. We release all code and SciInstruct at https://github.com/THUDM/SciGLM.

What problem does this paper attempt to address?

The main problem this paper addresses is that current large language models (LLMs) are deficient in understanding complex scientific concepts, deriving symbolic equations, and carrying out advanced numerical calculations, which leads to poor performance on university-level scientific problems. Specifically, even advanced LLMs such as GPT-3.5 and GPT-4 reach an accuracy of only 28.52% on some university-level textbook problems. These problems usually require combining physical concepts and axioms, selecting and deriving formal equations, and performing rigorous numerical calculations. To bridge these gaps, the authors introduce **SciInstruct**, an instruction set for training scientific language models, with the aim of giving the models university-level scientific reasoning abilities.

By constructing SciInstruct, the authors aim to address the following key problems:

1. **Scarcity of data in scientific fields**: High-quality data in scientific fields is relatively scarce, especially data containing detailed reasoning steps.
2. **Improving scientific reasoning abilities**: Stronger scientific reasoning lets LLMs better understand and solve complex scientific problems.
3. **Ensuring the generalization ability of models**: Gains in scientific reasoning should not come at the cost of performance on general language understanding tasks.

### Solutions

To solve the above problems, the authors propose the following main methods:

1. **Self-Reflective Instruction Annotation Framework**:
   - Use existing LLMs to generate the step-by-step reasoning process for unlabeled scientific problems.
   - Through a self-reflective critic-and-revise mechanism, let the LLM identify its own errors and correct them until the correct answer is obtained.
2. **Constructing diverse and high-quality datasets**:
   - SciInstruct covers multiple scientific fields: physics, chemistry, mathematics, and formal proofs (Lean).
   - Data sources include textbooks, teaching materials, and problem sets, which broadens the diversity and coverage of the data.
3. **Model fine-tuning and evaluation**:
   - Use SciInstruct to fine-tune different LLMs (ChatGLM3, Llama3-8B-Instruct, and Mistral-7B: MetaMath).
   - Evaluate the fine-tuned models on multiple scientific and mathematical benchmarks to verify their improvements on scientific reasoning tasks.

With these methods, the authors report improved performance of LLMs on scientific reasoning tasks without degrading their general language understanding. They also release all the code and the SciInstruct dataset for further research and application.
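
The critic-and-revise loop described under Solutions can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `annotate`, `extract_answer`, and `scripted_llm`, the prompt wording, the `Answer:` output convention, and the use of a known reference answer to decide when a solution is correct are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Hypothetical interface: any callable that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Annotation:
    question: str
    reasoning: str  # full step-by-step solution text
    answer: str     # final answer that matched the reference
    rounds: int     # number of revise rounds that were needed


def extract_answer(solution: str) -> str:
    """Return the text after the last 'Answer:' marker (assumed convention)."""
    marker = "Answer:"
    idx = solution.rfind(marker)
    return solution[idx + len(marker):].strip() if idx != -1 else ""


def annotate(question: str, reference_answer: str, llm: LLM,
             max_rounds: int = 3) -> Optional[Annotation]:
    """Generate step-by-step reasoning, then critic-and-revise until the
    final answer matches the reference or the round budget is exhausted."""
    solution = llm(
        "Solve step by step and end with 'Answer: <value>'.\n"
        f"Question: {question}"
    )
    for rounds in range(max_rounds + 1):
        if extract_answer(solution) == reference_answer:
            return Annotation(question, solution, reference_answer, rounds)
        if rounds == max_rounds:
            break
        # Ask the same model to find its own error and produce a revision.
        solution = llm(
            "The solution below contains an error. Identify the mistake, "
            "then give a corrected step-by-step solution ending with "
            "'Answer: <value>'.\n"
            f"Question: {question}\nPrevious solution: {solution}"
        )
    return None  # discard questions that are never solved correctly


def scripted_llm(replies: Iterable[str]) -> LLM:
    """Stand-in for a real model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


if __name__ == "__main__":
    llm = scripted_llm(["2 + 2 = 5. Answer: 5", "2 + 2 = 4. Answer: 4"])
    print(annotate("What is 2 + 2?", "4", llm))
```

In this sketch, questions that never reach the reference answer are dropped rather than kept with flawed reasoning, which is one simple way to keep low-quality traces out of the resulting instruction data; the paper's own filtering may differ.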

SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models

SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding

EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data

Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding

Language Models as Science Tutors

SciAgent: Tool-augmented Language Models for Scientific Reasoning

VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research