Testing LLM performance on the Physics GRE: some observations

Pranav Gupta
2023-12-08
Abstract: With the recent developments in large language models (LLMs) and their widespread availability through open source models and/or low-cost APIs, several exciting products and applications are emerging, many of which are in the field of STEM educational technology for K-12 and university students. There is a need to evaluate these powerful language models on several benchmarks, in order to understand their risks and limitations. In this short paper, we summarize and analyze the performance of Bard, a popular LLM-based conversational service made available by Google, on the standardized Physics GRE examination.
Physics Education, Machine Learning
What problem does this paper attempt to address?
The paper evaluates how large language models (LLMs) perform in physics, specifically on the standardized Physics Graduate Record Examination (Physics GRE). The authors chose Google's conversational service Bard as the subject of their study because it supports image input, which is advantageous for physics problems that include special symbols and diagrams. The issues the paper attempts to address include:

1. **Evaluating the capabilities of large language models**: Given the potential applications of large language models in STEM education technology, the paper aims to assess these models' capabilities and limitations through standardized testing.
2. **Understanding model performance in complex subjects**: Because physics is a complex discipline, the authors want to understand whether a large language model like Bard can correctly answer questions involving multiple physics concepts.
3. **Exploring the reliability and accuracy of the model**: Beyond checking whether the answers are correct, the research also examines whether the model references relevant scientific concepts correctly and whether it avoids random guessing and generating incorrect information (hallucination).

Through this research, the paper hopes to provide empirical evidence on the application of large language models in physics education and to point out directions for future improvement.