System 2 thinking in OpenAI's o1-preview model: Near-perfect performance on a mathematics exam

Joost de Winter, Dimitra Dodou, Yke Bauke Eisma
DOI: https://doi.org/10.3390/computers13110278
2024-10-25
Abstract: The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI's benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch 'Mathematics B' final exam. It scored a near-perfect 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch students' average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model contamination (i.e., the knowledge cutoff for o1-preview and GPT-4o was after the exam was published online), we repeated the procedure with a new Mathematics B exam that was published after the cutoff date. The results again indicated that o1-preview performed strongly (97.8th percentile), which suggests that contamination was not a factor. We also show that there is some variability in the output of o1-preview, which means that sometimes there is 'luck' (the answer is correct) or 'bad luck' (the output has diverged into something that is incorrect). We demonstrate that the self-consistency approach, where repeated prompts are given and the most common answer is selected, is a useful strategy for identifying the correct answer. It is concluded that while OpenAI's new model series holds great potential, certain risks must be considered.
Computers and Society, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper evaluates whether OpenAI's new large language model (LLM), o1-preview, can perform complex, logical reasoning (i.e., System 2 thinking), and verifies its performance on mathematics exams. Specifically, the researchers set out to answer the following questions:

1. **Can the o1-preview model perform complex logical reasoning like humans?**
   - The researchers tested the reasoning ability of o1-preview by having it complete the Dutch high-school Mathematics B final exam. This exam assesses students' abstract mathematical abilities and is relatively difficult.
2. **Does the o1-preview model perform better than existing LLMs?**
   - The researchers compared o1-preview with GPT-4o, one of the most advanced LLMs at the time, which was not designed for System 2-like reasoning. Comparing the two models' scores shows how much o1-preview gains in logical reasoning.
3. **Is the performance of the o1-preview model reproducible?**
   - The researchers ran the same exam questions multiple times to see whether o1-preview's performance was stable and how much its output varied.
4. **Is the performance of the o1-preview model affected by training-data contamination?**
   - Because the first exam was published online before the models' knowledge cutoff, the researchers repeated the procedure with a new Mathematics B exam published after the cutoff date, so the model could not have seen the answers during training.
5. **Can the self-consistency method improve the accuracy of the o1-preview model?**
   - The researchers prompted the model with the same question multiple times and selected the most common answer, then checked whether this raised the rate of correct answers.

### Main Conclusions

The results show that o1-preview performed excellently on two different mathematics exams, approaching full marks. On the 2023 exam, o1-preview scored 76 and 74 out of 76 in its two runs, while GPT-4o scored 66 and 62; the Dutch students' average was 40.63. On the 2024 exam, o1-preview again performed strongly, scoring 71 (out of 76), at the 97.8th percentile. This indicates that o1-preview has strong logical-reasoning abilities and can handle complex mathematics problems, and that contamination does not explain its earlier result.

The research also found some variability in o1-preview's output: a single run can land on the correct answer or diverge into an incorrect one. Prompting repeatedly and applying the self-consistency method improves the rate of correct answers, which offers practical guidance on using LLMs for complex tasks. Overall, the research both verifies o1-preview's logical-reasoning ability and explores its potential and limitations in practical applications.
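
The self-consistency idea described above (prompt repeatedly, keep the most common answer) can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function name `self_consistent_answer`, the whitespace and case normalization, and the sample answers are all assumptions made for the example, and the summary does not say how the paper extracted or compared final answers.

```python
from collections import Counter


def self_consistent_answer(answers):
    """Return (most common answer, fraction of runs that agree with it).

    `answers` holds the final answer string from each repeated prompt.
    Hypothetical helper for illustration only.
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Assumed normalization: drop whitespace and case so that
    # "x = 2" and "X=2" are counted as the same answer.
    normalized = ["".join(a.split()).lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Made-up final answers from five repeated prompts of one question.
runs = ["x = 2", "x=2", "x = 3", "x = 2", "X = 2"]
answer, agreement = self_consistent_answer(runs)
print(answer, agreement)
```

The agreement fraction is a rough signal of how far to trust the vote: a low value means the runs diverged, which matches the 'luck' versus 'bad luck' variability noted in the abstract.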