Benchmarking Large Language Models with Integer Sequence Generation Tasks

Daniel O'Malley,Manish Bhattarai,Javier Santos
2024-11-07
Abstract:This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Integer Sequences (OEIS), a widely-used resource for mathematical sequences. The benchmark is designed to evaluate both the correctness of the generated code and its computational efficiency. Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences. In order to ensure models do not exploit memorized sequence values, we introduce an automated cheating detection mechanism that flags the use of lookup tables and validated this automation against human cheating evaluations. This benchmark provides a meaningful challenge for current LLMs, offering insights into their mathematical reasoning and code writing capabilities, which can guide future research directions and model development in mathematical reasoning and code synthesis.
Machine Learning,Artificial Intelligence,Software Engineering
What problem does this paper attempt to address?
The problem that this paper attempts to solve is to evaluate and compare the performance of large language models (LLMs) in generating integer - sequence code, especially their capabilities in mathematical reasoning and computational efficiency. Specifically, the paper evaluates these models by introducing a new benchmark test based on the integer - sequence generation task in the Online Encyclopedia of Integer Sequences (OEIS). This benchmark test examines not only the correctness of the generated code but also its computational efficiency, and prevents the models from cheating by using lookup tables through an automatic detection mechanism. ### Main problems and goals: 1. **Evaluate mathematical reasoning and code - writing abilities**: How capable are current LLMs in handling complex mathematical problems and writing efficient code? 2. **Prevent cheating**: Ensure that the models truly have the ability to generate algorithms rather than relying on memory or lookup tables. 3. **Design diverse and challenging tasks**: Comprehensively evaluate the performance of the models by selecting integer sequences of different difficulty levels. 4. **Provide meaningful challenges**: Provide a testing platform for the current state - of - the - art LLMs that can reveal their strengths and limitations and guide future research directions. ### Solutions: - **Benchmark test design**: 500 integer sequences (250 simple sequences and 250 complex sequences) were selected from OEIS, and the models were required to generate Python code to calculate the first N terms of these sequences. - **Evaluation metrics**: Including accuracy, efficiency, and whether lookup tables were used. - **Cheating - detection mechanism**: Use the structured - output function of GPT - 4o to automatically detect and mark cases where lookup tables were used, ensuring that the models generate legal code. Through this benchmark test, researchers can gain a deeper understanding of the capabilities of LLMs in mathematical reasoning and code synthesis, and discover the shortcomings of existing models, thereby promoting the further development of related fields.