Abstract:This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Integer Sequences (OEIS), a widely-used resource for mathematical sequences. The benchmark is designed to evaluate both the correctness of the generated code and its computational efficiency. Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences. In order to ensure models do not exploit memorized sequence values, we introduce an automated cheating detection mechanism that flags the use of lookup tables and validated this automation against human cheating evaluations. This benchmark provides a meaningful challenge for current LLMs, offering insights into their mathematical reasoning and code writing capabilities, which can guide future research directions and model development in mathematical reasoning and code synthesis.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is to evaluate and compare the performance of large language models (LLMs) in generating integer - sequence code, especially their capabilities in mathematical reasoning and computational efficiency. Specifically, the paper evaluates these models by introducing a new benchmark test based on the integer - sequence generation task in the Online Encyclopedia of Integer Sequences (OEIS). This benchmark test examines not only the correctness of the generated code but also its computational efficiency, and prevents the models from cheating by using lookup tables through an automatic detection mechanism. ### Main problems and goals: 1. **Evaluate mathematical reasoning and code - writing abilities**: How capable are current LLMs in handling complex mathematical problems and writing efficient code? 2. **Prevent cheating**: Ensure that the models truly have the ability to generate algorithms rather than relying on memory or lookup tables. 3. **Design diverse and challenging tasks**: Comprehensively evaluate the performance of the models by selecting integer sequences of different difficulty levels. 4. **Provide meaningful challenges**: Provide a testing platform for the current state - of - the - art LLMs that can reveal their strengths and limitations and guide future research directions. ### Solutions: - **Benchmark test design**: 500 integer sequences (250 simple sequences and 250 complex sequences) were selected from OEIS, and the models were required to generate Python code to calculate the first N terms of these sequences. - **Evaluation metrics**: Including accuracy, efficiency, and whether lookup tables were used. - **Cheating - detection mechanism**: Use the structured - output function of GPT - 4o to automatically detect and mark cases where lookup tables were used, ensuring that the models generate legal code. Through this benchmark test, researchers can gain a deeper understanding of the capabilities of LLMs in mathematical reasoning and code synthesis, and discover the shortcomings of existing models, thereby promoting the further development of related fields.

Benchmarking Large Language Models with Integer Sequence Generation Tasks

Benchmarking Large Language Models for Math Reasoning Tasks

From Code to Play: Benchmarking Program Search for Games Using Large Language Models

MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks

LMentry: A Language Model Benchmark of Elementary Language Tasks

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Benchmarking Large Language Models for Bio-Image Analysis Code Generation

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Code Simulation Challenges for Large Language Models

AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models

A Performance Study of LLM-Generated Code on Leetcode

Large Language Models as Test Case Generators: Performance Evaluation and Enhancement

Generating Unseen Code Tests In Infinitum

Benchmarking Benchmark Leakage in Large Language Models

Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard