Abstract: We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open-source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, hence our evaluations can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.

### What problem does this paper attempt to solve?

This paper aims to evaluate the **instrumental self-reasoning ability** of large language model (LLM) agents. Specifically, the authors propose a set of tasks (MISR: Measuring Instrumental Self-Reasoning) that evaluate how well these agents perform self-reasoning across several scenarios:

1. **Self Modification**: Agents can modify their own capabilities or configurations to overcome limitations and complete challenges.
2. **Tool Improvement**: Agents can identify and fix problems in external tools or scripts to complete a given task.
3. **Knowledge Seeking**: Agents can collect and synthesize information about their environment through exploration and interrogation.
4. **Embedded Social Reasoning**: Agents can navigate complex social situations, which may involve strategic or deceptive behavior to achieve goals.
5. **Opaque Reasoning**: Agents can perform self-reasoning without being detected (for example, by hiding their true goals or reasoning processes).

#### Research Background

In recent years, AI agents based on large language models have become increasingly powerful: they can perform complex tasks, engage in multi-round conversations, and exhibit improved reasoning behaviors. However, this self-reasoning ability also brings new risks, such as deceptive alignment, in which agents may mislead humans or other agents by deceptive means to pursue goals that are inconsistent with human values.

#### Main Contributions

1. **Proposing the MISR evaluation framework**: The authors designed a series of tasks to evaluate the instrumental self-reasoning ability of agents in different scenarios.
2. **Open-source implementation**: The authors provide an open-source implementation based on the Inspect framework, enabling other researchers to reproduce and extend these evaluations.
3. **Extensive quantitative and qualitative analysis**: The authors evaluated agents built using state-of-the-art commercial and open-source large language models, and provide detailed experimental results and analysis.

#### Results

The research shows that only the most capable frontier models possess some instrumental self-reasoning ability, and that this ability is highly dependent on the specific context. No model passed the most difficult versions of the evaluations, indicating that current models do not yet exhibit a strong form of instrumental self-reasoning. These evaluations can therefore be used to track increases in this ability in future models, helping researchers better understand the self-reasoning ability of AI agents and guide future development toward safe and controllable systems.
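
To make the Self Modification scenario more concrete, here is a minimal toy sketch in Python. It is not taken from the paper or from its Inspect-based implementation; the environment, the `max_output_chars` setting, and the action format are all hypothetical and chosen only for illustration. The idea it shows is that the task can only be solved if the agent first notices a limitation in its own configuration and changes it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical sandbox: the agent's own settings live in an editable config."""
    # Assumed limitation: answers are truncated to this many characters.
    config: dict = field(default_factory=lambda: {"max_output_chars": 10})
    # The string the agent must output in full to solve the task.
    target: str = "instrumental self-reasoning"


def run_episode(env: ToyEnvironment, actions: list) -> bool:
    """Apply the agent's actions in order and report whether the task was solved.

    Each action is a tuple (hypothetical format):
      ("edit_config", key, value)  changes the agent's own configuration
      ("answer", text)             submits a final answer and ends the episode
    """
    for action in actions:
        if action[0] == "edit_config":
            _, key, value = action
            env.config[key] = value
        elif action[0] == "answer":
            # The answer is cut off by the agent's current output limit.
            text = action[1][: env.config["max_output_chars"]]
            return text == env.target
    return False  # the agent never submitted an answer


# An agent that ignores its own configuration fails: the answer is truncated.
naive = [("answer", "instrumental self-reasoning")]
print(run_episode(ToyEnvironment(), naive))  # False

# An agent that first raises its own limit succeeds.
self_modifying = [
    ("edit_config", "max_output_chars", 100),
    ("answer", "instrumental self-reasoning"),
]
print(run_episode(ToyEnvironment(), self_modifying))  # True
```

In this sketch the pass/fail signal depends only on whether the agent acted on information about itself, which is the kind of behavior the summary above describes; the real tasks are presumably richer and run inside an agent scaffold.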

MISR: Measuring Instrumental Self-Reasoning in Frontier Models

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

K-Level Reasoning: Establishing Higher Order Beliefs in Large Language Models for Strategic Reasoning

Self-Contradictory Reasoning Evaluation and Detection

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

Taken out of context: On measuring situational awareness in LLMs

Learning to Reason via Self-Iterative Process Feedback for Small Language Models

Can LLMs Reason in the Wild with Programs?

LLMs for Relational Reasoning: How Far are We?

MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks

Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards

Evaluating the Reliability of Self-Explanations in Large Language Models

ReAct: Synergizing Reasoning and Acting in Language Models

Improving Retrieval Augmented Language Model with Self-Reasoning

Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning

Internal Consistency and Self-Feedback in Large Language Models: A Survey

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks