Abstract: Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
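
The ensembling mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible scheme, a plain majority vote over answers sampled from several independent runs on the same multiple-choice question; the abstract does not spell out the exact procedure, so the function name `majority_vote`, the tie-breaking rule, and the sample answers are all hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first.

    `answers` is a list of answer labels (e.g. "A".."E"), one per model run.
    This is an illustrative voting rule, not the paper's exact procedure.
    """
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(answers)
    best = max(counts.values())
    # Walk the answers in their original order so ties resolve deterministically.
    for answer in answers:
        if counts[answer] == best:
            return answer


# Hypothetical answers from five runs on one multiple-choice question.
sampled_answers = ["B", "C", "B", "A", "B"]
print(majority_vote(sampled_answers))
```

Each extra run adds roughly one more model call per question, which is why the abstract describes ensembling as resource-intensive.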

### What problems does this paper attempt to solve?

This paper evaluates new-generation large language models (LLMs) on medical challenge tasks, focusing on how OpenAI's o1-preview model compares with earlier models (such as GPT-4) under different run-time strategies. Specifically, the paper addresses the following key problems:

1. **Capabilities of the new model**:
   - **Performance of the o1-preview model**: Measure the performance of the o1-preview model on multiple medical benchmarks, in particular whether it outperforms the earlier GPT-4 models without any prompting techniques.
   - **Inherent reasoning ability**: Evaluate whether the built-in chain-of-thought (CoT) reasoning of o1-preview reduces the need for external prompt engineering.
2. **Effectiveness of prompting techniques**:
   - **Effect of classic prompting techniques**: Systematically study how effective classic prompt engineering techniques (such as Medprompt) are for new-generation reasoning models, especially o1-preview.
   - **Impact of few-shot prompting**: Examine the impact of few-shot prompting on o1-preview; the paper finds that it hinders the model's performance.
3. **Efficacy of ensembling**:
   - **Effectiveness of ensembling strategies**: Study how much ensembling methods (such as majority voting) improve model performance. Although ensembling is resource-intensive, it remains viable.
   - **Trade-off between cost and performance**: Analyze the cost and accuracy of different run-time strategies and reveal a Pareto frontier that shows how different models balance cost against performance.
4. **New benchmark**:
   - **Multilingual benchmark**: Introduce a new multilingual benchmark, JMLE-2024, to evaluate how models handle non-English medical problems and to check whether their performance depends on memorizing existing benchmark data.
5. **Future research directions**:
   - **Innovation in inference-time strategies**: Explore how to allocate compute at inference time to improve the efficiency, accuracy, and reasoning ability of models, especially for run-time reasoning in large language models.

Through these studies, the paper aims to provide insights for understanding and optimizing the use of new-generation LLMs in specialized domains.
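
The cost and accuracy trade-off described above can be made concrete with a small sketch of how a Pareto frontier is computed. A configuration is on the frontier if no other configuration is at least as cheap and at least as accurate while being strictly better on one of the two. The `Run` type, the `pareto_frontier` function, and the numbers below are placeholders invented for illustration; they are not results from the paper.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost: float      # cost of one evaluation run, in arbitrary placeholder units
    accuracy: float  # fraction of questions answered correctly


def pareto_frontier(runs):
    """Return the non-dominated runs, ordered from cheapest to most expensive."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; among equal costs, put the most accurate first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        # A run survives only if it beats every cheaper (or equally cheap) run.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Placeholder configurations with made-up numbers (not taken from the paper).
runs = [
    Run("config_a", 1.0, 0.80),
    Run("config_b", 2.0, 0.78),   # dominated by config_a
    Run("config_c", 3.0, 0.90),
    Run("config_d", 3.0, 0.85),   # dominated by config_c
    Run("config_e", 10.0, 0.95),
]
for run in pareto_frontier(runs):
    print(run.name, run.cost, run.accuracy)
```

With real measurements in place of the placeholders, the cheapest frontier points would correspond to the more affordable options and the most expensive ones to the highest-accuracy options.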

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
