From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
2024-11-06
Abstract: Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
Computation and Language
What problem does this paper attempt to address?
This paper evaluates how new-generation large language models (LLMs) perform on medical challenge problems, focusing on how OpenAI's o1-preview model compares with earlier models such as GPT-4 under different run-time strategies. Specifically, it addresses the following questions:

1. **Capabilities of the new model**
   - **Performance of the o1-preview model**: Evaluate o1-preview on multiple medical benchmarks, in particular whether it outperforms the GPT-4 series even without prompting techniques.
   - **Inherent reasoning ability**: Assess whether o1-preview's built-in chain-of-thought (CoT) reasoning reduces the need for external prompt engineering.
2. **Effectiveness of prompting techniques**
   - **Classic prompting techniques**: Systematically study how effective classic prompt engineering strategies (such as Medprompt) remain with reasoning models, o1-preview in particular.
   - **Impact of few-shot prompting**: Examine how few-shot prompting affects o1-preview; the paper finds that it hinders performance.
3. **Efficacy of ensembling**
   - **Effectiveness of ensembling strategies**: Study how much ensembling (for example, voting) improves model performance. The approach remains viable but is resource-intensive.
   - **Cost-performance trade-off**: Analyze cost and accuracy across run-time strategies, revealing a Pareto frontier on which GPT-4o is the more affordable option and o1-preview achieves the best accuracy at higher cost.
4. **New benchmarks**
   - **Multilingual benchmark**: Introduce JMLE-2024, a new multilingual benchmark, to evaluate models on non-English medical problems and to check whether their performance depends on memorizing existing benchmark data.
5. **Future research directions**
   - **Innovation in inference-time strategies**: Explore how to allocate computing resources at inference time to improve the efficiency, accuracy, and reasoning ability of LLMs.

Together, these studies aim to provide insight into understanding and optimizing the use of new-generation LLMs in specialized domains.
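
The voting-based ensembling and the cost-accuracy Pareto frontier mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Strategy` type, the function names, and every cost and accuracy value are hypothetical placeholders chosen for illustration, not results from the paper. The sketch also assumes that ensembling means a simple majority vote over several sampled answers to the same question.

```python
from collections import Counter
from typing import NamedTuple


class Strategy(NamedTuple):
    """One run-time strategy (hypothetical type, for illustration only)."""
    name: str
    cost: float      # placeholder cost per evaluation, arbitrary units
    accuracy: float  # placeholder accuracy in [0, 1]


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def pareto_frontier(strategies: list[Strategy]) -> list[Strategy]:
    """Keep strategies that no other strategy beats on both cost and accuracy."""
    frontier: list[Strategy] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the most
    # accurate strategy first so that weaker ties are filtered out.
    for s in sorted(strategies, key=lambda s: (s.cost, -s.accuracy)):
        if s.accuracy > best_accuracy:
            frontier.append(s)
            best_accuracy = s.accuracy
    return frontier


# Placeholder numbers only: these are NOT measurements from the paper.
candidates = [
    Strategy("model A, zero-shot", 1.0, 0.80),
    Strategy("model A, ensemble", 5.0, 0.85),
    Strategy("model B, zero-shot", 8.0, 0.92),
    Strategy("model B, few-shot", 10.0, 0.90),
    Strategy("model B, ensemble", 40.0, 0.95),
]

if __name__ == "__main__":
    print(majority_vote(["B", "A", "B", "C", "B"]))
    for s in pareto_frontier(candidates):
        print(f"{s.name}: cost={s.cost}, accuracy={s.accuracy}")
```

With these placeholder values, the few-shot variant of model B drops off the frontier because a cheaper strategy is already more accurate. That is the kind of dominated configuration a cost-accuracy analysis is meant to expose.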