Abstract: In this paper, we investigate the effectiveness of a state-of-the-art LLM, i.e., GPT-4, with three different prompt engineering techniques (i.e., basic prompting, in-context learning, and task-specific prompting) against 18 fine-tuned LLMs on three typical automated software engineering (ASE) tasks, i.e., code generation, code summarization, and code translation. Our quantitative analysis of these prompting strategies suggests that prompt engineering GPT-4 does not necessarily or significantly outperform fine-tuning smaller/older LLMs in all three tasks. For comment generation (i.e., code summarization), GPT-4 with the best prompting strategy (i.e., task-specific prompting) outperforms the first-ranked fine-tuned model by 8.33 percentage points on average in BLEU. However, for code generation, the first-ranked fine-tuned model outperforms GPT-4 with the best prompting strategy by 16.61 and 28.3 percentage points on average in BLEU. For code translation, GPT-4 and the fine-tuned baselines tie, as each outperforms the other on different translation tasks. To explore the impact of different prompting strategies, we conducted a user study with 27 graduate students and 10 industry practitioners. From our qualitative analysis, we find that GPT-4 with conversational prompts (i.e., when a human provides feedback and instructions back and forth with the model to achieve the best results) showed drastic improvement over GPT-4 with automatic prompting strategies. Moreover, we observe that participants tend to request improvements, add more context, or give specific instructions as conversational prompts, which goes beyond typical and generic prompting strategies. Our study suggests that, in its current state, GPT-4 with conversational prompting has great potential for ASE tasks, but fully automated prompt engineering with no human in the loop requires more study and improvement.

Assessing the Impact of Prompting Methods on ChatGPT's Mathematical Capabilities

Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models

Metacognitive Prompting Improves Understanding in Large Language Models

A Systematic Review on Prompt Engineering in Large Language Models for K-12 STEM Education

Prompt Space Optimizing Few-shot Reasoning Success with Large Language Models

MathPrompter: Mathematical Reasoning using Large Language Models

Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks

The Unreasonable Effectiveness of Eccentric Automatic Prompts

RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners

Meta Prompting for AI Systems

From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting

Meta Prompting for AGI Systems

Unlocking Structured Thinking in Language Models with Cognitive Prompting

Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

Prompting is not a substitute for probability measurements in large language models

A Communication Theory Perspective on Prompting Engineering Methods for Large Language Models

On Prompt Sensitivity of ChatGPT in Affective Computing

Effects of a Prompt Engineering Intervention on Undergraduate Students' AI Self-Efficacy, AI Knowledge and Prompt Engineering Ability: A Mixed Methods Study

Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks

Does Prompt Formatting Have Any Impact on LLM Performance?