Abstract:In this paper, we investigate the effectiveness of state-of-the-art LLM, i.e., GPT-4, with three different prompting engineering techniques (i.e., basic prompting, in-context learning, and task-specific prompting) against 18 fine-tuned LLMs on three typical ASE tasks, i.e., code generation, code summarization, and code translation. Our quantitative analysis of these prompting strategies suggests that prompt engineering GPT-4 cannot necessarily and significantly outperform fine-tuning smaller/older LLMs in all three tasks. For comment generation, GPT-4 with the best prompting strategy (i.e., task-specific prompt) had outperformed the first-ranked fine-tuned model by 8.33% points on average in BLEU. However, for code generation, the first-ranked fine-tuned model outperforms GPT-4 with best prompting by 16.61% and 28.3% points, on average in BLEU. For code translation, GPT-4 and fine-tuned baselines tie as they outperform each other on different translation tasks. To explore the impact of different prompting strategies, we conducted a user study with 27 graduate students and 10 industry practitioners. From our qualitative analysis, we find that the GPT-4 with conversational prompts (i.e., when a human provides feedback and instructions back and forth with a model to achieve best results) showed drastic improvement compared to GPT-4 with automatic prompting strategies. Moreover, we observe that participants tend to request improvements, add more context, or give specific instructions as conversational prompts, which goes beyond typical and generic prompting strategies. Our study suggests that, at its current state, GPT-4 with conversational prompting has great potential for ASE tasks, but fully automated prompt engineering with no human in the loop requires more study and improvement.

Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT.

Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Knowledge-Prompted Estimator: A Novel Approach to Explainable Machine Translation Assessment

A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization

The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics

Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations

PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation

MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators

Which is better? Exploring Prompting Strategy For LLM-based Metrics

E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models

Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks

GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Model

Prompting Large Language Models with Human Error Markings for Self-Correcting Machine Translation

Prompting Large Language Model for Machine Translation: A Case Study

A Study on Performance Improvement of Prompt Engineering for Generative AI with a Large Language Model

Large Language Models Are State-of-the-Art Evaluators of Translation Quality

Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction

Rethinking ChatGPT's Success: Usability and Cognitive Behaviors Enabled by Auto-regressive LLMs' Prompting

A Closer Look into Using Large Language Models for Automatic Evaluation