Abstract: Context: The emergence of Large Language Models (LLMs) has significantly transformed Software Engineering (SE) by providing innovative methods for analyzing software repositories. Objectives: Our objective is to establish a practical framework for future SE researchers needing to enhance the data collection and dataset while conducting software repository mining studies using LLMs. Method: This experience report shares insights from two previous repository mining studies, focusing on the methodologies used for creating, refining, and validating prompts that enhance the output of LLMs, particularly in the context of data collection in empirical studies. Results: Our research packages a framework, coined Prompt Refinement and Insights for Mining Empirical Software repositories (PRIMES), consisting of a checklist that can improve LLM usage performance, enhance output quality, and minimize errors through iterative processes and comparisons among different LLMs. We also emphasize the significance of reproducibility by implementing mechanisms for tracking model results. Conclusion: Our findings indicate that standardizing prompt engineering and using PRIMES can enhance the reliability and reproducibility of studies utilizing LLMs. Ultimately, this work calls for further research to address challenges like hallucinations, model biases, and cost-effectiveness in integrating LLMs into workflows.

What problem does this paper attempt to address?

This paper addresses the challenges of conducting software repository mining research with large language models (LLMs) in software engineering (SE). Specifically, it aims to establish a practical framework that helps future researchers and practitioners:

1. **Improve data collection and dataset enhancement**: Use LLMs to extract information from software repositories (such as GitHub and Hugging Face) more efficiently and with higher quality.
2. **Standardize prompt engineering**: Provide systematic methods to create, optimize, and validate prompts, so that LLM outputs are more accurate and reliable.
3. **Improve reproducibility and reliability**: Introduce mechanisms to track model results, keep the research reproducible, and reduce problems such as errors and hallucinations.

### Main objectives of the paper

The main objective of the paper is to provide researchers and practitioners with a practical framework for using LLMs in software repository mining research. This framework is called "Prompt Refinement and Insights for Mining Empirical Software Repositories (PRIMES)" and comprises the following four stages:

1. **Creation of Prompts for Piloting**:
   - Define research objectives, select appropriate prompt strategies, and develop initial prompts.
2. **Prompt Pilot Test: Validation and Iterative Refinement of the Prompt on a Single LLM**:
   - Improve the prompt step by step through double-labeling of sample data, statistical validation, and prompt optimization.
3. **Evaluation among multiple LLMs**:
   - Compare the performance of different LLMs, select the model best suited to the specific task, and establish a benchmark.
4. **Output Validation**:
   - Check that the LLM output has the correct format, contains no repetitions or hallucinations, and can be traced back to its information source; automate this validation to improve efficiency and accuracy.
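
The statistical validation in stage 2 can be illustrated with an agreement check between human and LLM labels on the double-labeled sample. The summary does not name the statistic, so Cohen's kappa is used here as an assumption; the function names, the example labels, and the 0.8 threshold are likewise hypothetical and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each labeler's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


def needs_refinement(human_labels, llm_labels, threshold=0.8):
    """True if agreement falls below the (assumed) acceptance threshold."""
    return cohen_kappa(human_labels, llm_labels) < threshold


# Hypothetical pilot sample: six items labeled by a researcher and by the LLM.
human = ["bug", "feature", "bug", "docs", "bug", "feature"]
llm = ["bug", "feature", "docs", "docs", "bug", "feature"]
print(round(cohen_kappa(human, llm), 2))  # 0.75
print(needs_refinement(human, llm))       # True: refine the prompt and re-run
```

In an iterative pilot, a result below the chosen threshold would send the prompt back for another refinement round on a fresh sample.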
### Specific problems solved

- **Erroneous output caused by prompt complexity**: Simplify prompts and provide specific examples so that LLMs interpret the task correctly.
- **Differences between models**: Compare the performance of multiple LLMs and select the model best suited to the specific task.
- **Hallucinations and biases**: Reduce inaccurate output through strict validation procedures and expert evaluation.
- **Cost-effectiveness**: Account for the cost of API calls and select a cost-effective LLM solution.

Through these methods, the paper aims to advance software engineering research and make the application of LLMs more reliable, efficient, and reproducible.
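
The automated output checks described above (correct format, no repetition, traceable source) could be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the JSON schema, the key names, and the `validate_output` function are hypothetical, and the verbatim-evidence check only catches fabricated quotations, not every kind of hallucination.

```python
import json

# Hypothetical schema: each LLM response is one JSON object with these keys.
REQUIRED_KEYS = {"item_id", "label", "evidence"}


def validate_output(raw_response, source_text, seen_ids):
    """Return a list of problems in one LLM response (empty list = passed)."""
    # Format check: the response must parse as a JSON object.
    try:
        record = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["format: response is not valid JSON"]
    if not isinstance(record, dict):
        return ["format: response is not a JSON object"]
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return ["format: missing keys " + ", ".join(sorted(missing))]

    problems = []
    # Repetition check: the same item must not be reported twice.
    item_id = str(record["item_id"])
    if item_id in seen_ids:
        problems.append("duplicate: item_id already processed")
    # Traceability check: the quoted evidence must occur in the source text.
    evidence = record["evidence"]
    if not isinstance(evidence, str) or not evidence or evidence not in source_text:
        problems.append("traceability: evidence not found verbatim in source")
    if not problems:
        seen_ids.add(item_id)
    return problems


source = "This model is released under the MIT license."
seen = set()
ok = '{"item_id": "repo-1", "label": "MIT", "evidence": "released under the MIT license"}'
print(validate_output(ok, source, seen))  # []
print(validate_output(ok, source, seen))  # ['duplicate: item_id already processed']
```

Responses that fail any check would be routed to manual or expert review rather than silently dropped, so that every validation failure stays visible in the study's audit trail.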

A Framework for Using LLMs for Repository Mining Studies in Empirical Software Engineering
