Abstract: Large Language Models (LLMs) have become a focal point of research across various domains, including software engineering, where their capabilities are increasingly leveraged. Recent studies have explored the integration of LLMs into software development tools and frameworks, revealing their potential to enhance performance in text and code-related tasks. The log level is a key part of a logging statement that allows software developers to control the information recorded during system runtime. Given that log messages often mix natural language with code-like variables, LLMs' language translation abilities could be applied to determine the suitable verbosity level for logging statements. In this paper, we undertake a detailed empirical analysis to investigate the impact of characteristics and learning paradigms on the performance of 12 open-source LLMs in log level suggestion. We opted for open-source models because they enable us to utilize in-house code while effectively protecting sensitive information and maintaining data security. We examine several prompting strategies, including zero-shot, few-shot, and fine-tuning techniques, across different LLMs to identify the most effective combinations for accurate log level suggestions. Our research is supported by experiments conducted on 9 large-scale Java systems. The results indicate that although smaller LLMs can perform effectively with appropriate instruction and suitable techniques, there is still considerable potential for improvement in their ability to suggest log levels.

### What problem does this paper attempt to address?

This paper investigates how large language models (LLMs) can be used to suggest appropriate log levels (such as `debug`, `info`, `warn`, and `error`) for logging statements. Specifically, the authors focus on the following key issues:

1. **Automatic suggestion of log levels**:
   - The log level is a crucial part of a logging statement: it determines how much detail is recorded while the system runs. An inappropriate log level may cause important information to be missed or bury it under irrelevant output, which hampers log management and analysis.
   - Because modern software systems generate a huge volume of logs, choosing the appropriate log level by hand is difficult and error-prone, which motivates automating the process.
2. **Evaluating the performance of different LLMs**:
   - The authors selected 12 open-source large language models for their experiments, including general-purpose language models (such as BERT and RoBERTa) and code-specific models (such as CodeBERT and GraphCodeBERT), and evaluated their performance on the log-level suggestion task.
   - The experiments compared several learning paradigms, including zero-shot, few-shot, and fine-tuning, to determine which is the most effective.
3. **Exploring factors affecting model performance**:
   - The study examined how model size, architectural characteristics (such as text-generation models vs. fill-mask models), and context information (such as the source code of the calling method) affect the accuracy of log-level suggestions.
   - In particular, it explored the role of fine-tuning in improving model performance and the influence of additional context information on model output.
4. **Data privacy and security**:
   - Because using proprietary code or sensitive information carries privacy risks, the authors chose open-source models, which can be deployed locally to protect data privacy and maintain data security.

Through this research, the authors aim to provide an effective automated method for log-level suggestion and a useful reference for future research.
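The fill-mask, zero-shot, and few-shot setups mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the prompt wording, the `<mask>` and `<LEVEL>` placeholders, and the exact set of levels in `LOG_LEVELS` are all assumptions made for illustration.

```python
import re

# Assumed label set; the summary names debug, info, warn, and error "etc."
LOG_LEVELS = ("trace", "debug", "info", "warn", "error")
_LEVEL_ALTERNATION = "|".join(LOG_LEVELS)


def mask_log_level(statement: str, mask_token: str = "<mask>") -> str:
    """Hide the level in a call such as `LOG.info(...)`.

    The result can serve as input for a fill-mask model, or as the
    question shown to a text-generation model.
    """
    pattern = r"\.(%s)\(" % _LEVEL_ALTERNATION
    return re.sub(pattern, "." + mask_token + "(", statement,
                  count=1, flags=re.IGNORECASE)


def build_prompt(statement: str, examples=()) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt.

    `examples` is a sequence of (logging_statement, level) pairs.
    """
    lines = [
        "Suggest the log level for the Java logging statement.",
        "Answer with one of: " + ", ".join(LOG_LEVELS) + ".",
    ]
    for example_statement, example_level in examples:
        lines.append("Statement: " + mask_log_level(example_statement, "<LEVEL>"))
        lines.append("Level: " + example_level)
    lines.append("Statement: " + mask_log_level(statement, "<LEVEL>"))
    lines.append("Level:")
    return "\n".join(lines)


def parse_level(model_output: str):
    """Return the first recognised level in the model output, or None."""
    match = re.search(r"\b(%s)\b" % _LEVEL_ALTERNATION, model_output.lower())
    return match.group(1) if match else None
```

Calling `build_prompt(stmt)` with no examples yields a zero-shot prompt, while passing a few `(statement, level)` pairs yields a few-shot prompt. Fine-tuning would instead train a model on such pairs and is outside the scope of this sketch.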

Studying and Benchmarking Large Language Models For Log Level Suggestion
