Abstract: Developers usually insert logging statements into source code to collect system runtime information, which is valuable for software maintenance. A logging statement usually prints one or more variables to record vital system status. However, because rigorous logging guidance is lacking and domain-specific knowledge is required, it is not easy for developers to decide which variables to log. To address this need, we propose an approach that recommends logging variables to developers during development by learning from existing logging statements. Unlike other prediction tasks in software engineering, this task poses two challenges: 1) Dynamic labels: different logging statements have different sets of accessible variables, so the set of possible labels differs from sample to sample. 2) Out-of-vocabulary words: identifier names are not limited to natural language words, and the test set usually contains program tokens that are absent from the vocabulary built from the training set and therefore cannot be mapped to word embeddings appropriately. To deal with the first challenge, we cast the task as a representation learning problem rather than a multi-label classification problem. Given a code snippet that lacks a logging statement, our approach first uses a neural network with an RNN (recurrent neural network) layer and a self-attention layer to learn a representation of each program token, and then predicts whether each token should be logged with a unified binary classifier over the learned representations. To handle the second challenge, we propose a novel method that maps program tokens to word embeddings by making use of the pre-trained word embeddings of natural language tokens. We evaluate our approach on 9 large and high-quality Java projects. Our evaluation results show that the average MAP (mean average precision) of our approach is over 0.84, outperforming random guess and an information-retrieval-based method by large margins.
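The reported metric can be illustrated with a minimal sketch. It assumes the standard information-retrieval definition of mean average precision, which the abstract does not spell out. Each sample carries its own ranked list of accessible variables (the "dynamic labels" setting) and its own set of variables that were actually logged. The function names, variable names, and scores below are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], logged: Set[str]) -> float:
    """Average precision of one ranked candidate list.

    `ranked` holds the accessible variables of a single code snippet,
    ordered from most to least likely to be logged. `logged` is the set
    of variables the developer actually logged in that snippet.
    """
    if not logged:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, name in enumerate(ranked, start=1):
        if name in logged:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(logged)


def mean_average_precision(
    samples: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of the per-sample average precision values."""
    scores = [average_precision(ranked, logged) for ranked, logged in samples]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical samples: the candidate set differs from snippet to snippet.
samples = [
    (["userId", "retryCount", "buffer"], {"userId"}),
    (["tmp", "path", "size", "conn"], {"path", "conn"}),
]
map_score = mean_average_precision(samples)
print(map_score)
```

Because the metric is computed per sample over that sample's own candidates, it does not require a fixed label set, which is why it suits a task where accessible variables vary between logging statements.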

A Semantic-aware Representation Framework for Online Log Analysis

Which Variables Should I Log?

LogEvent2vec: LogEvent-to-Vector Based Anomaly Detection for Large-Scale Logs in Internet of Things

High-precision Online Log Parsing with Large Language Models

Assessing the impact of bag‐of‐words versus word‐to‐vector embedding methods and dimension reduction on anomaly detection from log files

Log2vec: A Heterogeneous Graph Embedding Based Approach for Detecting Cyber Threats within Enterprise

SemLog: A Semantics-based Approach for Anomaly Detection in Big Data System Logs

Biglog: Unsupervised Large-scale Pre-training for a Unified Log Representation

LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs

Summarizing Unstructured Logs in Online Services

Log-based Anomaly Detection Without Log Parsing

A General Framework For Text Semantic Analysis And Clustering On Yelp Reviews

DeepUserLog: Deep Anomaly Detection on User Log Using Semantic Analysis and Key-Value Data

Large-Scale Real-Time Semantic Processing Framework for Internet of Things

A Robust Log Classification Approach Based on Natural Language Processing

LLM-powered Zero-shot Online Log Parsing

Log-based Anomaly Detection based on EVT Theory with feedback

Voxel2vec: A Natural Language Processing Approach to Learning Distributed Representations for Scientific Data

HEDGE: Heterogeneous Semantic Dynamic Graph Framework for Log Anomaly Detection in Digital Service Network

Learning a Semantic Space of Web Search via Session Data

Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies