Abstract: Protecting sensitive information in diagnostic data, such as logs, is a critical concern in the industrial software diagnosis and debugging process. Many tools have been developed to automatically redact logs by identifying and removing sensitive information, but they have severe limitations that can cause over-redaction and loss of critical diagnostic information (false positives), disclosure of sensitive information (false negatives), or both. To address this problem, we argue for a source code analysis approach to log redaction. To identify a log message containing sensitive information, our method locates the corresponding log statement in the source code with logger code augmentation, and checks whether the log statement outputs data from sensitive sources by using the data flow graph built from the source code. Appropriate redaction rules are then applied, depending on the sensitivity of the data sources, to protect private information in the logs. We conducted an experimental evaluation and compared our approach with other popular baselines. The results demonstrate that our approach can significantly improve the precision of sensitive information detection and reduce both false positives and false negatives.

What problem does this paper attempt to address?

The problem this paper addresses is how to protect sensitive information in diagnostic data (such as logs) during the diagnosis and debugging of industrial software. Existing automatic log redaction tools have serious limitations and may lead to over-redaction (false positives) or leakage of sensitive information (false negatives). To solve these problems, the authors propose a log redaction method based on source code analysis.

### Specific problem description

1. **Over-redaction and loss of key diagnostic information (false positives)**:
   - Existing log redaction tools may incorrectly identify non-sensitive information as sensitive and redact it, resulting in the loss of important diagnostic information.
2. **Leakage of sensitive information (false negatives)**:
   - Existing tools may also fail to identify sensitive information, resulting in its leakage.
3. **Limitations of existing methods**:
   - **Rule-based methods**: Rely on predefined rules and struggle to identify unseen or modified types or patterns of sensitive information.
   - **Machine learning methods**: Require a large amount of training data, and the labeling work is heavy and time-consuming.
   - **False positives and false negatives**: Both kinds of methods may generate false positives and false negatives, affecting the accuracy and integrity of diagnostic data.

### Solution

To address the above problems, the authors propose a log redaction framework based on source code analysis. The framework works in the following steps:

1. **Construct a data flow graph**:
   - Extract function information from the source code, construct an abstract syntax tree (AST), and then generate a data flow graph (DFG). Each node represents a variable or statement, and each directed edge represents the direction in which data actually flows.
2. **Store and manage the data flow graph**:
   - Store the constructed data flow graph in an efficient repository for subsequent query and use.
3. **Log analysis and redaction**:
   - Analyze each log message, locate its corresponding log statement, and determine the data source by backtracking through the data flow graph. Apply the corresponding redaction rules according to the sensitivity of the data source.

Through this method, the authors aim to improve the accuracy of sensitive information detection, reduce false positives and false negatives, and thus better protect private information in diagnostic data.
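
To make the backtracking step more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the graph contents, the source names (`db.users.name`, `config.retries`), the `[S1]` statement-ID prefix standing in for logger code augmentation, and the replacement tokens are all hypothetical.

```python
import re
from collections import deque

# Hypothetical data flow graph stored as reverse edges: each key is a node,
# each value lists the nodes whose data flows into it.
DFG_PREDECESSORS = {
    "msg_user": ["user_name"],
    "user_name": ["db.users.name"],     # originates from a sensitive source
    "retry_count": ["config.retries"],  # originates from a non-sensitive source
}

# Hypothetical redaction rules: sensitive source -> replacement token.
REDACTION_RULES = {"db.users.name": "<REDACTED:name>"}

# Hypothetical table assumed to come from logger augmentation:
# statement ID -> (message pattern, variable behind each captured field).
LOG_STATEMENTS = {
    "S1": (r"login failed for (\S+) after (\d+) retries",
           ["msg_user", "retry_count"]),
}


def trace_sources(variable):
    """Walk the graph backwards and return the root sources reached."""
    seen, roots, queue = {variable}, set(), deque([variable])
    while queue:
        node = queue.popleft()
        predecessors = DFG_PREDECESSORS.get(node, [])
        if not predecessors:
            roots.add(node)  # nothing flows into this node, so it is a source
        for pred in predecessors:
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return roots


def redact(log_line):
    """Redact only the fields whose value flows from a sensitive source."""
    header = re.match(r"\[(\w+)\] (.*)", log_line)
    if not header or header.group(1) not in LOG_STATEMENTS:
        return log_line  # unknown statement: left untouched in this sketch
    stmt_id, message = header.groups()
    pattern, variables = LOG_STATEMENTS[stmt_id]
    fields = re.fullmatch(pattern, message)
    if not fields:
        return log_line
    redacted = message
    # Replace from the last field to the first so earlier spans stay valid.
    for index in range(len(variables), 0, -1):
        hits = sorted(s for s in trace_sources(variables[index - 1])
                      if s in REDACTION_RULES)
        if hits:
            start, end = fields.span(index)
            redacted = redacted[:start] + REDACTION_RULES[hits[0]] + redacted[end:]
    return f"[{stmt_id}] {redacted}"


print(redact("[S1] login failed for alice after 3 retries"))
```

In this sketch the user name is replaced because it traces back to a source listed in `REDACTION_RULES`, while the retry count is kept because its only source is not listed. A real system would derive the graph from the AST and query it from the repository rather than hard-coding it.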

Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis
