Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis

Lixi Zhou, Lei Yu, Jia Zou, Hong Min
DOI: https://doi.org/10.1145/3603719.3603734
2024-09-26
Abstract: Protecting sensitive information in diagnostic data, such as logs, is a critical concern in the industrial software diagnosis and debugging process. While there are many tools developed to automatically redact the logs for identifying and removing sensitive information, they have severe limitations which can cause either over redaction and loss of critical diagnostic information (false positives), or disclosure of sensitive information (false negatives), or both. To address the problem, in this paper, we argue for a source code analysis approach for log redaction. To identify a log message containing sensitive information, our method locates the corresponding log statement in the source code with logger code augmentation, and checks if the log statement outputs data from sensitive sources by using the data flow graph built from the source code. Appropriate redaction rules are further applied depending on the sensitiveness of the data sources to preserve the privacy information in the logs. We conducted experimental evaluation and comparison with other popular baselines. The results demonstrate that our approach can significantly improve the detection precision of the sensitive information and reduce both false positives and negatives.
Cryptography and Security
What problem does this paper attempt to address?
This paper addresses how to protect sensitive information in diagnostic data, such as logs, during the diagnosis and debugging of industrial software. Existing automatic log redaction tools have serious limitations: they may over-redact (false positives) or leak sensitive information (false negatives). To solve these problems, the authors propose a log redaction method based on source code analysis.

### Specific problem description

1. **Over-redaction and loss of key diagnostic information (false positives)**:
   - Existing log redaction tools may misidentify non-sensitive information as sensitive and redact it, losing important diagnostic information.
2. **Leakage of sensitive information (false negatives)**:
   - Existing tools may also fail to identify sensitive information, so it is disclosed in the logs.
3. **Limitations of existing methods**:
   - **Rule-based methods**: Rely on predefined rules and struggle to identify unseen or modified types or patterns of sensitive information.
   - **Machine learning methods**: Require a large amount of training data, and labeling it is laborious and time-consuming.
   - **False positives and false negatives**: Both kinds of methods may produce false positives and false negatives, which harms the accuracy and integrity of the diagnostic data.

### Solution

To address these problems, the authors propose a log redaction framework based on source code analysis. The framework works in the following steps:

1. **Construct a data flow graph**:
   - Extract function information from the source code, construct an abstract syntax tree (AST), and then generate a data flow graph (DFG). Each node represents a variable or statement, and each directed edge represents the direction in which data flows.
2. **Store and manage the data flow graph**:
   - Store the constructed data flow graph in an efficient repository so that it can be queried later.
3. **Log analysis and redaction**:
   - Analyze each log message, locate its corresponding log statement, and determine the data sources by backtracking through the data flow graph. Apply the redaction rules that match the sensitivity of each data source.

With this method, the authors aim to improve the accuracy of sensitive information detection, reduce both false positives and false negatives, and thus better protect private information in diagnostic data.
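The three steps can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which language or analysis tooling the paper uses, so Python's standard `ast` module stands in here. The names `SENSITIVE_SOURCES`, `LOG_METHODS`, `SAMPLE`, `build_dfg`, `is_sensitive`, and `redact` are hypothetical, as are the functions `get_username` and `read_password` in the sample program. The graph is flow-insensitive (it ignores statement order) and is kept in a plain dictionary rather than a dedicated repository. The sketch also assumes each log message already carries the line number of the statement that produced it, which is the role logger code augmentation plays in the paper.

```python
import ast

# Hypothetical names for this sketch; none come from the paper.
SENSITIVE_SOURCES = {"read_password"}  # functions treated as sensitive sources
LOG_METHODS = {"debug", "info", "warning", "error"}

SAMPLE = '''
user = get_username()
pw = read_password()
token = pw + "-salt"
log.info("login user=%s token=%s", user, token)
'''

def names_in(node):
    """All identifiers (variables and called functions) appearing under an AST node."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}

def build_dfg(source):
    """Parse source into an AST and derive a flow-insensitive data flow graph.

    edges[var] holds the names var was computed from; log_calls maps the line
    number of each log statement to one name set per logged argument.
    """
    edges, log_calls = {}, {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    edges.setdefault(target.id, set()).update(names_in(node.value))
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Attribute)
              and node.func.attr in LOG_METHODS):
            # Skip the format string; keep the names feeding each remaining argument.
            log_calls[node.lineno] = [names_in(arg) for arg in node.args[1:]]
    return edges, log_calls

def is_sensitive(name, edges, seen=None):
    """Backtrack along DFG edges until a sensitive source is reached or paths run out."""
    seen = set() if seen is None else seen
    if name in SENSITIVE_SOURCES:
        return True
    if name in seen:
        return False
    seen.add(name)
    return any(is_sensitive(dep, edges, seen) for dep in edges.get(name, ()))

def redact(lineno, fmt, values, edges, log_calls):
    """Mask every logged value whose argument traces back to a sensitive source."""
    masked = tuple(
        "<REDACTED>" if any(is_sensitive(n, edges) for n in names) else value
        for names, value in zip(log_calls[lineno], values)
    )
    return fmt % masked

edges, log_calls = build_dfg(SAMPLE)
# The log statement sits on line 5 of SAMPLE (line 1 is the blank line after ''').
print(redact(5, "login user=%s token=%s", ("alice", "hunter2-salt"), edges, log_calls))
# prints: login user=alice token=<REDACTED>
```

In this sketch `token` is redacted because it derives from `pw`, which comes from a sensitive source, while `user` is left intact. A pattern-matching redactor would have to guess the same thing from the text of the message alone.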