RedactBuster: Entity Type Recognition from Redacted Documents

Mirco Beltrame, Mauro Conti, Pierpaolo Guglielmin, Francesco Marchiori, Gabriele Orazi
2024-04-20
Abstract: The widespread exchange of digital documents in various domains has resulted in abundant private information being shared. This proliferation necessitates redaction techniques to protect sensitive content and user privacy. While numerous redaction methods exist, their effectiveness varies, with some proving more robust than others. As such, the literature proposes several deanonymization techniques, raising awareness of potential privacy threats. However, while none of these methods are successful against the most effective redaction techniques, these attacks only focus on the anonymized tokens and ignore the sentence context.
Cryptography and Security
What problem does this paper attempt to address?
The main problem this paper addresses is: **how to mount deanonymization attacks against the most effective document redaction techniques, in order to identify the types of redacted entities and raise awareness of privacy-leakage risks**. Specifically, the authors propose **RedactBuster**, which they describe as the first deanonymization model that uses sentence context to perform named-entity recognition (NER) on redacted text. The model determines the types of redacted entities in a document by fine-tuning state-of-the-art Transformer and deep-learning models. The authors evaluate it on the publicly available Text Anonymization Benchmark (TAB) dataset and report high accuracy (up to 0.985) across different document types and entity types. They also propose a countermeasure called "character evasion" to strengthen the confidentiality of sensitive information, and they open-source the model and test platform so that researchers and practitioners can evaluate the robustness of new redaction techniques and improve document privacy protection.

### Main Contributions

1. **RedactBuster**: The first document deanonymization attack model targeting the most effective redaction techniques. It uses recent machine-learning and deep-learning models to compute sentence embeddings and perform classification.
2. **Evaluation**: The framework is evaluated on the largest publicly available dataset, the Text Anonymization Benchmark (TAB), reaching an accuracy of up to 0.985.
3. **Countermeasure**: A new method called "character evasion", which prevents malicious parties from extracting redacted entity types by swapping specific homographic characters.
4. **Open-source code**: The framework is open-sourced to help researchers and practitioners evaluate the robustness of new redaction techniques and enhance document privacy.
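As a rough illustration of the character evasion idea in contribution 3, the Python sketch below swaps a few Latin letters for visually similar Cyrillic ones. The mapping table, the function names `evade` and `restore`, and the choice to swap every matching character are assumptions made for illustration; this summary does not say which characters the authors swap or where in the sentence they swap them.

```python
# Illustrative homoglyph table (an assumption, not the authors' list):
# lowercase Latin letter -> visually similar Cyrillic letter.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}
REVERSE = {cyrillic: latin for latin, cyrillic in HOMOGLYPHS.items()}


def evade(text: str) -> str:
    """Replace every mapped Latin letter with its look-alike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def restore(text: str) -> str:
    """Undo evade() for text that contained no Cyrillic to begin with."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


# Hypothetical redacted sentence; "[REDACTED]" stands in for a removed entity.
sentence = "The applicant was born in [REDACTED] and lives in [REDACTED]."
evaded = evade(sentence)

print(evaded == sentence)           # False: the code points differ
print(len(evaded) == len(sentence)) # True: one character in, one character out
```

The evaded sentence renders almost identically for a human reader, but a model that was trained on ordinary Latin text receives different code points for the surrounding context, which is presumably what makes the entity type harder to infer.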
### Relevant Background

With the wide use of digital documents across many fields, protecting sensitive information has become particularly important. Many redaction techniques exist for protecting user privacy, but how well they hold up against attack is uncertain. Existing deanonymization attacks focus mainly on the redacted tokens and ignore the context of the sentence. RedactBuster improves deanonymization accuracy by using that sentence context.

### Methodology

- **Dataset**: The TAB dataset, which contains 1,268 English court cases, annotated and redacted.
- **Data Processing**: Pre-processing, feature extraction, and data balancing, so that the model can extract useful features from the text.
- **Model**: A Sentence-BERT model generates sentence embeddings and is fine-tuned to improve performance. Random under-sampling and over-sampling are applied during training to balance the data distribution.
- **Classifier**: Multiple machine-learning and deep-learning models were tested, including Random Forest, Support Vector Machine, and XGBoost.

Through these methods, RedactBuster can effectively identify the types of redacted entities, revealing potential privacy-leakage risks and offering new ideas and technical means for future privacy-protection research.
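The data-balancing step mentioned in the methodology (random under-sampling and over-sampling) can be sketched in plain Python as follows. This is a minimal sketch and not the authors' code: the function name `rebalance`, the `target` size per class, the toy sentences, and the entity-type labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def rebalance(samples, labels, target, seed=0):
    """Randomly under- or over-sample each class to exactly `target` items."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            # Under-sample: draw `target` items without replacement.
            chosen = rng.sample(items, target)
        else:
            # Over-sample: keep every item, then duplicate random ones.
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy, imbalanced data; the labels are illustrative entity types.
sentences = [f"sentence {i}" for i in range(10)]
entity_types = ["PERSON"] * 7 + ["DATETIME"] * 2 + ["LOC"] * 1

balanced_x, balanced_y = rebalance(sentences, entity_types, target=4)
counts = Counter(balanced_y)
print(counts)
```

In a real pipeline this kind of resampling would normally be applied to the training split only, so that duplicated sentences do not leak into evaluation data.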