What problem does this paper attempt to address?

The main problem this paper addresses is: **how to mount deanonymization attacks against the most effective document redaction techniques, in order to identify the types of redacted entities and raise awareness of privacy-leakage risks**. Specifically, the paper proposes a model named **RedactBuster**, described as the first deanonymization model that uses sentence context to perform named-entity recognition (NER) on redacted text. The model determines the types of redacted entities in a document by fine-tuning state-of-the-art Transformer and deep-learning models. It is evaluated on the publicly available Text Anonymization Benchmark (TAB) dataset, where it reaches high accuracy (up to 0.985) across different document types and entity types. The paper also proposes a countermeasure called "character evasion" to strengthen the confidentiality of sensitive information, and it open-sources the model and test platform so that researchers and practitioners can evaluate the robustness of new redaction techniques and improve document privacy protection.

### Main Contributions

1. **Propose RedactBuster**: The first document deanonymization attack model targeting the most effective redaction techniques. It uses recent machine-learning and deep-learning models to compute sentence embeddings and classify them.
2. **Evaluation Framework**: The framework is evaluated on the largest publicly available dataset, the Text Anonymization Benchmark (TAB), achieving an accuracy of up to 0.985.
3. **Propose Countermeasures**: A new method called "character evasion" prevents malicious parties from extracting redacted entity types by swapping specific homographic characters.
4. **Open-source Code**: The framework is open-sourced to help researchers and practitioners evaluate the robustness of new redaction techniques and enhance document privacy.
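
The "character evasion" idea can be illustrated with a minimal sketch. This summary does not say which characters the paper swaps or how they are chosen, so the mapping table, the `evade` function, and its `targets` parameter below are all hypothetical. The sketch only shows the general mechanism of replacing Latin letters with visually similar Cyrillic ones, so that the text looks the same to a reader but differs at the code-point level.

```python
# Hypothetical look-alike table (Latin -> Cyrillic). The characters the
# paper actually swaps are not given in this summary; this is an assumption.
HOMOGLYPHS = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
    "x": "\u0445",
    "y": "\u0443",
}


def evade(text: str, targets: str = "aeo") -> str:
    """Swap every character listed in `targets` for its look-alike."""
    return "".join(
        HOMOGLYPHS[ch] if ch in targets and ch in HOMOGLYPHS else ch
        for ch in text
    )


original = "The applicant was born in Oslo."
evaded = evade(original)

# The two strings render almost identically but are different sequences
# of code points, which changes how a tokenizer splits the sentence.
print(original == evaded)
print(len(original) == len(evaded))
```

Because the swap is one character for one character, the length of the text is unchanged. Whether a given model is actually misled by such a substitution depends on its tokenizer and training data, which this sketch does not test.
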

### Relevant Background

With digital documents now used widely across many fields, protecting sensitive information has become increasingly important. Many redaction techniques exist to protect user privacy, but how secure they are remains uncertain. Existing deanonymization attacks mainly focus on the redacted tokens themselves and ignore the surrounding sentence context. RedactBuster improves deanonymization accuracy by exploiting that context.

### Methodology

- **Dataset**: The TAB dataset, which contains 1,268 English court cases, annotated and redacted.
- **Data Processing**: Pre-processing, feature extraction, and data balancing, so that the model can extract useful features from the text.
- **Model**: The Sentence-BERT model generates sentence embeddings, and fine-tuning improves its performance. Random under-sampling and over-sampling were applied during training to balance the data distribution.
- **Classifier**: Several machine-learning and deep-learning models were tested, including Random Forest, Support Vector Machine, and XGBoost.

Through these methods, RedactBuster can effectively identify the types of redacted entities, thereby revealing potential privacy-leakage risks and offering new ideas and technical means for future privacy-protection research.
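
The data-balancing step can be sketched as follows. This is a minimal, standard-library illustration of random under-sampling and over-sampling, not the authors' implementation: the `rebalance` function, the `target_per_class` parameter, and the toy sentences and labels are assumptions made for the example. The embedding and classification stages are left out.

```python
import random
from collections import Counter, defaultdict


def rebalance(sentences, labels, target_per_class, seed=0):
    """Return a dataset with exactly `target_per_class` samples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        by_label[label].append(sentence)

    out_sentences, out_labels = [], []
    for label, items in by_label.items():
        if len(items) >= target_per_class:
            # Under-sampling: keep a random subset, without replacement.
            chosen = rng.sample(items, target_per_class)
        else:
            # Over-sampling: keep everything and add random duplicates.
            extra = rng.choices(items, k=target_per_class - len(items))
            chosen = items + extra
        out_sentences.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_sentences, out_labels


# Toy redacted sentences; the label is the type of the hidden entity.
sentences = [
    "Mr *** lodged the application.",
    "The applicant, ***, was arrested.",
    "*** appealed against the decision.",
    "The judge heard *** in person.",
    "*** was represented by a lawyer.",
    "The hearing took place on ***.",
    "He was released on ***.",
    "The applicant lives in ***.",
]
labels = ["PERSON"] * 5 + ["DATETIME"] * 2 + ["LOC"]

balanced_sentences, balanced_labels = rebalance(sentences, labels, target_per_class=3)
print(Counter(balanced_labels))
```

In this toy run the majority class is cut down to three samples and the two minority classes are padded up to three by duplication. In practice the balanced sentences would then be embedded and passed to a classifier.
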

RedactBuster: Entity Type Recognition from Redacted Documents
