Privacy BERT-LSTM: a novel NLP algorithm for sensitive information detection in textual documents
Janani Muralitharan,Chandrasekar Arumugam
DOI: https://doi.org/10.1007/s00521-024-09707-w
2024-08-25
Neural Computing and Applications
Abstract: The growing volume of textual data and the widespread adoption of natural language processing (NLP) techniques make safeguarding privacy-sensitive information a critical challenge. There is therefore a pressing need for robust and accurate NLP-based techniques that detect sensitive information in textual data efficiently. This paper addresses the detection and classification of privacy-sensitive information in textual documents by proposing a novel algorithm named Privacy BERT-LSTM. The algorithm employs BERT to obtain contextual embeddings and an LSTM to process sequential information. BERT, with its bidirectional characteristics, captures the nuances and meaning of the text, while the LSTM captures long-range dependencies in the textual data. In addition, an attention mechanism highlights the important regions of a document, which further supports detection. A comprehensive performance evaluation is conducted on the SMS Spam Collection dataset using standard performance metrics, and the algorithm is compared with several state-of-the-art techniques, namely CASSED, PRIVAFRAME, CNN-LSTM, Conv-FFD, GCSA, TSIIP, and C-PIIM. The experimental results show that Privacy BERT-LSTM outperforms these baseline models in identifying various types of sensitive information, achieving an accuracy of 92.50%, an F1-score of 85.02%, and a precision of 89.36%.
Therefore, this research contributes to the advancement of privacy protection in NLP applications and opens avenues for future investigations in the domain of sensitive information detection. Additionally, the proposed algorithm provides valuable insights for researchers and practitioners working on privacy-sensitive NLP tasks.
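The attention step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the form of the attention mechanism, so the code below assumes simple dot-product attention pooling over a sequence of hidden states, which stand in for the LSTM outputs computed from BERT embeddings. The names `softmax`, `attention_pool`, `hidden_states`, and `query` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Dot-product attention pooling (an assumed form, not taken from the paper).

    hidden_states: list of per-token vectors (stand-ins for LSTM outputs).
    query: a vector of the same dimension that scores each token.
    Returns (weights, pooled): weights sum to 1, and pooled is the
    weighted sum of the hidden states.
    """
    # Score each token by its dot product with the query vector.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum across tokens, one output value per dimension.
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, pooled


# Toy values only: three tokens with two-dimensional hidden states.
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
weights, pooled = attention_pool(hidden_states, query)
```

In this toy input the second token has the largest score, so it receives the largest weight and dominates the pooled vector. In a full model, a classifier layer would consume the pooled vector, and the query would typically be a learned parameter.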
computer science, artificial intelligence