PassTSL: Modeling Human-Created Passwords through Two-Stage Learning

Yangde Wang, Haozhang Li, Weidong Qiu, Shujun Li, Peng Tang
DOI: https://doi.org/10.1007/978-981-97-5101-3_22
2024-07-19
Abstract: Textual passwords are still the most widely used user authentication mechanism. Due to the close connections between textual passwords and natural languages, advanced technologies in natural language processing (NLP) and machine learning (ML) could be used to model passwords for different purposes such as studying human password-creation behaviors and developing more advanced password cracking methods for informing better defence mechanisms. In this paper, we propose PassTSL (modeling human-created Passwords through Two-Stage Learning), inspired by the popular pretraining-finetuning framework in NLP and deep learning (DL). We report how different pretraining settings affected PassTSL and proved its effectiveness by applying it to six large leaked password databases. Experimental results showed that it outperforms five state-of-the-art (SOTA) password cracking methods on password guessing by a significant margin ranging from 4.11% to 64.69% at the maximum point. Based on PassTSL, we also implemented a password strength meter (PSM), and our experiments showed that it was able to estimate password strength more accurately, causing fewer unsafe errors (overestimating the password strength) than two other SOTA PSMs when they produce the same rate of safe errors (underestimating the password strength): a neural-network based method and zxcvbn. Furthermore, we explored multiple finetuning settings, and our evaluations showed that even a small amount of additional training data, e.g., only 0.1% of the pretrained data, can lead to over 3% improvement in password guessing on average. We also proposed a heuristic approach to selecting finetuning passwords based on JS (Jensen-Shannon) divergence, and experimental results validated its usefulness. In summary, our contributions demonstrate the potential and feasibility of applying advanced NLP and ML methods to password modeling and cracking.
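The JS-divergence heuristic mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which password features the divergence is computed over, so the character-frequency distributions used here are an assumption, and the function names (`char_distribution`, `js_divergence`, `pick_closest`) are hypothetical. The sketch only shows the general idea of ranking candidate finetuning sets by how close their distribution is to a sample from the target; it is not the authors' procedure.

```python
import math
from collections import Counter


def char_distribution(passwords):
    """Relative frequency of each character across a non-empty list of passwords.

    Assumption: character unigrams stand in for whatever features the paper uses.
    """
    counts = Counter(ch for pw in passwords for ch in pw)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the value lies in [0, 1])."""
    support = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint support.
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}

    def kl(a, b):
        # Kullback-Leibler divergence KL(a || b); terms with a[x] == 0 contribute 0.
        return sum(a[x] * math.log2(a[x] / b[x]) for x in a if a[x] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pick_closest(target_sample, candidates):
    """Return the name of the candidate set with the lowest JS divergence, plus all scores."""
    target = char_distribution(target_sample)
    scores = {
        name: js_divergence(target, char_distribution(pws))
        for name, pws in candidates.items()
    }
    return min(scores, key=scores.get), scores
```

A lower score means the candidate set looks more like the target sample under the chosen features, which is the property a selection heuristic of this kind would rely on.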
Cryptography and Security, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper asks how natural language processing (NLP) and machine learning (ML) techniques can be used to model human-created textual passwords, both to study how people create passwords and to develop more advanced password-cracking methods that inform better defense mechanisms. To that end, the authors propose PassTSL, which models human-created passwords through two-stage learning. The main problems are:

1. **Improve password-cracking performance**: Existing password-cracking methods have limited effectiveness on large-scale data. The authors use deep learning and NLP techniques to improve it.
2. **Study human password-creation behavior**: Understand how humans create passwords in order to design more secure password policies and defense mechanisms.
3. **Evaluate password strength**: Develop an accurate password strength meter (PSM) to help users create more secure passwords.

### Specific problems and solutions

#### 1. Improve password-cracking performance

- **Problem**: Existing password-cracking methods, such as Markov models, PCFG (probabilistic context-free grammar), and RNNs, have limitations when processing large-scale data, especially in capturing long-distance dependencies and complex patterns.
- **Solution**: The authors propose PassTSL, which follows a pretraining-finetuning framework and uses the self-attention mechanism of the Transformer. The model is pretrained on a large leaked password database and then finetuned on the target database to improve password-guessing accuracy.

#### 2. Study human password-creation behavior

- **Problem**: Understanding how humans create passwords is crucial for designing more effective defense mechanisms, but existing methods struggle to capture these behaviors comprehensively.
- **Solution**: By analyzing how different pretraining settings affect PassTSL, the authors show how large-scale data can be used to better understand human password-creation behavior, and verify that mixed-language passwords are effective for modeling English passwords.

#### 3. Evaluate password strength

- **Problem**: Existing password strength meters (such as zxcvbn) are effective, but may underestimate or overestimate password strength in some cases.
- **Solution**: Based on PassTSL, the authors design a lightweight PSM and show experimentally that it estimates password strength more accurately than two other state-of-the-art PSMs (an FLA-based PSM and zxcvbn), causing fewer unsafe errors (overestimating the password strength) at the same rate of safe errors (underestimating the password strength).

### Experimental results

The authors evaluate PassTSL through extensive experiments:

- On six large leaked password databases, PassTSL outperforms five state-of-the-art password-cracking methods on password guessing by a significant margin, ranging from 4.11% to 64.69% at the maximum point.
- The PSM based on PassTSL estimates password strength more accurately and overestimates password strength less often.

In conclusion, the paper addresses these problems by applying advanced NLP and ML techniques, in particular the pretraining-finetuning framework, and demonstrates their potential and feasibility for password modeling and cracking.
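
The safe and unsafe error notions used in the PSM evaluation can be made concrete with a short sketch. It assumes a simplified binary setting in which a password counts as "strong" when its guess number reaches a chosen threshold; the function name `count_psm_errors`, the threshold, and the sample numbers are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def count_psm_errors(reference_guesses, estimated_guesses, threshold):
    """Count unsafe and safe errors of a strength meter against reference guess numbers.

    unsafe error: the meter rates a password strong although the reference says weak
                  (overestimation).
    safe error:   the meter rates a password weak although the reference says strong
                  (underestimation).
    Assumption: "strong" means guess number >= threshold.
    """
    unsafe = 0
    safe = 0
    for ref, est in zip(reference_guesses, estimated_guesses):
        ref_strong = ref >= threshold
        est_strong = est >= threshold
        if est_strong and not ref_strong:
            unsafe += 1
        elif ref_strong and not est_strong:
            safe += 1
    return {"unsafe": unsafe, "safe": safe}


# Hypothetical guess numbers for four passwords.
reference = [10, 10**8, 10**3, 10**9]
estimated = [10**7, 10**2, 10**2, 10**10]
errors = count_psm_errors(reference, estimated, threshold=10**6)
```

In this toy data the first password is overestimated (one unsafe error) and the second is underestimated (one safe error). Comparing meters at a matched safe-error rate, as the summary describes, then amounts to comparing their unsafe-error counts.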