TempCharBERT: Keystroke Dynamics for Continuous Access Control Based on Pre-trained Language Models

Matheus Simão, Fabiano Prado, Omar Abdul Wahab, Anderson Avila
2024-11-12
Abstract: With the widespread adoption of digital environments, reliable authentication and continuous access control have become crucial. They can minimize cyber attacks and prevent fraud, especially fraud associated with identity theft. Of particular interest is keystroke dynamics (KD), which refers to the task of recognizing individuals' identity based on their unique typing style. In this work, we propose the use of pre-trained language models (PLMs) to recognize such patterns. Although PLMs have shown high performance on multiple NLP benchmarks, the use of these models on specific tasks requires customization. BERT and RoBERTa, for instance, rely on subword tokenization, and they cannot be directly applied to KD, which requires temporal-character information to recognize users. Recent character-aware PLMs are able to process both subword and character-level information and can be an alternative solution. Nevertheless, they are still not suitable to be directly fine-tuned for KD, as they are not optimized to account for users' temporal typing information (e.g., hold time and flight time). To overcome this limitation, we propose TempCharBERT, an architecture that incorporates temporal-character information in the embedding layer of CharBERT. This allows modeling keystroke dynamics for the purpose of user identification and authentication. Our results show a significant improvement with this customization. We also show the feasibility of training TempCharBERT in a federated learning setting in order to foster data privacy.
Cryptography and Security, Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to use pre-trained language models (PLMs) for user identification and continuous access control based on keystroke dynamics (KD). Specifically, the authors propose a new architecture named TempCharBERT to overcome the deficiencies of existing character-aware pre-trained language models (such as CharBERT) when dealing with keystroke dynamics tasks.

### Core Problems of the Paper

1. **Limitations of Existing Models**:
   - Most pre-trained language models (such as BERT and RoBERTa) rely on subword tokenization, so they cannot be directly applied to keystroke dynamics tasks, which require capturing time information.
   - Even the recent character-aware pre-trained language models (such as CharBERT and CharacterBERT), although they can handle subword and character-level information, are still not optimized to handle users' keystroke time information (such as hold time and flight time).
2. **Solutions**:
   - A new architecture, TempCharBERT, is proposed. Based on CharBERT, it introduces a temporal-character encoder that incorporates the keystroke time information (such as hold time and flight time) into the embedding layer.
   - In this way, TempCharBERT can more accurately capture users' typing patterns, thereby improving the accuracy of user identification and authentication.

### Experimental Verification

To verify the effectiveness of TempCharBERT, the authors conducted the following experiments:

1. **User Identification**:
   - The results show that the accuracy of TempCharBERT in the user identification task reaches 90.14%, which is significantly better than CharBERT (59.26%) and other baseline models.
2. **User Authentication**:
   - In the user authentication task, the equal error rate (EER) of TempCharBERT is only 0.0022, far lower than that of other models, such as SVM (0.0822) and LSTM (0.0498).
3. **Feasibility in Federated Learning Settings**:
   - The research also explored the feasibility of training TempCharBERT in a federated learning environment to protect user data privacy. The results show that even in a distributed training environment, TempCharBERT still performs strongly.

### Formula Summary

The key formulas involved in the paper are as follows:

- Calculation of character embedding vectors:
  \[ e_{i,j} = W_c \ast c_{i,j} \]
  where \( W_c \) is the character embedding matrix, and \( c_{i,j} \) is the \( j \)-th character in the \( i \)-th subword.
- Calculation of time-information embedding vectors:
  \[ u_{i,j} = T_c \ast d_{i,j} + T_c \ast f_{i,j} \]
  where \( T_c \) is the time embedding matrix, and \( d_{i,j} \) and \( f_{i,j} \) represent the hold time and flight time of the \( j \)-th character in the \( i \)-th subword, respectively.
- Hidden state of the bidirectional GRU output:
  \[ h_{i,j}(x) = \text{Bi-GRU}(e_{i,j} + u_{i,j}) \]
- Final word-level embedding:
  \[ h_i(x) = [h_{i,1}(x); h_{i,n_i}(x)] \]

Through these improvements, TempCharBERT can better capture users' typing characteristics, thus performing well in user identification and authentication tasks.
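
To make the first two formulas concrete, here is a minimal pure-Python sketch of how the Bi-GRU input \( e_{i,j} + u_{i,j} \) could be assembled from raw key events. It is an illustration under stated assumptions, not the authors' implementation. The summary does not define flight time, so the sketch assumes one common convention (previous key release to next key press, with 0 for the first key). It also treats \( T_c \) as a single vector that scales the scalar times \( d \) and \( f \), which is only one possible reading of the formula. The function names, the event format, and the toy values are all hypothetical, and the Bi-GRU itself is omitted.

```python
def hold_and_flight_times(events):
    """Derive per-key timing features from keystroke events.

    events: list of (char, press_time, release_time) in typing order.
    Assumption: hold time = release - press of the same key;
    flight time = press of this key - release of the previous key
    (0.0 for the first key). Other conventions exist.
    """
    holds, flights = [], []
    prev_release = None
    for _char, press, release in events:
        holds.append(release - press)
        flights.append(0.0 if prev_release is None else press - prev_release)
        prev_release = release
    return holds, flights


def temporal_char_inputs(events, vocab, W_c, T_c):
    """Compute e + u for every typed character.

    W_c: character embedding matrix as a list of rows, one row per character.
    T_c: assumed here to be one vector with the same width as a W_c row.
    """
    holds, flights = hold_and_flight_times(events)
    inputs = []
    for (char, _press, _release), d, f in zip(events, holds, flights):
        e = W_c[vocab[char]]              # e = W_c * c, a row lookup for a one-hot c
        u = [t * d + t * f for t in T_c]  # u = T_c * d + T_c * f
        inputs.append([e_k + u_k for e_k, u_k in zip(e, u)])
    return inputs


# Toy example: the user types "hi"; times are in seconds.
events = [("h", 0.0, 0.125), ("i", 0.25, 0.375)]
vocab = {"h": 0, "i": 1}
W_c = [[1.0, 2.0], [3.0, 4.0]]
T_c = [1.0, -1.0]
gru_inputs = temporal_char_inputs(events, vocab, W_c, T_c)
```

In a full model, `gru_inputs` would be passed through a bidirectional GRU, and the hidden states of the first and last characters of each subword would be concatenated to form \( h_i(x) \).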