USE OF LATENT-SEMANTIC ANALYSIS IN PREPARATION OF DATA FOR IDENTIFICATION OF ANONYMOUS USERS BY DIGITAL FINGERPRINTS

OLEG I. SHELUHIN, ANNA V. VANYUSHINA, MAKSIM S. ZHELNOV
DOI: https://doi.org/10.36724/2409-5419-2022-14-1-36-44
2022-01-01
H&ES Research
Abstract: Digital fingerprints change over time as a result of updates to the system, plugins, and browsers and the installation of various programs and fonts. This drift is a serious problem for methods that track and identify users through the browser (web browser fingerprinting, FP). The set of parsed attributes can contain both metric and categorical (mostly non-numeric) values, for example, parameters such as user-agent, webgl, canvas, etc. These values therefore have to be encoded before further processing. For this purpose, artificial intelligence technologies, including natural language processing (NLP), are widely used. The aim of the research is to analyze the peculiarities of implementing latent semantic analysis (LSA) in the preparation and analysis of FP data for the identification of anonymous users. Methods. A comparative analysis is carried out of common ways of converting categorical values of fingerprint (FP) attributes into numeric ones, namely One-Hot-Encoding, Label-Encoder, and LSA, for identifying anonymous users when the number of possible values of the categorical features is predetermined. Results. The advantage of the LSA algorithm over One-Hot-Encoding and Label-Encoder is shown. It is shown that clustering can be implemented within the user identification problem by visualizing FPs relative to hidden semantic topics using the LSA model. It is shown that, with a small number of hidden topics, the proposed model confidently assigns an input FP to a common topic. With the obtained object vectors and term vectors used to assess the similarity of two FPs, it becomes possible to apply various measures of cluster proximity: Euclidean distance, cosine measure, etc.
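The LSA encoding summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy fingerprints, the token format (e.g. `ua:firefox`), and the helper names `build_term_matrix`, `lsa_embed`, and `cosine` are all hypothetical, and the truncated SVD is computed with NumPy as one common way to realize LSA. Each fingerprint is treated as a "document" whose "terms" are its categorical attribute values. The resulting object vectors are then compared with the cosine measure and Euclidean distance.

```python
import numpy as np

# Hypothetical toy fingerprints: each one is a bag of categorical attribute tokens.
fingerprints = [
    ["ua:firefox", "os:linux", "webgl:mesa", "font:dejavu"],
    ["ua:firefox", "os:linux", "webgl:mesa", "font:liberation"],
    ["ua:chrome", "os:windows", "webgl:angle", "font:segoe"],
    ["ua:chrome", "os:windows", "webgl:angle", "font:calibri"],
]

def build_term_matrix(docs):
    """Build an object-by-term count matrix (rows: fingerprints, columns: tokens)."""
    vocab = sorted({token for doc in docs for token in doc})
    index = {token: i for i, token in enumerate(vocab)}
    matrix = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for token in doc:
            matrix[row, index[token]] += 1.0
    return matrix, vocab

def lsa_embed(matrix, n_topics):
    """Truncated SVD: keep the n_topics largest singular directions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    object_vecs = u[:, :n_topics] * s[:n_topics]  # fingerprints in topic space
    term_vecs = vt[:n_topics].T                   # attribute tokens in topic space
    return object_vecs, term_vecs

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

X, vocab = build_term_matrix(fingerprints)
object_vecs, term_vecs = lsa_embed(X, n_topics=2)

# Fingerprints 0 and 1 differ only in one font; 0 and 2 share no attributes.
sim_same = cosine(object_vecs[0], object_vecs[1])
sim_diff = cosine(object_vecs[0], object_vecs[2])
dist_same = float(np.linalg.norm(object_vecs[0] - object_vecs[1]))
dist_diff = float(np.linalg.norm(object_vecs[0] - object_vecs[2]))
print(sim_same, sim_diff, dist_same, dist_diff)
```

In this toy setup the two fingerprints that differ only in one font land close together in the two-topic space, while fingerprints with no shared attributes stay far apart. This is the kind of tolerance to small attribute changes that One-Hot-Encoding and Label-Encoder do not provide on their own.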