Classification of Smartphone Users Using Internet Traffic

Andrey Finkelstein, Ron Biton, Rami Puzis, Asaf Shabtai
DOI: https://doi.org/10.48550/arXiv.1701.00220
2017-01-01
Abstract: Today, smartphone devices are owned by a large portion of the population and have become a very popular platform for accessing the Internet. Smartphones provide the user with immediate access to information and services. However, they can easily expose the user to many privacy risks. Applications that are installed on the device and entities with access to the device's Internet traffic can reveal private information about the smartphone user and steal sensitive content stored on the device or transmitted by the device over the Internet. In this paper, we present a method to reveal various demographics and technical computer skills of smartphone users by their Internet traffic records, using machine learning classification models. We implement and evaluate the method on real life data of smartphone users and show that smartphone users can be classified by their gender, smoking habits, software programming experience, and other characteristics.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
This paper uses machine-learning classification models to infer demographic characteristics and technical computer skills of smartphone users from their Internet traffic records. Specifically, the authors aim to answer the following questions:

1. **Privacy Risks**: Does a smartphone user's Internet traffic contain information from which personal characteristics can be inferred? Such characteristics may include the user's gender, age group, educational background, and programming experience.
2. **Classification Ability**: Can users be accurately classified from Internet traffic data, for example by gender, smoking habits, or programming experience?
3. **Feature Importance**: Which types of features (domain-name, application-layer, statistical, or deep-packet-inspection features) matter most in the classification process?

### Main Research Contents

- **Data Collection**: The authors collected the Internet traffic of 143 smartphone users in an experiment and asked the participants to fill in questionnaires about their demographic characteristics and technical skills.
- **Feature Extraction**: Four types of features were extracted from the Internet traffic records:
  - **Statistical Features**: Such as packet-size statistics of transmitted and received data and the number of bytes in a session.
  - **Application-Layer Features**: Such as the SSL/TLS version used in HTTP/HTTPS traffic, the number of cookies, and the Content-Type field.
  - **Domain-Name Features**: Such as Alexa ranking, WoT security score, and website category.
  - **Deep-Packet-Inspection Features**: Such as the number of HTTP forms and the presence of email addresses, usernames, and password fields.
- **Machine-Learning Models**: Supervised learning was used to train the classification models, mainly with the Random Forest (RF) and Extra Trees (ET) algorithms, and model performance was evaluated with leave-one-out cross-validation.

### Research Results

- **Classification Accuracy**: Some user characteristics can be classified fairly accurately from Internet traffic data. For example, gender is classified with 83.9% accuracy and programming experience with 77.8% accuracy.
- **Feature Importance Analysis**: Domain-name features dominate the classification models, indicating that the types of websites a user visits, and how often, strongly influence the classification results.

### Conclusions and Prospects

- **Privacy Risks**: Internet traffic does contain content that can be used to infer personal information about users, which may pose a threat to their privacy.
- **Future Work**: The authors plan to increase the size and diversity of the sample and to explore other classifications, such as users' network-security scores.

Through this research, the authors emphasize that Internet traffic data is privacy-sensitive and call for measures to prevent potential privacy leakage.
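
The evaluation setup summarized above (Random Forest and Extra Trees classifiers, leave-one-out cross-validation, and a comparison of feature families) can be illustrated with a minimal sketch. It is not the authors' code. The use of scikit-learn, the synthetic data, the three-columns-per-family layout in `FEATURE_GROUPS`, the helper names, and the `n_estimators` value are all assumptions made for illustration. Only the two algorithms, the validation scheme, and the four feature-family names come from the summary.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Hypothetical column layout: three columns per feature family.
FEATURE_GROUPS = {
    "statistical": [0, 1, 2],
    "application_layer": [3, 4, 5],
    "domain_name": [6, 7, 8],
    "deep_packet_inspection": [9, 10, 11],
}


def make_synthetic_users(n_users=40, seed=0):
    """Build placeholder data: one feature row and one binary label per user."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_users, 12))
    y = np.arange(n_users) % 2  # alternating binary label, e.g. a yes/no trait
    # Assumption for the demo: only the domain-name columns carry signal.
    X[:, 6:9] += 1.5 * y[:, None]
    return X, y


def loo_accuracy(model, X, y):
    """Leave-one-out CV: each user is predicted by a model trained on all others."""
    predictions = cross_val_predict(model, X, y, cv=LeaveOneOut())
    return accuracy_score(y, predictions)


def group_importance(model, X, y):
    """Fit on all users and sum impurity-based importances per feature family."""
    model.fit(X, y)
    importances = model.feature_importances_
    return {
        name: float(importances[columns].sum())
        for name, columns in FEATURE_GROUPS.items()
    }


if __name__ == "__main__":
    X, y = make_synthetic_users()
    models = {
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
        "ET": ExtraTreesClassifier(n_estimators=50, random_state=0),
    }
    for label, model in models.items():
        print(label, "leave-one-out accuracy:", loo_accuracy(model, X, y))
        print(label, "importance by family:", group_importance(model, X, y))
```

Leave-one-out cross-validation suits a small cohort such as the 143 participants, because every user is used for testing exactly once while the model is trained on all the others. Summing importances per family is one simple way to compare feature types. The summary does not say how the authors computed feature importance.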