Smishing Dataset I: Phishing SMS Dataset from Smishtank.com

Daniel Timko, Muhammad Lutfor Rahman
DOI: https://doi.org/10.1145/3626232.3653282
2024-04-29
Abstract: While smishing (SMS phishing) attacks have risen to become one of the most common types of social engineering attacks, there is a lack of relevant smishing datasets. One of the biggest challenges in the domain of smishing prevention is the availability of fresh smishing datasets. Additionally, as time passes, smishing campaigns are shut down and the crucial information related to the attack is lost. With the changing nature of smishing attacks, a consistent flow of new smishing examples is needed by both researchers and engineers to create effective defenses. In this paper, we present the community-sourced smishing datasets from the
Cryptography and Security
What problem does this paper attempt to address?
This paper introduces "Smishing Dataset I: Phishing SMS Dataset from Smishtank.com", a dataset that aims to address the lack of data on smishing attacks. Smishing is a type of social engineering attack conducted through SMS, and it has become a common threat. Because attackers constantly change their strategies, researchers and engineers need up-to-date smishing data to develop effective defenses. The main contributions of the paper are as follows:

1. It provides a publicly available dataset of 1062 smishing samples sourced from community submissions, including sender information, message contents, mentioned brands, and URL-related analysis.
2. The dataset is categorized and parsed based on community submissions, generating relevant data fields.
3. Message metadata, such as VirusTotal results and domain WHOIS information, is collected at the time of submission, enabling researchers to better understand fresh smishing messages.

The research methodology involves using OCR to extract text from SMS screenshots, parsing message senders, extracting URLs, and running WHOIS requests and VirusTotal scans to obtain historical information. In addition, the messages are classified, URL categories and top-level domains are analyzed, and brand information is studied.

The paper also discusses related work, emphasizing the importance of new data in preventing smishing attacks. It points out limitations of current data: websites associated with an attack are often already shut down by the time outdated samples are studied, and some researchers treat smishing as merely a part of spam.

Finally, the paper describes the distribution of the dataset, including sender types, brand mentions, message categories, URL categories, and top-level domains, and provides statistics on VirusTotal detection scores. The dataset serves as a new resource for academia and industry to combat smishing attacks.