CRATOR: a Dark Web Crawler

Daniel De Pascale, Giuseppe Cascavilla, Damian A. Tamburri, Willem-Jan Van Den Heuvel
2024-05-10
Abstract: Dark web crawling is a complex process that involves specific methodologies and techniques to navigate the Tor network and extract data from hidden services. This study proposes a general dark web crawler designed to efficiently extract pages that handle security protocols, such as captchas. Our approach uses a combination of seed URL lists, link analysis, and scanning to discover new content. We also incorporate methods for user-agent rotation and proxy usage to maintain anonymity and avoid detection. We evaluate the effectiveness of our crawler using metrics such as coverage, performance, and robustness. Our results demonstrate that our crawler effectively extracts pages that handle security protocols while maintaining anonymity and avoiding detection. Our proposed dark web crawler can be used for various applications, including threat intelligence, cybersecurity, and online investigations.
Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the difficulty of crawling the dark web, in particular pages protected by security measures such as login forms and captchas. Existing dark web crawlers struggle to preserve anonymity, to navigate an unstructured network, and to extract data efficiently. The authors present CRATOR, a general-purpose crawler that combines seed URL lists, link analysis, and scanning to discover new content, and that uses user-agent rotation and proxies to maintain anonymity and avoid detection. Its design accounts for the unstructured nature of the dark web and for the need to handle login forms and simple captchas.

The paper describes the architecture of CRATOR, including its key components: link validity checking, login form handling, captcha detection, connection settings, and stopping conditions. The evaluation compares CRATOR with the existing open-source crawler ACHE on coverage, performance, and robustness, and reports that CRATOR performs better in terms of the number of downloaded pages, execution time, and error handling.

The contribution is a crawler architecture dedicated to the dark web that can support applications such as threat intelligence, cybersecurity, and online investigations. Future work may include further improving the anonymity and adaptability of the crawler and extending its capabilities for dark web data collection and analysis.
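To make the components named above more concrete, the following is a minimal sketch of a seed-driven crawl loop with a link validity check, a captcha check, user-agent rotation, and stopping conditions. It is not CRATOR's implementation: the function names, the keyword-based captcha heuristic, the user-agent strings, and the `max_pages`/`max_depth` limits are all assumptions made for illustration. The `fetch` callable is injected so the sketch stays network-free; a real one would presumably route requests through a Tor SOCKS proxy.

```python
import re
from collections import deque
from itertools import cycle
from urllib.parse import urljoin, urlparse

# v3 onion addresses are 56 base32 characters (a-z, 2-7) followed by ".onion";
# optional subdomain labels may precede the address.
ONION_HOST = re.compile(r"^(?:[a-z0-9-]+\.)*[a-z2-7]{56}\.onion$")
# Naive href extraction; a real crawler would use an HTML parser.
HREF = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

# Hypothetical user-agent pool, rotated on every request.
USER_AGENTS = cycle([
    "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
])


def is_valid_onion_link(url):
    """Link validity check: http(s) scheme and a v3 .onion host."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and bool(ONION_HOST.match(host))


def extract_links(base_url, html):
    """Resolve every href against base_url and keep only valid onion links."""
    links = []
    for href in HREF.findall(html):
        absolute = urljoin(base_url, href).split("#", 1)[0]  # drop fragments
        if is_valid_onion_link(absolute):
            links.append(absolute)
    return links


def looks_like_captcha(html):
    """Assumed heuristic: flag pages that mention a captcha."""
    return "captcha" in html.lower()


def crawl(seeds, fetch, max_pages=100, max_depth=3):
    """Breadth-first crawl; fetch(url, user_agent) returns HTML or None."""
    frontier, seen = deque(), set()
    for url in seeds:
        if is_valid_onion_link(url) and url not in seen:
            seen.add(url)
            frontier.append((url, 0))
    pages, skipped = {}, []
    # Stopping conditions: empty frontier or page budget reached.
    while frontier and len(pages) < max_pages:
        url, depth = frontier.popleft()
        html = fetch(url, next(USER_AGENTS))
        if html is None:  # unreachable service or fetch error
            continue
        if looks_like_captcha(html):  # set aside for a captcha handler
            skipped.append(url)
            continue
        pages[url] = html
        if depth >= max_depth:  # depth limit: store the page, do not expand
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return pages, skipped
```

Calling `crawl(seed_urls, fetch)` returns the downloaded pages keyed by URL together with the URLs set aside because they appear to present a captcha. Keeping `fetch` as a parameter is a design choice that separates the connection settings (proxy, timeouts, retries) from the crawl logic.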