Abstract: Most traditional digital forensic techniques identify irrelevant files in a corpus using keyword search, frequent hashes, frequent paths, and frequent sizes. These methods rely on Message Digest and Secure Hash Algorithm-1 hashes, which are prone to hash collisions. Thresholds based on frequent file sizes tend to be imprecise, which increases the number of irrelevant files that must be evaluated. The blacklisted keywords used in forensic search are matched literally and non-lexically, which increases false-positive search results and fails to disambiguate unstructured text. As a result, many extraneous files are carried forward for further investigation, exacerbating the time lag. Moreover, the lack of standardized labeled forensic data adds time complexity to the file classification process. This research proposes a three-tier Keyword Metadata Pattern framework to address these concerns. First, a Secure Hash Algorithm-256 hash is computed for the entire corpus, together with a custom regex and stop-words module, to overcome hash collisions and imprecise threshold values and to eliminate recurrent files. Next, blacklisted keywords are constructed by identifying vectorized words that lie in close proximity, which addresses the drawbacks of traditional keyword search and reduces false-positive results. Dynamic forensically relevant patterns, based on massive password datasets, are designed to search for unique patterns that identify the significant files and reduce the time lag. Based on the tier-2 results, files are preliminarily classified automatically in O(log n) complexity, and the system is trained with a machine learning model. Finally, in experimental evaluation, the proposed system proved very effective, outperforming the existing two-tier model in finding relevant files by automated labeling and classification in O(n log n) complexity.
Our proposed model could eliminate 223K irrelevant files and reduce the corpus by 4.1% in tier-1, identify 16.06% of sensitive files in tier-2, and classify files with 91% precision, 95% sensitivity, 91% accuracy, and 0.11% Hamming loss compared to the two-tier system.
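
The tier-1 idea of hashing the corpus with Secure Hash Algorithm-256 to eliminate recurrent files can be illustrated with a minimal sketch. The abstract does not give the implementation, so everything here is an assumption made for illustration: the function names `sha256_of_file`, `group_by_hash`, and `recurrent_files` are hypothetical, and the rule "keep the first file of each digest group, flag the rest" is a simple stand-in for the framework's actual elimination step. The custom regex and stop-words module is not modeled.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(root):
    """Map each SHA-256 digest to the files under root that share that content."""
    groups = defaultdict(list)
    # Sorting makes the "first file kept" choice deterministic.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return dict(groups)


def recurrent_files(groups):
    """Assumed rule: every file after the first in a digest group is recurrent."""
    return [path for paths in groups.values() for path in paths[1:]]
```

Calling `recurrent_files(group_by_hash(corpus_dir))` on a directory returns the candidate files to drop before keyword and pattern analysis. Reading in chunks keeps memory use flat for large evidence files.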
