Practical Regular Expression Matching Free of Scalability and Performance Barriers

Kai Wang,Zhe Fu,Xiaohe Hu,Jun Li
DOI: https://doi.org/10.1016/j.comcom.2014.08.005
IF: 5.047
2014-01-01
Computer Communications
Abstract:Regular expressions (regexes) provide rich expressiveness to specify the signatures of intrusions and are widely used in contemporary network security systems for signature-based intrusion detection. To perform very fast regex matching, deterministic finite automata (DFA) has been the first choice because its time complexity is constant O ( 1 ) . Unfortunately, DFA often suffers the well known state explosion problem and, consequently, tends to require prohibitive memory overhead in practical applications. To address the problem, a wide variety of DFA compression techniques have been proposed; however, few can keep up with the ever increasing network traffic bandwidth and regex set complexity. This paper proposes that the DFA problem is rooted in regexes (rather than in DFA), i.e., semantic overlapping of regexes, and accordingly presents a complete algorithmic solution PaCC (Partition, Compression, and Combination), that can transform the given large-scale set of complex regexes into a compact and fast matching engine using DFA as its core. PaCC fundamentally defuses state explosion for DFA by partitioning complex regexes into overlapping-free segments. By exploiting the massive repetitiveness among the resulting segments, PaCC can further deflate corresponding DFA in terms of the number of states. Moreover, on the basis of the characteristics of these segments, PaCC takes a tailor-made compression approach and reduces over 96% of the state transitions for the corresponding DFA. In the final matching engine, the combination of DFA and a small relation mapping table, built from segments and their syntagmatic relations, respectively, guarantees high performance and semantic equivalence. Experimental evaluation shows that PaCC produces succinct matching engines with memory usage proportional to the size of the real-world Snort and Bro regex sets, with speeds of up to 1.7Gbps per core on a HP Z220 SFF workstation with a 3.40GHz Intel Core i7-3770.
What problem does this paper attempt to address?