A Survey on Regular Expression Matching for Deep Packet Inspection: Applications, Algorithms, and Hardware Platforms

Chengcheng Xu, Shuhui Chen, Jinshu Su, S. M. Yiu, Lucas C. K. Hui
DOI: https://doi.org/10.1109/comst.2016.2566669
2016-01-01
Abstract: Deep packet inspection (DPI) is widely used in content-aware network applications such as network intrusion detection systems, traffic billing, load balancing, and government surveillance. Pattern matching is a core and critical step in DPI: it checks the payload of each packet for known signatures (patterns) in order to identify packets with certain characteristics (e.g., malicious packets that carry viruses or worms). Regular expressions are the major tool for signature description because of their powerful and flexible expressiveness. However, this flexibility also makes efficient implementation very challenging in practice. Despite hundreds to thousands of empirical proposals, wire-speed matching for large-scale regular expression sets remains a big challenge. The gap between the matching throughput and the link speed is widening as network link speeds and pattern scale keep increasing. This survey begins with a full-scale application background of DPI and the technical background of regular expression matching in order to provide a global view and essential knowledge for readers. We then analyze the challenges in regular expression matching that originate from the state explosion of the finite state automata used for matching. The nature of state explosion is analyzed in detail, and the state-of-the-art solutions are grouped into methods to relieve state expansion and methods to avoid state explosion; suggestions are also provided for building compact and efficient automata in different scenarios. Furthermore, proposals that employ parallel platforms to accelerate the matching process, including field-programmable gate arrays, GPUs, general multi-processors, and ternary content addressable memory, are introduced and thoroughly discussed. We also provide guidelines for efficient deployment on each of these platforms.
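As a rough illustration of the state explosion mentioned in the abstract, the sketch below applies the textbook subset construction to the pattern `.*a.{n}` over a two-symbol alphabet and counts the reachable DFA states. The pattern, the alphabet, and the function name `dfa_state_count` are assumptions chosen for this example and are not taken from the survey. The NFA for this pattern needs only n + 2 states, while the equivalent DFA has to remember which of the last n + 1 symbols were `a`, so its state count grows exponentially with n.

```python
from collections import deque


def dfa_state_count(n, alphabet="ab"):
    """Count reachable DFA states for the pattern .*a.{n} (illustrative only).

    NFA used here (an assumption of this sketch):
      state 0      : loops on every symbol, and also moves to state 1 on 'a'
      states 1..n  : move to the next state on every symbol
      state n + 1  : accepting, no outgoing transitions
    """

    def step(states, symbol):
        # Advance every active NFA state by one input symbol.
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)
                if symbol == "a":
                    nxt.add(1)
            elif s <= n:
                nxt.add(s + 1)
        return frozenset(nxt)

    # Subset construction: each DFA state is a set of NFA states.
    start = frozenset({0})
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = step(current, symbol)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return len(seen)


if __name__ == "__main__":
    for n in range(0, 9):
        print(f"n={n}: NFA states={n + 2}, DFA states={dfa_state_count(n)}")
```

Running the loop shows the DFA size doubling each time n grows by one, while the NFA grows by a single state. This toy case only hints at the kind of blow-up that the compact-automaton techniques covered in the survey aim to relieve or avoid.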