REGAP: A Tool for Unicode-Based Web Identity Fraud Detection

Anthony Y. Fu,Xiaotie Deng,Liu Wenyin
DOI: https://doi.org/10.1080/15567280600995501
2006-01-01
Journal of Digital Forensic Practice
Abstract: We anticipate widespread use of the internationalized resource identifier (IRI) [1] and the internationalized domain name (IDN) [2] on the web as a complement to the uniform resource identifier (URI). An IRI/IDN is composed of characters from a subset of Unicode, which makes a Unicode attack [3] on IRI/IDNs possible: visually or semantically, certain phishing IRI/IDNs may show high similarity to the real ones. Phishing attacks based on this strategy are very likely to appear in the near future as use of IRI/IDNs grows. We propose a method to detect such phishing attacks. We constructed a Unicode character similarity list (UC-SimList) based on character-to-character visual and semantic similarities, and we use a nondeterministic finite automaton (NFA) [4] to identify potential IRI/IDN-based phishing patterns. We implemented a phishing IRI/IDN pattern generation tool, REGAP, which expresses phishing IRI/IDN patterns as regular expressions (RE) for phishing IRI/IDN detection. We also address how such a tool can be applied to investigations.

Notes:
1. An IRI is a generalization of the uniform resource identifier (URI), which is in turn a generalization of the uniform resource locator (URL). While URIs are limited to a subset of the ASCII character set, IRIs may contain characters from the universal character set (Unicode/ISO 10646). Basically, an IRI is the internationalized version of a URI.
2. An IDN is an Internet domain name that (potentially) contains non-ASCII characters. Such domain names could contain letters with diacritics, as required by many European languages, or characters from non-Latin scripts such as Arabic or Chinese.
3. A Unicode attack is caused by the coexistence of a large number of visually or semantically similar Unicode strings. At the character level, the visually similar Unicode attack is the homograph attack.
4. An NFA is a finite state machine in which, for each pair of state and input symbol, there may be several possible next states. It can be used to recognize strings of a certain pattern. When the last input symbol is consumed, the NFA accepts if and only if there is some set of transitions it could make that takes it to an accepting state. Equivalently, it rejects if, no matter what choices it makes, it would not end in an accepting state.
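The idea of turning a character similarity list into a regular expression can be illustrated with a minimal Python sketch. This is not the REGAP implementation: the tiny `SIM_LIST` table and the function names `lookalike_pattern` and `is_suspicious` are hypothetical, and the table covers only a handful of visually similar characters, whereas the UC-SimList described in the abstract also covers semantic similarity. The regex engine stands in for the NFA here, since both recognize the same set of strings.

```python
import re

# Hypothetical, tiny stand-in for a UC-SimList: each ASCII character maps to
# Unicode characters that look similar to it. A real list would be far larger.
SIM_LIST = {
    "a": ["\u0430"],            # Cyrillic small a
    "c": ["\u0441"],            # Cyrillic small es
    "e": ["\u0435"],            # Cyrillic small ie
    "i": ["\u0456"],            # Cyrillic small Byelorussian-Ukrainian i
    "o": ["\u043e", "\u03bf"],  # Cyrillic small o, Greek small omicron
    "p": ["\u0440"],            # Cyrillic small er
}


def lookalike_pattern(genuine: str) -> re.Pattern:
    """Build a regex matching the genuine name and its character-level look-alikes."""
    parts = []
    for ch in genuine:
        # Each position becomes a character class: the original character
        # plus every similar character listed for it.
        variants = [ch] + SIM_LIST.get(ch, [])
        parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
    return re.compile("".join(parts))


def is_suspicious(candidate: str, genuine: str) -> bool:
    """True if candidate imitates the genuine name without being identical to it."""
    if candidate == genuine:
        return False
    return lookalike_pattern(genuine).fullmatch(candidate) is not None
```

For example, `is_suspicious("p\u0430ypal.com", "paypal.com")` flags a name in which the first Latin "a" has been replaced by the Cyrillic "а", while the genuine name itself and unrelated names are not flagged.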