Yifei Huang, Matin Amini, Alexis Le Glaunec, Konstantinos Mamouras, Mukund Raghothaman
Abstract: SMORE (Chen et al., 2023) recently proposed the concept of semantic regular expressions that extend the classical formalism with a primitive to query external oracles such as databases and large language models (LLMs). Such patterns can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus in their paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem:
First, we present a two-pass NFA-based algorithm to determine whether a string $w$ matches a semantic regular expression (SemRE) $r$ in $O(|r|^2 |w|^2 + |r| |w|^3)$ time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in $O(|r|^2 |w|^2)$ time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline, and incurs a $\approx 2 \times$ overhead over the time needed for interaction with the oracle.
Next, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms which are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an $\Omega(|w|^2)$ lower bound on the number of oracle queries necessary to make this determination.
What problem does this paper attempt to address?
This paper addresses **membership testing for semantic regular expressions (SemREs)**: how to efficiently determine whether a string \( w \) matches a given semantic regular expression \( r \).
### Problem Background
Traditional regular expressions describe purely syntactic structure in text, such as specific character sequences and delimiter boundaries. They are inadequate for patterns that involve semantic concepts (cities, celebrities, political entities, and so on), because membership in such a category cannot be decided from the characters alone. To address this, Chen et al. (2023) proposed **semantic regular expressions (SemREs)**, which extend classical regular expressions with a primitive for querying external oracles, such as databases and large language models (LLMs), to identify these semantic concepts.
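To make the idea of an oracle primitive concrete, here is a minimal Python sketch of a SemRE syntax tree and a naive membership test. It is an illustration only: the class names, the `Query` node (which accepts a substring when the inner pattern matches it and the oracle answers yes), the `toy_oracle` function, and the memoized recursion over substring positions are all assumptions made for this example. It is closer in spirit to a dynamic-programming baseline than to the two-pass NFA-based algorithm the paper proposes.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable


# Hypothetical AST node types for this sketch (not the paper's notation).
@dataclass(frozen=True)
class Empty:          # matches only the empty string
    pass


@dataclass(frozen=True)
class Char:           # matches exactly one given character
    c: str


@dataclass(frozen=True)
class AnyChar:        # matches any single character
    pass


@dataclass(frozen=True)
class Concat:
    left: object
    right: object


@dataclass(frozen=True)
class Alt:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    inner: object


@dataclass(frozen=True)
class Query:          # inner pattern must match AND the oracle must accept
    inner: object
    q: str


def matches(r, w: str, oracle: Callable[[str, str], bool]) -> bool:
    """Naive membership test: does the whole string w match r?"""

    @lru_cache(maxsize=None)
    def ask(q: str, s: str) -> bool:
        # Cache answers so each (query, substring) pair is asked at most once.
        return oracle(q, s)

    @lru_cache(maxsize=None)
    def m(node, i: int, j: int) -> bool:
        # True iff node matches the substring w[i:j].
        if isinstance(node, Empty):
            return i == j
        if isinstance(node, Char):
            return j == i + 1 and w[i] == node.c
        if isinstance(node, AnyChar):
            return j == i + 1
        if isinstance(node, Alt):
            return m(node.left, i, j) or m(node.right, i, j)
        if isinstance(node, Concat):
            return any(m(node.left, i, k) and m(node.right, k, j)
                       for k in range(i, j + 1))
        if isinstance(node, Star):
            if i == j:
                return True
            # Peel off a non-empty first iteration so the recursion terminates.
            return any(m(node.inner, i, k) and m(node, k, j)
                       for k in range(i + 1, j + 1))
        if isinstance(node, Query):
            return m(node.inner, i, j) and ask(node.q, w[i:j])
        raise TypeError(f"unknown node: {node!r}")

    return m(r, 0, len(w))


# Toy oracle standing in for a database or an LLM (an assumption of this sketch).
CITIES = {"Paris", "Lima"}


def toy_oracle(q: str, s: str) -> bool:
    return q == "city" and s in CITIES


# "Any text, then some substring the oracle recognizes as a city, then any text."
any_text = Star(AnyChar())
contains_city = Concat(any_text, Concat(Query(any_text, "city"), any_text))
```

With these definitions, `matches(contains_city, "Fly to Lima!", toy_oracle)` holds because the substring `"Lima"` is accepted by the toy oracle, whereas `"Fly home"` contains no accepted substring. Note that this naive approach may consult the oracle on many substrings, which is exactly the cost the paper sets out to reduce.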
### Main Contributions of the Paper
1. **A two-pass NFA-based algorithm**: The algorithm determines whether a string \( w \) matches a semantic regular expression \( r \) in \( O(|r|^2|w|^2+|r||w|^3) \) time, where \( |r| \) and \( |w| \) denote the lengths of the semantic regular expression and of the string respectively, and the oracle is assumed to answer each query in unit time. In the common case where oracle queries are not nested, the running time is \( O(|r|^2|w|^2) \).
2. **A connection with the triangle-finding problem in graph theory**: This connection suggests that developing algorithms that are both practical and asymptotically faster may be challenging. The paper also proves a lower bound of \( \Omega(|w|^2) \) on the number of oracle queries needed to decide membership.
3. **Experimental evaluation**: The proposed algorithm was evaluated on a benchmark dataset. The results show that its throughput is 101 times that of the dynamic programming baseline, and that it requires only about 51% of the baseline's oracle queries.
### Formula Summary
- Time complexity formula:
\[
O(|r|^2|w|^2 + |r||w|^3)
\]
In the case of non-nested queries:
\[
O(|r|^2|w|^2)
\]
- Lower bound on the number of oracle queries in the worst case:
\[
\Omega(|w|^2)
\]
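To put these bounds in perspective, the following short calculation is a sketch under our own assumptions rather than a derivation taken from the paper. First, the ratio of the two terms in the general time bound shows when the cubic term dominates:
\[
\frac{|r|\,|w|^3}{|r|^2\,|w|^2} = \frac{|w|}{|r|}
\]
So the \( |r||w|^3 \) term is the larger one whenever \( |w| > |r| \). For illustrative values \( |r| = 10 \) and \( |w| = 1000 \), the terms evaluate to \( |r|^2|w|^2 = 10^8 \) and \( |r||w|^3 = 10^{10} \). Second, one plausible intuition for the quadratic query bound is that a string has quadratically many non-empty substring occurrences, each of which an oracle might in principle need to be asked about:
\[
\#\{(i,j) : 0 \le i < j \le |w|\} = \binom{|w|+1}{2} = \frac{|w|(|w|+1)}{2} \in \Theta(|w|^2)
\]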
By solving the membership testing problem of semantic regular expressions, this paper provides a theoretical basis and practical algorithms for efficiently processing text pattern matching involving semantic concepts.