Abstract: Multiple algorithms are known for efficiently calculating the prefix probability of a string under a probabilistic context-free grammar (PCFG). Good algorithms for the problem have a runtime cubic in the length of the input string. However, some proposed algorithms are suboptimal with respect to the size of the grammar. This paper proposes a novel speed-up of Jelinek and Lafferty's (1991) algorithm, whose original runtime is $O(n^3 |N|^3 + |N|^4)$, where $n$ is the input length and $|N|$ is the number of non-terminals in the grammar. In contrast, our speed-up runs in $O(n^2 |N|^3 + n^3 |N|^2)$.
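To make the two bounds concrete, here is a rough worked comparison. The notation $\Sigma^*$ (all terminal strings) and the values $n = 20$, $|N| = 100$ are assumptions chosen for illustration, and the counts ignore constant factors, so they indicate scaling rather than measured runtime.

$$
p_{\text{prefix}}(w_1 \cdots w_n) \;=\; \sum_{u \in \Sigma^*} p(w_1 \cdots w_n\, u)
$$

$$
\begin{aligned}
n^3 |N|^3 + |N|^4 &= 8000 \cdot 10^6 + 10^8 = 8.1 \times 10^9, \\
n^2 |N|^3 + n^3 |N|^2 &= 400 \cdot 10^6 + 8000 \cdot 10^4 = 4.8 \times 10^8 .
\end{aligned}
$$

Under these assumed values the dominant term counts differ by a factor of about 17. In general, when $|N|^4$ is not the dominant term, the ratio of the leading terms is on the order of $\min(n, |N|)$.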

What problem does this paper attempt to address?

The paper addresses the problem of efficiently computing the prefix probability of a string under a Probabilistic Context-Free Grammar (PCFG). Specifically, the authors propose an improved version of the Jelinek and Lafferty (1991) algorithm. The main points can be summarized as follows:

1. **Problem Background**:
   - Probabilistic Context-Free Grammars (PCFGs) are widely used in Natural Language Processing (NLP) for building language models.
   - When a PCFG is used as a language model, it is necessary to compute the prefix probability, i.e., the probability that the grammar generates a string beginning with the given prefix.
   - Existing efficient algorithms include those proposed by Jelinek and Lafferty (1991) and Stolcke (1995), but some of them are suboptimal with respect to the size of the grammar.
2. **Proposed Method**:
   - The paper proposes a new algorithm that improves upon the original Jelinek and Lafferty algorithm, reducing the cost for dense grammars.
   - The improved algorithm has a time complexity of $O(n^2 |N|^3 + n^3 |N|^2)$, where $n$ is the length of the input string and $|N|$ is the number of non-terminals.
   - This is better than the original algorithm's $O(n^3 |N|^3 + |N|^4)$, especially for dense grammars with a large number of non-terminals.
3. **Technical Details**:
   - By reorganizing the formula for prefix probabilities and introducing additional memoization, the algorithm avoids redundant calculations.
   - The CKY algorithm is used to precompute "inside probabilities," and additional data structures $\gamma$ and $\delta$ store intermediate results that would otherwise be recomputed.
4. **Scope of Application**:
   - The proposed algorithm applies to PCFGs in Chomsky Normal Form (CNF) and can be extended to semiring-weighted CFGs.
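The ingredients named above (CKY inside probabilities plus a left-corner closure) can be sketched as follows. This is a minimal, unoptimized illustration of a Jelinek-and-Lafferty-style recursion for a CNF grammar, not the paper's speed-up: it does not use the $\gamma$ and $\delta$ tables, and it assumes the grammar is tight (every non-terminal generates finite strings with total probability 1). The function names, the dictionary-based grammar encoding, and the toy grammar are all hypothetical choices made for this example.

```python
from collections import defaultdict


def invert(m):
    """Invert a small square matrix (list of lists) by Gauss-Jordan elimination."""
    n = len(m)
    # Augment the matrix with the identity on the right.
    a = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        scale = a[col][col]
        a[col] = [x / scale for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def prefix_probability(words, nts, binary, lexical, start):
    """Prefix probability of `words` under a tight CNF PCFG (assumed encoding).

    binary:  {(X, Y, Z): p} for rules X -> Y Z
    lexical: {(X, w): p}    for rules X -> w
    """
    n = len(words)
    if n == 0:
        return 1.0  # every generated string starts with the empty prefix
    k = len(nts)
    idx = {x: i for i, x in enumerate(nts)}

    # Inside probabilities beta[X, i, j] for the span words[i:j], via CKY.
    beta = defaultdict(float)
    for i, w in enumerate(words):
        for x in nts:
            beta[x, i, i + 1] = lexical.get((x, w), 0.0)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for (x, y, z), p in binary.items():
                for m in range(i + 1, j):
                    beta[x, i, j] += p * beta[y, i, m] * beta[z, m, j]

    # Left-corner matrix P_L[X][Y] = sum_Z p(X -> Y Z) and its closure (I - P_L)^-1.
    pl = [[0.0] * k for _ in range(k)]
    for (x, y, _z), p in binary.items():
        pl[idx[x]][idx[y]] += p
    closure = invert([[float(a == b) - pl[a][b] for b in range(k)] for a in range(k)])

    # pi[X, i]: probability that X derives a string starting with words[i:].
    pi = {}
    for i in range(n - 1, -1, -1):
        base = [0.0] * k
        if i == n - 1:
            # A single remaining word is emitted directly by a lexical rule.
            for x in nts:
                base[idx[x]] = lexical.get((x, words[i]), 0.0)
        else:
            # The left child covers words[i:m] completely; the right child
            # covers the rest of the prefix and may continue beyond it.
            for (x, y, z), p in binary.items():
                for m in range(i + 1, n):
                    base[idx[x]] += p * beta[y, i, m] * pi[z, m]
        # The closure accounts for chains of left children spanning the whole prefix.
        for x in nts:
            pi[x, i] = sum(closure[idx[x]][b] * base[b] for b in range(k))
    return pi[start, 0]


# Toy grammar (hypothetical): S -> A S (0.5) | a (0.5), A -> a (1.0).
# It generates a^m with probability 0.5^m, so the prefix a^n has probability 0.5^(n-1).
NTS = ["S", "A"]
BINARY = {("S", "A", "S"): 0.5}
LEXICAL = {("S", "a"): 0.5, ("A", "a"): 1.0}
print(prefix_probability(["a", "a", "a"], NTS, BINARY, LEXICAL, "S"))
```

The matrix inverse is written out by hand only to keep the sketch free of dependencies; with a numerical library available, a standard linear solver would be the more natural choice.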
In summary, the main contribution of this paper is a more efficient algorithm for computing prefix probabilities under PCFGs, with an asymptotic advantage over the original Jelinek and Lafferty algorithm that is most pronounced for grammars with many non-terminals. This matters wherever a PCFG is used as a language model, since that use requires prefix probabilities.

A Fast Algorithm for Computing Prefix Probabilities
