Span-based Normalized Edit Distance Applied to Unsupervised Parsing
S. Dennis, W. Oliver
Abstract: We present a span-based version of the normalized edit distance measure (Marzal & Vidal, 1993), which is more appropriate for linguistic tasks, and give an O(nm) algorithm for its calculation. Span similarities used in the algorithm are derived by taking the cosines between the left vectors of a reduced singular value decomposition of a span-by-context matrix. To test the model, an exemplar-based approach is used to provide unsupervised parses of sentences from the Penn Treebank, using nearest neighbour extraction based on a version of Locality Sensitive Hashing (Indyk & Motwani, 1998; Gionis, Indyk, & Motwani, 1999). Initial results indicate that the method provides parsing recall and precision equivalent to other unsupervised methods.

Introduction

The nativist/empiricist debate on the origin of language has been one of the longest and most hotly contested in the history of cognitive science (Pinker, 1994; Elman, 1999). On the one hand, languages are clearly learned at some level, with a great many variations that differ in quite subtle ways. Furthermore, the difficulty in creating an explanation of how the genes might influence language development suggests that it is unlikely that our biological endowment has a direct influence (Elman, 1999). However, the fact that humans have a much more complex system of language than other primates, that there are similarities across the world's languages, and that language acquisition takes similar paths in different cultures suggests a strong innate component (Pinker, 1994).

One key, if unstated, plank in the nativist case is that to this point no statistical learning procedure capable of capturing the syntax of a complex natural language has been devised (see Dennis, under review; Klein & Manning, 2001).
While connectionist models have demonstrated an ability to solve restricted problems with toy corpora (Elman, 1991), issues such as systematicity and constituent formation and movement remain unresolved (Hadley, 1994), seriously undermining the empiricist position.

In addition, from a practical perspective, the inability to create syntactic analyses in an unsupervised fashion makes the application of natural language processing systems in new domains tedious. Either one must hand-specify appropriate rules or one must create annotated corpora on which to train systems. Both of these tasks are difficult and time consuming.

In this paper, we outline attempts to improve an exemplar-based model of unsupervised parsing proposed by Dennis (under review) using span-based normalized edit distance (SNED). We start by defining normalized edit distance and the span-based modification. Then we discuss how one can calculate the span similarities necessary to apply the method to sentences. Next we describe a version of Locality Sensitive Hashing (Indyk & Motwani, 1998; Gionis et al., 1999) adapted to work with part-of-speech strings. Finally, we present recall and precision parsing data on sentences drawn from the Penn Treebank (Marcus et al., 1993).

Definitions of Edit Distances

Edit Distance

Following the notation of Marzal and Vidal (1993), let $\Sigma$ be a finite alphabet and $\Sigma^*$ be the set of all finite-length strings over $\Sigma$. Let $X = X_1 X_2 \ldots X_n$ be a string of $\Sigma^*$, where $X_i$ is the $i$th symbol of $X$. We denote by $X_{i \ldots j}$ the substring of $X$ that includes the symbols from $X_i$ to $X_j$, $1 \le i, j \le n$. The length of such a string is $|X_{i \ldots j}| = j - i + 1$. If $i > j$, $X_{i \ldots j}$ is the null string $\lambda$, with $|\lambda| = 0$.

An elementary edit operation is a pair $(a, b) \neq (\lambda, \lambda)$, where $a$ and $b$ are strings of length 0 or 1. The edit operations are termed insertions $(\lambda, b)$, substitutions $(a, b)$ and deletions $(a, \lambda)$. An edit transformation of $X$ into $Y$ is a sequence $S$ of elementary operations that transforms $X$ into $Y$.
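As an illustration of these definitions, the sketch below represents an elementary edit operation as a pair of Python strings, with the empty string standing in for the null string λ. It assumes one common reading of "transforms X into Y": the first components of the operations concatenate to X and the second components concatenate to Y. The names `LAMBDA`, `is_elementary`, `classify` and `is_transformation` are hypothetical and do not come from the paper.

```python
LAMBDA = ""  # stands in for the null string (lambda); an assumption of this sketch


def is_elementary(op):
    """True if op = (a, b) has |a|, |b| <= 1 and is not (lambda, lambda)."""
    a, b = op
    return len(a) <= 1 and len(b) <= 1 and (a, b) != (LAMBDA, LAMBDA)


def classify(op):
    """Name the operation; an identity pair (a, a) counts as a substitution."""
    a, b = op
    if a == LAMBDA:
        return "insertion"
    if b == LAMBDA:
        return "deletion"
    return "substitution"


def is_transformation(ops, x, y):
    """True if ops is a sequence of elementary operations that turns x into y.

    Assumed reading: the left sides spell out x and the right sides spell out y.
    """
    return (
        all(is_elementary(op) for op in ops)
        and "".join(a for a, _ in ops) == x
        and "".join(b for _, b in ops) == y
    )


# One possible edit transformation of "cat" into "cuts".
ops = [("c", "c"), ("a", "u"), ("t", "t"), (LAMBDA, "s")]
print(is_transformation(ops, "cat", "cuts"))
print([classify(op) for op in ops])
```

Single characters play the role of symbols here only for brevity; for part-of-speech strings one would likely use lists of tags instead of Python strings.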
Typically, edit operations have associated costs $\gamma(a, b)$. The function $\gamma$ can be extended to edit transformations $S = S_1 S_2 \ldots S_l$ by letting $\gamma(S) = \sum_{i=1}^{l} \gamma(S_i)$. Given $X, Y \in \Sigma^*$ and $S^*_{XY}$, the set of all edit transformations of $X$ into $Y$, the edit distance is defined as:

$$\delta(X, Y) = \min\{\gamma(S) \mid S \in S^*_{XY}\} \tag{1}$$

Note that the triangle inequality is a consequence of this definition, so provided $\gamma(a, a) = 0$, $\gamma(a, b) > 0$ if $a \neq b$, and $\gamma(a, b) = \gamma(b, a)$ $\forall a, b \in \Sigma \cup \{\lambda\}$, $\delta$ is a metric. Dynamic programming algorithms of complexity $O(mn)$, where $n$ is the length of $X$ and $m$ is the length of $Y$, exist to calculate edit distance and to retrieve minimal edit transformations (Wagner & Fischer, 1974).

Normalized Edit Distance

Let $L(S)$ be the length of a given edit transformation; then the normalized edit distance (NED) defined by Marzal and Vidal (1993) is:

$$d(X, Y) = \min\{\gamma(S) / L(S) \mid S \in S^*_{XY}\} \tag{2}$$

Note that normalized edit distance is not a metric. It can, however, be calculated in $O(nm^2)$ time using an algorithm provided by Marzal and Vidal (1993). Marzal and Vidal (1993) also show that NED does not produce the same answer as post-normalizing, that is, finding the minimum-cost path and dividing its cost by its length. Furthermore, for a handwritten character recognition task, normalized edit distance produced better performance than either normal edit distance or post-normalized edit distance.

Span-based Normalized Edit Distance (SNED)

While the normalized edit distance has proven successful in a number of tasks, when analyzing sentence structure we would prefer a version of the algorithm that aligns spans of symbols rather than individual symbols. Providing a definition of span-based edit distance involves relaxing the restriction in the normal algorithm, so that the strings $a$ and $b$ are drawn from $\Sigma^*$ (see footnote 1). So, the edit operations become $(a, b) = (X_{i \ldots j}, Y_{k \ldots l})$ for $0 \le i \le j \le n$, $0 \le k \le l \le m$. Similarly, one can define span-based normalized edit distance in an analogous way.
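To make definitions (1) and (2) concrete, here is a minimal Python sketch of the symbol-level measures only; it does not cover the span-based variant. `edit_distance` is the standard dynamic programme in the style of Wagner and Fischer. `normalized_edit_distance` is a deliberately simple approach that tracks the cheapest path for every possible path length and then takes the best ratio; it is not the optimized algorithm of Marzal and Vidal (1993) and runs in O((n+m)nm) time. The function names, the `cost(a, b)` callback, the use of the empty string for λ and the `unit_cost` function are all assumptions made for this example.

```python
import math


def unit_cost(a, b):
    """Hypothetical cost function: 0 for a match, 1 for any other operation."""
    return 0.0 if a == b else 1.0


def edit_distance(x, y, cost):
    """Equation (1): minimum total cost of an edit transformation of x into y."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(x[i - 1], "")      # deletions only
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost("", y[j - 1])      # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + cost(x[i - 1], ""),           # deletion
                d[i][j - 1] + cost("", y[j - 1]),           # insertion
                d[i - 1][j - 1] + cost(x[i - 1], y[j - 1]), # substitution
            )
    return d[n][m]


def normalized_edit_distance(x, y, cost):
    """Equation (2): minimum of cost / length over all edit transformations."""
    n, m = len(x), len(y)
    if n == 0 and m == 0:
        return 0.0  # convention chosen for this sketch
    # prev[i][j]: cheapest way to edit x[:i] into y[:j] using exactly k - 1 operations
    prev = [[math.inf] * (m + 1) for _ in range(n + 1)]
    prev[0][0] = 0.0
    best = math.inf
    for k in range(1, n + m + 1):  # no transformation needs more than n + m operations
        cur = [[math.inf] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            for j in range(m + 1):
                c = math.inf
                if i > 0:
                    c = min(c, prev[i - 1][j] + cost(x[i - 1], ""))
                if j > 0:
                    c = min(c, prev[i][j - 1] + cost("", y[j - 1]))
                if i > 0 and j > 0:
                    c = min(c, prev[i - 1][j - 1] + cost(x[i - 1], y[j - 1]))
                cur[i][j] = c
        if cur[n][m] < math.inf:
            best = min(best, cur[n][m] / k)
        prev = cur
    return best


print(edit_distance("abc", "abd", unit_cost))
print(normalized_edit_distance("abc", "abd", unit_cost))
```

Under these unit costs, "abc" and "abd" have edit distance 1 and normalized edit distance 1/3, achieved by the three-operation path with a single substitution. Keeping one table per path length is the simplest way to respect the ratio in (2), at the price of extra time compared with the algorithm cited in the text.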
The appendix provides an algorithm capable of calculating the span-based normalized edit distance with time complexity $O(n^2 m^3)$ and space complexity $O(nm^2)$.

Exemplar-based Parsing

The algorithm that we employ for parsing sentences is a version of the Syntagmatic Paradigmatic model (Dennis, in press, 2004, under review). In this model, sentence parsing involves aligning near neighbour exemplar sentences from memory with the target sentence. For instance, suppose we wish to parse the sentence "His dog was big." (see Figure 1). We start by converting the sentence to a part of speech (POS) sequence "PRP$ NN VBD JJ", where PRP$ = possessive pronoun, NN = noun, VBD = past tense verb and JJ = adjective. Next we identify near neighbour POS sequences from a

Footnote 1: For the current purposes, we assume that $a, b \neq \lambda$, although it would be useful to draw $a$ and $b$ from $\Sigma^* \cup \{\lambda\}$ as an alternative formulation.

S