Sentence Alignment Based on the Text Length Between Punctuation Marks

Mohamed Abdel Fattah, Fuji Ren
2008-01-01
Information
Abstract: Parallel corpora have become an essential resource for work in multilingual natural language processing. Sentence-aligned parallel corpora are more useful than non-aligned parallel corpora for cross-language information retrieval and machine translation applications. In this paper, we present a new approach to aligning sentences in bilingual parallel corpora based on the character length of the text between successive punctuation marks. A probabilistic score is assigned to each proposed correspondence of texts, based on the scaled difference between the lengths of the two texts (in characters) and the variance of this difference. Using this score, the time required for punctuation mark matching decreased and the sentence alignment accuracy increased. With this new approach, we achieved an error reduction of 26.5% over the length-based approach when applied to English-Arabic parallel documents. The sentence alignment execution time decreased to 17% of the total time required by the combined model, which uses the length-based approach and the punctuation approach together. Moreover, the proposed approach outperforms the approaches of Melamed and Moore.
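
The scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic length-based formulation in which the scaled length difference is treated as a standard normal variable. The function names `segment_lengths` and `length_score`, the punctuation set, and the parameter values `c` (expected target characters per source character) and `s2` (variance per source character) are all illustrative assumptions, since the abstract does not give them.

```python
import math
import re

# Assumed punctuation set (Latin plus common Arabic marks); the paper's
# actual set is not stated in the abstract.
PUNCT = r"[.,;:!?\u060C\u061B\u061F]"


def segment_lengths(text):
    """Character lengths of the text pieces between successive punctuation marks."""
    pieces = (piece.strip() for piece in re.split(PUNCT, text))
    return [len(piece) for piece in pieces if piece]


def length_score(len_src, len_tgt, c=1.0, s2=6.8):
    """Cost of pairing two segments; lower means a more plausible match.

    c and s2 are placeholder values, not taken from the paper.
    """
    # Scaled difference of lengths, normalized by the assumed variance.
    delta = (len_tgt - c * len_src) / math.sqrt(max(len_src, 1) * s2)
    # Two-sided tail probability of a standard normal at |delta|.
    p = math.erfc(abs(delta) / math.sqrt(2.0))
    # Negative log-probability; the floor avoids log(0) for extreme deltas.
    return -math.log(max(p, 1e-300))
```

In a full aligner, a cost of this kind would typically feed a dynamic-programming search over candidate pairings of segments; that step is omitted here.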