Philipp Koehn: Statistical Machine Translation.
Zhang Xiao-jun
DOI: https://doi.org/10.1093/applin/amr017
IF: 3.6
2011-01-01
Applied Linguistics
Abstract: ‘Something old, something new, something borrowed, and something blue’: this traditional wedding saying aptly describes the main features of Statistical Machine Translation by Philipp Koehn. As a new discipline, machine translation (MT) has been developing for only 60 years, a far shorter history than that of linguistics or mathematics. Statistical machine translation (SMT), a branch of MT, is younger still, having become popular only since the 1980s. Nevertheless, the ‘old’ ingredients of linguistic and mathematical knowledge are indispensable in this new field. The foundational part of the book (the first three chapters) introduces this basic knowledge. Chapter 1 gives an overview of the book by providing a summary of each chapter and the layout of the book. It also sketches the general history of MT and the possible applications of MT technologies, including Fully-Automatic High-Quality Machine Translation (FAHQMT), gisting, speech translation, translation on hand-held devices, post-editing, and translation tools. Finally, it provides a list of available resources (tools and corpora) for Statistical Machine Translation readers. Chapter 2 introduces basic linguistic concepts related to morphology, syntax, and semantics from the point of view of SMT, including tokenization, Zipf’s law (an empirical formula for the distribution of words in a corpus), parts of speech, ambiguity, and different grammatical formalisms. It also discusses the acquisition of parallel corpora and aligned bilingual texts. Chapter 3 introduces the basic concepts of probability theory and information theory that are essential in SMT. It presents binomial and normal distributions, the chain rule, Bayes’ rule (a theorem on conditional probabilities), entropy, and mutual information, among other topics.
Remarkably, in the second part (core methods) and the last part (advanced topics), Koehn emphasizes the application of linguistic knowledge and the derivation of mathematical formulas when he introduces various models and methods. Chapter 10, entitled ‘Integrating Linguistic Information’, shows how morphological and syntactic information can be integrated into SMT systems. Three possible points of integration are analyzed: as a pre-processing stage, as a post-processing stage, and as an inner extension of phrase-based models. The chapter also offers some ideas on the transliteration of names in different languages (Chinese, Japanese, Latin, and Arabic). These are the ‘old’ elements of the book.