Forming Word Classes by Statistical Clustering for Statistical Language Modelling

Reinhard Kneser, Hermann Ney
DOI: https://doi.org/10.1007/978-94-011-1769-2_15
1993-01-01
Abstract: In statistical language modelling there is always a problem of sparse data. A way to reduce this problem is to form groups of words in order to get equivalence classes. In this paper we present a clustering algorithm that builds abstract word equivalence classes. The algorithm finds a local optimum according to a maximum-likelihood criterion. Experiments were made on an English 1.1-million-word corpus and a German 100,000-word corpus. Compared to a word bigram model, the use of clustered equivalence classes in a bigram class model leads to a significant improvement, as measured by the perplexity. Depending on the size of the training material, the automatically clustered word classes are even better than manually determined categories.
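The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of how a maximum-likelihood word clustering for a bigram class model could look. It assumes the class bigram factorisation p(w|v) = p(w|g(w)) * p(g(w)|g(v)) with relative-frequency estimates, and a greedy search that moves one word at a time to the class that most increases the training log-likelihood, stopping at a local optimum. The function names, the round-robin initialisation, and the brute-force recomputation of the likelihood after every trial move are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def class_bigram_log_likelihood(tokens, word_to_class):
    """Training log-likelihood under p(w|v) = p(w|g(w)) * p(g(w)|g(v)).

    All probabilities are relative frequencies (maximum-likelihood estimates)
    taken from the bigrams of `tokens`.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    succ_word = Counter()      # count of w as the second word of a bigram
    succ_class = Counter()     # count of g(w) as the second class
    pred_class = Counter()     # count of g(v) as the first class
    class_bigrams = Counter()  # count of the class pair (g(v), g(w))
    for (v, w), n in bigrams.items():
        gv, gw = word_to_class[v], word_to_class[w]
        succ_word[w] += n
        succ_class[gw] += n
        pred_class[gv] += n
        class_bigrams[gv, gw] += n

    ll = 0.0
    for (v, w), n in bigrams.items():
        gv, gw = word_to_class[v], word_to_class[w]
        p_word_given_class = succ_word[w] / succ_class[gw]
        p_class_given_class = class_bigrams[gv, gw] / pred_class[gv]
        ll += n * math.log(p_word_given_class * p_class_given_class)
    return ll


def exchange_clustering(tokens, num_classes, max_sweeps=10):
    """Greedy one-word-at-a-time search for a locally optimal class mapping.

    Assumption: words start in round-robin classes; each trial move recomputes
    the full likelihood, which is simple but slow on real corpora.
    """
    vocab = sorted(set(tokens))
    word_to_class = {w: i % num_classes for i, w in enumerate(vocab)}
    best_ll = class_bigram_log_likelihood(tokens, word_to_class)

    for _ in range(max_sweeps):
        moved = False
        for w in vocab:
            original = word_to_class[w]
            best_class = original
            for c in range(num_classes):
                if c == original:
                    continue
                word_to_class[w] = c
                ll = class_bigram_log_likelihood(tokens, word_to_class)
                if ll > best_ll + 1e-12:  # accept only strict improvements
                    best_ll, best_class = ll, c
            word_to_class[w] = best_class
            if best_class != original:
                moved = True
        if not moved:  # no word changed class: local optimum reached
            break
    return word_to_class, best_ll
```

Because a move is accepted only when it strictly increases the log-likelihood, the search cannot end below its starting value, but it may stop at a local rather than global optimum, which matches the abstract's description. Perplexity on the training bigrams can be read off as `math.exp(-best_ll / (len(tokens) - 1))`; a practical implementation would update the counts incrementally instead of recomputing them for every trial move.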