Spam Filtering Based on Latent Semantic Indexing

Wilfried N. Gansterer, Andreas G. K. Janecek, Robert Neumayer
DOI: https://doi.org/10.1007/978-1-84800-046-9_9
2008-01-01
Abstract: In this chapter, the classification performance of latent semantic indexing (LSI) applied to the task of detecting and filtering unsolicited bulk or commercial email (UBE, UCE, commonly called "spam") is studied. Comparisons to the simple vector space model (VSM) and to the extremely widespread de facto standard for spam filtering, the SpamAssassin system, are summarized. It is shown that VSM and LSI achieve significantly better classification results than SpamAssassin. The classification performance achieved in this application context depends strongly on the feature sets used. Consequently, the various classification methods are also compared using two different feature sets: (1) a set of purely textual features of email messages that are based on standard word- and token-extraction techniques, and (2) a set of application-specific "meta features" of email messages as extracted by the SpamAssassin system. It is illustrated that the latter tends to achieve consistently better classification results. A third central aspect discussed in this chapter is problem reduction to lower the computational effort for classification, which is of particular importance in the context of time-critical on-line spam filtering. In particular, the effects of truncating the SVD in LSI and of reducing the underlying feature set are investigated and compared. It is shown that a surprisingly large amount of problem reduction is often possible in the context of spam filtering without heavy loss in classification performance.
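The abstract does not spell out how a message is classified in the vector space model, so the following is only a minimal sketch under stated assumptions: each message is a vector of feature counts, and a new message receives the label of the most similar training message by cosine similarity. The function names, the nearest-neighbor rule, and the toy data are hypothetical illustrations, not the chapter's implementation. LSI would apply the same kind of comparison after projecting the vectors onto the subspace given by a truncated SVD of the feature matrix.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def classify(query, training):
    """Return the label of the training vector most similar to `query`."""
    best_label, best_sim = None, -1.0
    for vector, label in training:
        sim = cosine(query, vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Hypothetical toy data: counts of three made-up features per message.
TRAINING = [
    ([3, 0, 1], "spam"),
    ([2, 1, 0], "spam"),
    ([0, 2, 3], "ham"),
    ([0, 3, 1], "ham"),
]

print(classify([4, 0, 0], TRAINING))
print(classify([0, 1, 2], TRAINING))
```

Dropping columns from every vector before calling `classify` corresponds to reducing the feature set, one of the two forms of problem reduction the abstract mentions.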