Language Independent Text Summarization of Western European Languages Using Shape Coding of Text Elements

Ahmed A. Saleh, Li Weigang
DOI: https://doi.org/10.1109/fskd.2017.8393116
2017-01-01
Abstract: The majority of text summarization techniques in the literature depend, in one way or another, on language-dependent pre-structured lexicons, databases, taggers and/or parsers. Such techniques require prior knowledge of the language of the text being summarized. In this paper we propose an extractive text summarization tool, UnB Language Independent Text Summarizer (UnB-LITS), which is capable of performing text summarization in a language-independent manner. The new model depends on intrinsic characteristics of the text being summarized rather than on its language, and thus eliminates the need for language-dependent lexicons, databases, taggers or parsers. Within this tool, we develop an innovative way of coding the shapes of text elements (words, n-grams, sentences and paragraphs), and we propose language-independent algorithms that are capable of normalizing words and performing relative stemming or lemmatization. The proposed algorithms and the Shape-Coding routine enable the UnB-LITS tool to extract intrinsic features of document elements and score them statistically to produce a representative extractive summary independent of the document language. In this paper we focus on single-document summarization of Western European languages. The tool was tested on hundreds of documents written in English, Portuguese, French and Spanish, and it performed better than results reported in the literature and better than commercial summarizers.
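The abstract does not spell out the Shape-Coding routine or the scoring formula, so the following is only a minimal sketch of the general idea of lexicon-free, statistical extractive summarization, and not the UnB-LITS algorithm. All names (`split_sentences`, `tokenize`, `summarize`) and the `min_word_len` filter are assumptions introduced for illustration; the length filter is a crude stand-in for a stop-word list, chosen because it needs no language-specific resource.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive split after ., ! or ? followed by whitespace.
    # Assumption: Western European punctuation conventions.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # Unicode-aware word tokens, lowercased; no lexicon, tagger or parser.
    return [w.lower() for w in re.findall(r"\w+", sentence)]


def summarize(text, n_sentences=2, min_word_len=4):
    """Return the n highest-scoring sentences in document order.

    Hypothetical scoring: a sentence's score is the mean document
    frequency of its words, ignoring words shorter than min_word_len.
    """
    sentences = split_sentences(text)
    tokens = [
        [w for w in tokenize(s) if len(w) >= min_word_len]
        for s in sentences
    ]
    # Word statistics come only from the document being summarized.
    freq = Counter(w for toks in tokens for w in toks)

    scored = []
    for i, toks in enumerate(tokens):
        score = sum(freq[w] for w in toks) / len(toks) if toks else 0.0
        scored.append((score, i))

    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:n_sentences]
    keep = sorted(i for _, i in top)
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    doc = "O gato dorme. O gato come peixe. Chove."
    print(summarize(doc, n_sentences=1))
```

Because the statistics are drawn from the input text alone, the same function can be applied unchanged to English, Portuguese, French or Spanish input; a fuller system along the lines the abstract describes would add word normalization and richer features before scoring.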