TxLASM: A Novel Language Agnostic Summarization Model for Text Documents
Ahmed Abdelfattah Saleh,Weigang Li
DOI: https://doi.org/10.1016/j.eswa.2023.121433
IF: 8.5
2024-01-01
Expert Systems with Applications
Abstract: In the Natural Language Processing (NLP) domain, the majority of automatic text summarization approaches depend on prior knowledge of the language and/or the domain of the text being summarized. Such approaches require language-dependent part-of-speech taggers, parsers, databases, pre-structured lexicons, etc. In this research, we propose a novel automatic text summarization model, Text Documents - Language Agnostic Summarization Model (TxLASM), which performs extractive text summarization in a language- and domain-agnostic manner. TxLASM depends on specific characteristics of the major elements of the text being summarized rather than on its domain, context, or language, and thus removes the need for language-dependent pre-processing tools, taggers, parsers, lexicons, or databases. Within TxLASM, we present a novel technique for encoding the shapes of the major text elements (paragraphs, sentences, n-grams, and words); moreover, we present language-independent preprocessing algorithms that normalize words and perform relative stemming or lemmatization. These algorithms and the Shape-Coding technique enable TxLASM to extract intrinsic features of text elements, score them statistically, and subsequently extract a representative summary that is independent of the text's language, domain, and context. TxLASM was applied to English and Portuguese benchmark datasets, and the results were compared with twelve state-of-the-art (SOTA) approaches presented in recent literature. In addition, the model was applied to French and Spanish news datasets, and the results were compared with those obtained by standard commercial summarization tools. TxLASM outperformed all the SOTA approaches as well as the commercial tools in all four languages while maintaining its language- and domain-agnostic nature.
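To make the general idea of language-agnostic extractive summarization concrete, the following is a minimal Python sketch. It is not the TxLASM algorithm: the abstract does not specify the Shape-Coding technique or the scoring formula, so this sketch substitutes a plain word-frequency score. The function name `summarize`, the parameter `k`, and the punctuation-based sentence splitter are all assumptions made for illustration.

```python
import re
from collections import Counter


def summarize(text: str, k: int = 2) -> list[str]:
    """Return the k highest-scoring sentences, in their original order.

    Hypothetical illustration only: word-frequency scoring stands in for
    the statistical scoring that the abstract mentions but does not define.
    """
    # Split after sentence-ending punctuation; no language-specific tools are used.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # \w+ matches Unicode letters and digits, so no per-language tokenizer is needed.
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)

    def score(words: list[str]) -> float:
        # Mean document-level frequency of the sentence's words (0 if no words).
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokenized[i]), reverse=True)
    # Restore document order so the extract reads naturally.
    return [sentences[i] for i in sorted(ranked[:k])]
```

Because the sketch relies only on punctuation and Unicode word characters, the same function can be called on text in English, Portuguese, French, or Spanish without any tagger, parser, or lexicon, which is the property the abstract emphasizes.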