Abstract: In the Natural Language Processing (NLP) domain, the majority of automatic text summarization approaches depend on prior knowledge of the language and/or the domain of the text being summarized. Such approaches require language-dependent part-of-speech taggers, parsers, databases, pre-structured lexicons, etc. In this research, we propose a novel automatic text summarization model, Text Documents - Language Agnostic Summarization Model (TxLASM), which performs extractive text summarization in a language- and domain-agnostic manner. TxLASM depends on specific characteristics of the major elements of the text being summarized rather than on its domain, context, or language, and thus rules out the need for language-dependent pre-processing tools, taggers, parsers, lexicons, or databases. Within TxLASM, we present a novel technique for encoding the shapes of the major text elements (paragraphs, sentences, n-grams, and words); moreover, we present language-independent preprocessing algorithms that normalize words and perform relative stemming or lemmatization. These algorithms and the Shape-Coding technique enable TxLASM to extract intrinsic features of text elements, score them statistically, and subsequently extract a representative summary that is independent of the text's language, domain, and context. TxLASM was applied to English and Portuguese benchmark datasets, and the results were compared to twelve state-of-the-art (SOTA) approaches presented in recent literature. In addition, the model was applied to French and Spanish news datasets, and the results were compared to those obtained by standard commercial summarization tools. TxLASM outperformed all the SOTA approaches as well as the commercial tools in all four languages while maintaining its language- and domain-agnostic nature.

A Comprehensive Method for Text Summarization Based on Latent Semantic Analysis

Automatic Text Summarization Based on Latent Semantic Indexing

Research on automatic text summarization based on latent semantic indexing

An Enhanced Latent Semantic Analysis Approach for Arabic Document Summarization

Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis

Integrating Extractive and Abstractive Models for Long Text Summarization

Topic-Aware Abstractive Text Summarization

Research on Multi-Document Summarization Based on Latent Semantic Indexing

Creating Generic Text Summaries

Text Summarization Based on Sentence Selection with Semantic Representation

SEASum: Syntax-Enriched Abstractive Summarization

Text Summarization Using Sentence-Level Semantic Graph Model

GATSum: Graph-Based Topic-Aware Abstract Text Summarization

Comparative summarization via Latent Semantic Analysis

English automatic text summarization

An Enhanced Lsa-Based Approach for Update Summarization

Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

Topic Modeling Based Text Summarization Approach

Using Topic Labels for Text Summarization

Topic-based Visual Text Summarization and Analysis

TxLASM: A Novel Language Agnostic Summarization Model for Text Documents