Abstract: Text summarization produces a condensed subset of the original content that retains its most important or relevant information, which effectively reduces information redundancy. Neural network methods have recently achieved good results on text summarization in both Chinese and English, but research on summarization for low-resource languages, especially Tibetan, is still at an exploratory stage. What's more, there is no large-scale annotated corpus for text summarization in such languages, and this lack of datasets severely limits the development of low-resource text summarization. In this setting, unsupervised learning approaches are more appealing because they do not require labeled data. In this paper, we propose an unsupervised graph-based Tibetan multi-document summarization method, which divides a large number of Tibetan news documents into topics and extracts a summary for each topic. Summaries obtained with traditional graph-based methods have high redundancy, and their division of documents into topics is not fine-grained enough. For topic division, we adopt a two-level clustering method that converts the original documents into document-level and sentence-level graphs; we then take both linguistic and deep representations into account and integrate an external corpus into the graph to obtain semantic sentence clusters. This addresses the shortcomings of the traditional K-Means clustering method and yields a more fine-grained clustering of documents. We then model the sentence clusters as graphs and finally re-score the sentence nodes based on the topic semantic information and the impact of topic features on sentences, so that a summary with higher topic relevance is extracted. To promote the development of Tibetan text summarization and to meet researchers' need for high-quality Tibetan text summarization datasets, this paper manually constructs a Tibetan summarization dataset and carries out relevant experiments. The experimental results show that our method effectively improves summary quality and is competitive with previous unsupervised methods.
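
To make the general idea of graph-based extractive summarization concrete, the following is a minimal sketch of a generic TextRank-style ranker, not the method proposed in the paper. It assumes sentences are already segmented and tokenized by whitespace (a placeholder only; Tibetan text would need a proper word or syllable segmenter), uses bag-of-words cosine similarity as edge weights, and omits the two-level clustering and topic-based re-scoring described above. The function names `cosine`, `rank_sentences`, and `summarize` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style power iteration on a similarity graph."""
    # Assumption: whitespace tokenization stands in for a real segmenter.
    vectors = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Edge weights are pairwise cosine similarities; self-loops are excluded.
    weights = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to edge weight.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize(sentences, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    docs = [
        "the flood hit the town",
        "the flood damaged the town bridge",
        "rescue teams reached the town after the flood",
        "a football match was held on sunday",
    ]
    print(summarize(docs, k=2))
```

In this sketch a sentence that shares no words with the others receives only the baseline score, so off-topic sentences rank last. A topic-aware variant in the spirit of the abstract would run the ranker separately on each sentence cluster and adjust the scores with topic features.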

A dataset of Tibetan text summarization

Unsupervised Graph-Based Tibetan Multi-Document Summarization
