Abstract: News media play a significant role in our political and daily lives. The traditional approach to news summarization in media analysis is labor intensive. As the amount of news data grows rapidly, there is an acute need for automatic and scalable methods that let media analysis researchers screen corpora of news articles quickly before detailed reading. In this paper we propose a general framework for subject-specific summarization of document corpora, with news articles as a special case. We use a state-of-the-art scalable and sparse statistical predictive framework to generate a list of short words/phrases as a summary of a subject. In particular, for a given subject of interest (e.g., China), we first create a list of words/phrases to represent this subject (e.g., China, China's, and Chinese) and then automatically label each document according to the appearance pattern of this list in the document. The predictor vector is then high dimensional and contains counts of the remaining words/phrases in the documents, excluding phrases that overlap the subject list. Moreover, we consider several preprocessing schemes, including document unit choice, labeling scheme, tf-idf representation, and L2 normalization, to prepare the text data before applying the sparse predictive framework. We examined four scalable feature selection methods for summary list generation: phrase co-occurrence, phrase correlation, L1-regularized logistic regression (L1LR), and L1-regularized linear regression (Lasso). We carefully designed and conducted a human survey to compare the different summarizers with human understanding based on news

* Miratrix and Jia are co-first authors.
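The labeling and feature-construction steps described in the abstract, together with the two simplest of the four selection methods (phrase co-occurrence and phrase correlation), can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus, the single-word tokenizer, the binary labeling rule (a document is positive if any subject term appears), and all function names are hypothetical. The exact scoring definitions in the paper may differ; here co-occurrence is taken as the total count of a word in positively labeled documents, and correlation as the Pearson correlation between a word's counts and the labels.

```python
import math
import re
from collections import Counter

# Hypothetical subject list (mirrors the abstract's example) and toy corpus.
SUBJECT_TERMS = {"china", "chinas", "chinese"}
DOCS = [
    "China trade talks continue as Beijing weighs tariffs",
    "Chinese exports rise while Beijing eases trade rules",
    "Local council approves new park budget",
    "Football season opens with record crowds",
    "Beijing hosts summit on trade with China",
    "Council debates park budget again",
]


def tokenize(text):
    """Lowercase and split into alphabetic tokens (single words only)."""
    return re.findall(r"[a-z]+", text.lower())


def label_and_features(docs, subject_terms):
    """Label each document by subject-term presence (assumed rule) and
    count the remaining words, excluding the subject terms themselves."""
    labels, features = [], []
    for doc in docs:
        tokens = tokenize(doc)
        labels.append(1 if any(t in subject_terms for t in tokens) else 0)
        features.append(Counter(t for t in tokens if t not in subject_terms))
    return labels, features


def cooccurrence_scores(labels, features):
    """Total count of each word within positively labeled documents."""
    scores = Counter()
    for y, feats in zip(labels, features):
        if y == 1:
            scores.update(feats)
    return scores


def correlation_scores(labels, features):
    """Pearson correlation between each word's count vector and the labels."""
    n = len(labels)
    mean_y = sum(labels) / n
    var_y = sum((y - mean_y) ** 2 for y in labels)
    scores = {}
    for word in set().union(*features):
        x = [feats[word] for feats in features]
        mean_x = sum(x) / n
        var_x = sum((v - mean_x) ** 2 for v in x)
        cov = sum((v - mean_x) * (y - mean_y) for v, y in zip(x, labels))
        denom = math.sqrt(var_x * var_y)
        scores[word] = cov / denom if denom > 0 else 0.0
    return scores


def top_k(scores, k):
    """Return the k highest-scoring words, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    labels, features = label_and_features(DOCS, SUBJECT_TERMS)
    print("co-occurrence:", top_k(cooccurrence_scores(labels, features), 2))
    print("correlation:  ", top_k(correlation_scores(labels, features), 2))
```

The two L1-regularized methods (L1LR and Lasso) would replace the scoring functions with a sparse regression of the labels on the same count features, keeping the words with nonzero coefficients as the summary; they are omitted here to keep the sketch dependency-free.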

LCSTS: A Large Scale Chinese Short Text Summarization Dataset

CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level

CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive Summaries

TGSum: Build Tweet Guided Multi-Document Summarization Dataset

Overview of the NLPCC 2015 Shared Task: Weibo-Oriented Chinese News Summarization

CNTLS: A Benchmark Dataset for Abstractive or Extractive Chinese Timeline Summarization

A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models

Long Text and Multi-Table Summarization: Dataset and Method

CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization

A Survey on Cross-Lingual Summarization

Concise Comparative Summaries (CCS) of Large Text Corpora with a Human Experiment

What is in the news on a subject: automatic and sparse summarization of large document corpora

CiteSum: Citation Text-guided Scientific Extreme Summarization and Domain Adaptation with Limited Supervision

NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization

BookSum: A Collection of Datasets for Long-form Narrative Summarization

IndoSum: A New Benchmark Dataset for Indonesian Text Summarization

Liputan6: A Large-scale Indonesian Dataset for Text Summarization

X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents

WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations

LMGQS: A Large-scale Dataset for Query-focused Summarization