What is in the news on a subject: automatic and sparse summarization of large document corpora

Luke Miratrix, Jinzhu Jia, Brian Gawalt, Bin Yu, Laurent El Ghaoui
2011-01-01
Abstract: News media play a significant role in our political and daily lives. The traditional approach to news summarization in media analysis is labor intensive. As the amount of news data grows rapidly, there is an acute need for automatic and scalable methods that let media analysis researchers screen corpora of news articles quickly before detailed reading. In this paper we propose a general framework for subject-specific summarization of document corpora, with news articles as a special case. We use a state-of-the-art scalable and sparse statistical predictive framework to generate a list of short words/phrases as a summary of a subject. Specifically, for a given subject of interest (e.g., China), we first create a list of words/phrases to represent the subject (e.g., China, China's, and Chinese) and then automatically label each document according to the pattern in which the listed terms appear in it. The predictor vector is then high-dimensional and contains the counts of the remaining words/phrases in the documents, excluding phrases that overlap the subject list. Moreover, we consider several preprocessing schemes, including document unit choice, labeling scheme, tf-idf representation, and L2 normalization, to prepare the text data before applying the sparse predictive framework. We examined four different scalable feature selection methods for summary list generation: phrase co-occurrence, phrase correlation, L1-regularized logistic regression (L1LR), and L1-regularized linear regression (Lasso). We carefully designed and conducted a human survey to compare the different summarizers with human understanding based on news
* Miratrix and Jia are co-first authors.
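The pipeline described in the abstract can be illustrated with a minimal sketch. It covers only the simplest of the four methods (phrase co-occurrence) and rests on several assumptions that are not taken from the paper: single whitespace-separated words stand in for phrases, a document is labeled positive if any subject term appears in it, idf is computed as log(n / df), and the function name `summarize_subject` is hypothetical. The paper's actual labeling rules, preprocessing, and scoring may differ.

```python
import math
from collections import Counter


def summarize_subject(docs, subject_terms, k=5):
    """Rank words by tf-idf-weighted co-occurrence with a subject.

    Illustrative only: whitespace tokenization, binary labels and
    idf = log(n / df) are assumptions of this sketch.
    """
    subject = {term.lower() for term in subject_terms}
    tokenized = [doc.lower().split() for doc in docs]

    # Automatic label: 1 if any subject term appears in the document.
    labels = [int(any(tok in subject for tok in toks)) for toks in tokenized]

    # Predictor counts exclude the subject terms themselves.
    counts = [Counter(t for t in toks if t not in subject) for toks in tokenized]

    # Document frequency of each remaining word, used for the idf weight.
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    scores = Counter()
    for c, label in zip(counts, labels):
        # tf-idf weights for this document.
        weights = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
        # L2 normalization so that long documents do not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if label == 1 and norm > 0:
            for t, w in weights.items():
                scores[t] += w / norm

    # Highest-scoring words form the summary; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return labels, [t for t, s in ranked[:k] if s > 0]
```

Calling `summarize_subject(docs, ["china", "chinese"])` on a list of article strings returns the automatic labels together with the top-ranked words. Swapping the scoring loop for an L1-regularized regression of the labels on the normalized features would correspond to the Lasso or L1LR variants named in the abstract.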