Abstract: Text summarization is a downstream natural language processing (NLP) task that tests both the understanding and the generation capabilities of language models. Automatic summarization of short texts, such as news articles, has progressed considerably and often yields satisfactory results. Summarizing long documents, however, remains a major challenge, both because of the complex contextual information in such texts and because open-source benchmark datasets and evaluation frameworks for developing and testing models are lacking. In this work, we combine ChatGPT, the latest breakthrough in the field of large language models (LLMs), with the extractive summarization model C2F-FAR (Coarse-to-Fine Facet-Aware Ranking) to propose a hybrid extraction and summarization pipeline for long documents such as business articles and books. We work with the world-renowned company getAbstract AG and leverage their expertise and experience in professional book summarization. Our practical study shows that machine-generated summaries can perform at least as well as human-written summaries when evaluated using current automated evaluation metrics. However, a closer examination of the texts generated by ChatGPT through human evaluation reveals that critical issues remain in text coherence, faithfulness, and style. Overall, our results show that using ChatGPT is a very promising but not yet mature approach to summarizing long documents, and that its output can at best serve as an inspiration for human editors. We anticipate that our work will inform NLP researchers about the extent to which ChatGPT's capabilities for summarizing long documents overlap with practitioners' needs. Further work is needed to test the proposed hybrid summarization pipeline, in particular with GPT-4, and to propose a new evaluation framework tailored to the task of summarizing long documents.
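The extract-then-abstract structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the extractive stage below uses a simple word-frequency score as a stand-in for C2F-FAR, and the abstractive stage is a caller-supplied function standing in for a ChatGPT call. All function names (`split_sentences`, `extract_top_sentences`, `hybrid_summarize`) are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_top_sentences(text: str, k: int) -> List[str]:
    # Stand-in for the extractive stage (NOT C2F-FAR): score each sentence
    # by the average document-level frequency of its words.
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:k])  # restore original document order
    return [sentences[i] for i in keep]


def hybrid_summarize(text: str, k: int,
                     abstractive_step: Callable[[str], str]) -> str:
    # Stage 1: shrink the long document to its k highest-scoring sentences.
    extract = " ".join(extract_top_sentences(text, k))
    # Stage 2: hand the extract to an abstractive model (assumed to be an
    # LLM call in practice; here any str -> str function will do).
    return abstractive_step(extract)


if __name__ == "__main__":
    doc = ("Patents protect inventions. "
           "Patents and inventions matter to inventors of patents. "
           "The weather was nice.")
    # Placeholder abstractive step so the sketch runs without any API.
    print(hybrid_summarize(doc, 2, lambda extract: extract.upper()))
```

The design point the sketch is meant to show is that the extractive stage reduces the input to a size an abstractive model can handle, so the two stages can be swapped out independently.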

Automatic Abstraction of Long Chinese Patent Texts Based on P-Bertsum Model

Automatic summarization of long text of Chinese patents based on PatBertsum model

An Automatic Generation Method of Patent Specification Abstract Based on "Extraction-Abstraction" Model

Exploiting Semantic Knowledge Base for Patent Retrieval

A Semantic Query Expansion-Based Patent Retrieval Approach

An Ontology-Based Automatic Semantic Annotation Approach for Patent Document Retrieval in Product Innovation Design

The patent mining analysis method based on Chinese word segmentation

The Master-Slave Encoder Model for Improving Patent Text Summarization: A New Approach to Combining Specifications and Claims

BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization

Towards Accurate Word Segmentation for Chinese Patents

Automatic patent document summarization for collaborative knowledge systems and services

A Patent Keyword Extraction Method Based on Corpus Classification

A patent retrieval method based on automatic query expansion

Patent Keyword Extraction Algorithm Based on Distributed Representation for Patent Classification

Semantic Similarity Matching for Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method

Efficient Two-stage Approach for Long Document Summarization

Domain Lexicon-Based Query Expansion for Patent Retrieval

Hybrid Long Document Summarization using C2F-FAR and ChatGPT: A Practical Study

Automatic Summarization of Long Documents

PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT

Building a Large English-Chinese Parallel Corpus from Comparable Patents and Its Experimental Application to SMT