Abstract: As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Cohere. We find that GPT-3.5 and Cohere fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude seems to recognize important information. We also conduct a comprehensive investigation on the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4. The generated summaries and evaluation code are available at https://github.com/ChicagoHAI/characterizing-multimodal-long-form-summarization.

What problem does this paper attempt to address?

This paper addresses the following problems:

1. **Characterizing multimodal long-form summarization**: As large language models (LLMs) become able to process long inputs, a rigorous and systematic analysis of how they summarize long documents becomes necessary. Financial reports serve as the case study because they are not only long but also make extensive use of numbers and tables. The paper proposes a computational framework for characterizing multimodal long-form summarization and examines the behavior of several models (Claude 2.0/2.1, GPT-4/3.5, and Cohere).
2. **Model performance evaluation**: GPT-3.5 and Cohere perform poorly on this task, while Claude 2 and GPT-4 show real capability. For Claude 2 and GPT-4, the authors analyze the extractiveness of the summaries and identify a position bias in how the models draw on the input. For Claude, this position bias disappears once the input is shuffled, which suggests that Claude can identify important information; GPT-4 continues to favor the beginning of the input.
3. **Use of numeric data and hallucination**: Given the importance of numeric data in financial reports, the paper investigates how the models use numbers when generating summaries, proposes a taxonomy of numeric hallucination, and attempts to improve GPT-4's use of numbers through prompt engineering. The results indicate that the models hallucinate numbers in about 5% of cases and still struggle to capture the semantic relationship between numeric values and their textual descriptions.
4. **Effect of position bias**: The paper also examines how the models behave on incoherent text. For example, when the paragraphs of the input report are randomly shuffled, Claude 2 still identifies important information, while GPT-4 still tends to extract from the beginning of the input, even though that part is no longer meaningful after shuffling.

In summary, the paper uses financial report summarization as a case study to reveal the capabilities and limitations of large language models on multimodal long-form summarization, particularly with respect to information extraction, the use of numeric data, and model behavior patterns.
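To make the idea of checking numbers in a summary against the source concrete, here is a minimal Python sketch. It is an illustration only, not the authors' evaluation code: the function names `extract_numbers` and `ungrounded_numbers`, the regular expression, and the toy sentences are all assumptions. It flags numeric values that appear in a summary but nowhere in the source, which is only a crude proxy for numeric hallucination, since a model may legitimately round or compute a value that is never printed verbatim.

```python
import re

# Assumed pattern: a digit, optional digits/thousands separators, optional decimals.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set[float]:
    """Return the numeric values found in text, ignoring thousands separators."""
    return {float(m.group().replace(",", "")) for m in NUMBER_RE.finditer(text)}


def ungrounded_numbers(source: str, summary: str) -> list[float]:
    """Return numbers that occur in the summary but not in the source."""
    return sorted(extract_numbers(summary) - extract_numbers(source))


# Toy example (invented text, not taken from any real report).
source = "Revenue was $1,250.5 million in 2023, up 4.2% from 2022."
summary = "Revenue reached $1,250.5 million in 2023, up 6.2%."
print(ungrounded_numbers(source, summary))  # [6.2]
```

A check like this cannot tell whether a number that does appear in the source is attached to the right quantity, which is the semantic-relationship problem noted above.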

Characterizing Multimodal Long-form Summarization: A Case Study on Financial Reports

Hybrid Long Document Summarization using C2F-FAR and ChatGPT: A Practical Study

CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels

Evaluating Large Language Models on Financial Report Summarization: An Empirical Study

Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models

Advanced NLP Techniques for Summarizing Multilingual Financial Narratives from Global Annual Reports.

Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents

Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method

Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles

Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization

GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization

Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency

LCFO: Long Context and Long Form Output Dataset and Benchmarking

Source Code Summarization in the Era of Large Language Models

Controllable Multi-document Summarization: Coverage & Coherence Intuitive Policy with Large Language Model Based Rewards

Tell me what I need to know: Exploring LLM-based (Personalized) Abstractive Multi-Source Meeting Summarization

TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale

Can Large Language Models Serve as Evaluators for Code Summarization?