Examining Long-Context Large Language Models for Environmental Review Document Comprehension

Hung Phan,Anurag Acharya,Rounak Meyur,Sarthak Chaturvedi,Shivam Sharma,Mike Parker,Dan Nally,Ali Jannesari,Karl Pazdernik,Mahantesh Halappanavar,Sai Munikoti,Sameera Horawalavithana

2024-10-16

Abstract: As LLMs become increasingly ubiquitous, researchers have tried various techniques to augment the knowledge provided to these models. Long context and retrieval-augmented generation (RAG) are two such methods that have recently gained popularity. In this work, we examine the benefits of both of these techniques by utilizing a question answering (QA) task in a niche domain. While the effectiveness of LLM-based QA systems has already been established at an acceptable level in popular domains such as trivia and literature, it has not often been established in niche domains that traditionally require specialized expertise. We construct the NEPAQuAD1.0 benchmark to evaluate the performance of five long-context LLMs -- Claude Sonnet, Gemini, GPT-4, Llama 3.1, and Mistral -- when answering questions originating from Environmental Impact Statements prepared by U.S. federal government agencies in accordance with the National Environmental Policy Act (NEPA). We specifically measure the ability of LLMs to understand the nuances of legal, technical, and compliance-related information present in NEPA documents in different contextual scenarios. We test the LLMs' internal prior NEPA knowledge by providing questions without any context, and we assess how LLMs synthesize the contextual information present in long NEPA documents to facilitate the question-answering task. We compare the performance of the models in handling different types of questions (e.g., problem-solving, divergent, etc.). Our results suggest that RAG-powered models significantly outperform those provided with only the PDF context in terms of answer accuracy, regardless of the choice of the LLM. Our further analysis reveals that many models perform better at answering closed-type questions (Yes/No) than divergent and problem-solving questions.

Computation and Language

What problem does this paper attempt to address?

This paper addresses how to use large language models (LLMs) for effective question answering over long documents in specialized domains. Specifically, it focuses on Environmental Impact Statement (EIS) documents, which are usually very long and contain complex legal, technical, and compliance-related information. The paper constructs a new benchmark dataset (NEPAQuAD1.0) to evaluate the performance of five long-context LLMs (Claude Sonnet, Gemini, GPT-4, Llama 3.1, and Mistral) under different context conditions, with particular attention to how the models handle different types of questions, such as closed-ended, divergent, and problem-solving questions. The paper also compares model performance when the context is the PDF content alone versus context supplied by retrieval-augmented generation (RAG).

The main contributions of the paper include:
1. Creating the first preliminary benchmark (NEPAQuAD1.0) for automatically evaluating the performance of LLMs on EIS document question-answering tasks.
2. Evaluating the capabilities of LLMs on long-document question-answering tasks.
3. Conducting a rigorous comparison of different types of context (zero-shot prompts with no context, paragraphs, PDF, RAG) to evaluate the models' performance.

Through these studies, the paper aims to provide more effective solutions for long-document question answering in specialized domains.
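The four context settings compared above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the mode names, the `build_prompt` helper, the prompt wording, and the word-overlap retriever are all assumptions introduced here for illustration. A real RAG setup would typically use an embedding-based retriever over chunks of the EIS document.

```python
from typing import List, Optional

# Hypothetical labels for the four context settings described in the summary.
CONTEXT_MODES = ("none", "paragraph", "pdf", "rag")


def overlap_retriever(question: str, chunks: List[str], top_k: int) -> List[str]:
    """Toy stand-in for a retriever: rank chunks by words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(
    question: str,
    mode: str,
    paragraph: Optional[str] = None,
    document_chunks: Optional[List[str]] = None,
    top_k: int = 2,
) -> str:
    """Assemble a QA prompt under one of the four (assumed) context settings."""
    chunks = document_chunks or []
    if mode == "none":
        context = ""  # zero-shot: the model relies on prior knowledge only
    elif mode == "paragraph":
        context = paragraph or ""  # only the paragraph tied to the question
    elif mode == "pdf":
        context = "\n".join(chunks)  # the whole document text
    elif mode == "rag":
        context = "\n".join(overlap_retriever(question, chunks, top_k))
    else:
        raise ValueError(f"unknown mode: {mode!r}; expected one of {CONTEXT_MODES}")

    if not context:
        return f"Question: {question}\nAnswer:"
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    chunks = [
        "The project affects wetlands near the river.",
        "Noise levels were measured at night.",
        "Budget tables appear in the appendix.",
    ]
    question = "Does the project affect wetlands?"
    for mode in CONTEXT_MODES:
        print(f"--- {mode} ---")
        print(build_prompt(question, mode, paragraph=chunks[0], document_chunks=chunks, top_k=1))
```

Keeping the prompt template identical across modes means any difference in answer quality can be attributed to the context supplied rather than to the instructions.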

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

Investigating Answerability of LLMs for Long-Form Question Answering

Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

LooGLE: Can Long-Context Language Models Understand Long Contexts?

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Comparison of Large Language Models for Generating Contextually Relevant Questions

Context Matter: Data-Efficient Augmentation of Large Language Models for Scientific Applications

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

Systematic Evaluation of Long-Context LLMs on Financial Concepts

Enhancing Large Language Models' Situated Faithfulness to External Contexts

Long Context RAG Performance of Large Language Models

Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset

ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Reporting and Analysing the Environmental Impact of Language Models on the Example of Commonsense Question Answering with External Knowledge

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

T-RAG: Lessons from the LLM Trenches

M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering