Examining Long-Context Large Language Models for Environmental Review Document Comprehension

Hung Phan, Anurag Acharya, Rounak Meyur, Sarthak Chaturvedi, Shivam Sharma, Mike Parker, Dan Nally, Ali Jannesari, Karl Pazdernik, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana
2024-10-16
Abstract: As LLMs become increasingly ubiquitous, researchers have tried various techniques to augment the knowledge provided to these models. Long context and retrieval-augmented generation (RAG) are two such methods that have recently gained popularity. In this work, we examine the benefits of both of these techniques by utilizing a question answering (QA) task in a niche domain. While the effectiveness of LLM-based QA systems has already been established at an acceptable level in popular domains such as trivia and literature, it has not often been established in niche domains that traditionally require specialized expertise. We construct the NEPAQuAD1.0 benchmark to evaluate the performance of five long-context LLMs -- Claude Sonnet, Gemini, GPT-4, Llama 3.1, and Mistral -- when answering questions originating from Environmental Impact Statements prepared by U.S. federal government agencies in accordance with the National Environmental Policy Act (NEPA). We specifically measure the ability of LLMs to understand the nuances of legal, technical, and compliance-related information present in NEPA documents in different contextual scenarios. We test the LLMs' internal prior NEPA knowledge by providing questions without any context, and we assess how LLMs synthesize the contextual information present in long NEPA documents to facilitate the question-answering task. We compare the performance of the models in handling different types of questions (e.g., problem-solving, divergent, etc.). Our results suggest that RAG-powered models significantly outperform those provided with only the PDF context in terms of answer accuracy, regardless of the choice of the LLM. Our further analysis reveals that many models perform better when answering closed-type (Yes/No) questions than divergent and problem-solving questions.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to use large language models (LLMs) for effective question answering over long documents in specialized domains. It focuses on Environmental Impact Statement (EIS) documents, which are usually very long and contain complex legal, technical, and compliance information. The paper constructs a new benchmark dataset (NEPAQuAD1.0) to evaluate the performance of five long-context LLMs (Claude Sonnet, Gemini, GPT-4, Llama 3.1, and Mistral) under different context conditions, with particular attention to how the models handle different types of questions, such as closed-ended, divergent, and problem-solving questions. The paper also compares model performance when the context is the PDF content alone against performance when the context is supplied by Retrieval-Augmented Generation (RAG). The main contributions of the paper are:
1. Creating the first preliminary benchmark (NEPAQuAD1.0) for automatically evaluating the performance of LLMs on EIS document question-answering tasks.
2. Evaluating the capabilities of LLMs on long-document question-answering tasks.
3. Conducting a rigorous comparison of different types of context (zero-shot prompts with no context, paragraphs, PDF, and RAG) to evaluate model performance.
Through these studies, the paper aims to support more effective long-document question answering in specialized domains.
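The four context conditions compared above can be illustrated with a minimal Python sketch. This is not the paper's pipeline: the function names (`overlap_score`, `retrieve`, `build_prompt`), the mode labels, the prompt template, and the word-overlap retriever are all hypothetical stand-ins chosen for illustration. A real RAG system would typically use an embedding-based retriever, and the paper's prompt wording is not given in this summary.

```python
from collections import Counter


def overlap_score(question: str, passage: str) -> int:
    """Toy relevance score: number of shared lowercase tokens (an assumption,
    standing in for whatever retriever a real RAG system would use)."""
    q = Counter(question.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values())


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages with the highest toy relevance score."""
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:k]


def build_prompt(question, mode, passages=None, gold_index=None, k=2):
    """Assemble a prompt for one of four hypothetical context modes:
    'none' (zero-shot), 'paragraph' (one known relevant passage),
    'pdf' (the whole document), or 'rag' (top-k retrieved passages)."""
    if mode == "none":
        context = ""
    elif mode == "paragraph":
        context = passages[gold_index]
    elif mode == "pdf":
        context = "\n".join(passages)
    elif mode == "rag":
        context = "\n".join(retrieve(question, passages, k))
    else:
        raise ValueError(f"unknown context mode: {mode}")

    if context:
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"
```

Holding the question fixed and varying only `mode` mirrors the comparison described above: the same question is posed with no context, with a single paragraph, with the full document text, or with retrieved passages, and the answers can then be scored against a reference.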