The Design of an LLM-powered Unstructured Analytics System

Eric Anderson, Jonathan Fritz, Austin Lee, Bohou Li, Mark Lindblad, Henry Lindeman, Alex Meyer, Parth Parmar, Tanvi Ranade, Mehul A. Shah, Benjamin Sowell, Dan Tecuci, Vinayak Thapliyal, Matt Welsh
2024-09-05
Abstract: LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.
Databases, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses how to build a system for analytics over unstructured data, one that lets users run complex semantic analyses through natural-language queries. Specifically, the system must:

1. **Go beyond traditional Retrieval-Augmented Generation (RAG) methods**: Existing RAG methods fall short on complex questions, large-scale data, and complex data types. Their limitations include a limited context window, embedding models that struggle to distinguish between data chunks, and an inability to handle complex documents containing tables, charts, or images.
2. **Support multiple types of natural-language queries**: Users want to run "sweep and harvest" queries, which traverse a large collection of documents, perform semantic operations described in natural language (such as filtering, extracting, or summarizing information), and finally synthesize an answer. The system also supports "data integration" queries, which combine information from multiple document collections.
3. **Ensure the interpretability and accuracy of results**: Especially in fields such as finance, healthcare, and government intelligence, users need accurate and interpretable answers that avoid the hallucinations LLMs can produce.

To solve these problems, the authors designed a system named Aryn, which uses large language models (LLMs) to analyze unstructured data. The main components of Aryn are:

- **Sycamore**: A declarative document-processing engine for analyzing, enriching, and transforming complex unstructured documents.
- **Luna**: A query planner that converts natural-language queries into Sycamore scripts.
- **Aryn Partitioner**: Converts raw PDFs and document images into DocSets for downstream processing.

Together, these components let Aryn perform complex semantic analysis on large-scale unstructured datasets and provide interpretable results.
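
To make the "sweep and harvest" pattern described above more concrete, here is a minimal sketch in plain Python. It does not use Sycamore's actual API. `Document`, `sweep_and_harvest`, and the three helper functions are hypothetical names chosen for illustration, the sample report texts are invented, and simple string matching stands in for the LLM calls a real system would make at the filter and extract steps.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Document:
    """Hypothetical stand-in for one element of a document collection."""
    text: str
    properties: Dict[str, str] = field(default_factory=dict)


def sweep_and_harvest(
    docs: Iterable[Document],
    keep: Callable[[Document], bool],
    extract: Callable[[Document], Dict[str, str]],
    synthesize: Callable[[List[Document]], Dict[str, int]],
) -> Dict[str, int]:
    """Sweep every document, filter and enrich it, then synthesize one answer."""
    harvested = []
    for doc in docs:
        if not keep(doc):  # semantic filter (an LLM call in a real system)
            continue
        # Semantic extraction: attach new properties without mutating the input.
        enriched = Document(doc.text, {**doc.properties, **extract(doc)})
        harvested.append(enriched)
    return synthesize(harvested)


# Stubs below use string matching where a real system would prompt an LLM.
def mentions_wind(doc: Document) -> bool:
    return "wind" in doc.text.lower()


def extract_state(doc: Document) -> Dict[str, str]:
    # Assumes the invented "Accident in <State>." layout of the sample texts.
    state = doc.text.split("Accident in ")[1].split(".")[0]
    return {"state": state}


def count_by_state(docs: List[Document]) -> Dict[str, int]:
    return dict(Counter(d.properties["state"] for d in docs))


reports = [
    Document("Accident in Alaska. Probable cause: strong wind during landing."),
    Document("Accident in Texas. Probable cause: engine failure."),
    Document("Accident in Alaska. Probable cause: wind shear on approach."),
]

result = sweep_and_harvest(reports, mentions_wind, extract_state, count_by_state)
print(result)  # {'Alaska': 2}
```

Because the filter, extract, and synthesize steps are separate functions, each intermediate result can be inspected on its own, which is one way to support the interpretability goal the summary mentions.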