Abstract: LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real-world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.

What problem does this paper attempt to address?

This paper asks how to build a system for unstructured data analysis that lets users run complex semantic analyses through natural-language queries. Specifically, the system needs to:

1. **Go beyond traditional Retrieval-Augmented Generation (RAG) methods**: Existing RAG methods struggle with complex questions, large-scale data, and complex data types. Context windows are limited, embedding models have difficulty distinguishing between data chunks, and documents containing tables, charts, or images are handled poorly.
2. **Support multiple types of natural-language queries**: Users want to run "sweep and harvest" queries, which traverse a large collection of documents, apply semantic operations described in natural language (such as filtering, extracting, or summarizing information), and then synthesize an answer. The system should also support "data integration" queries, which combine information from multiple document collections.
3. **Ensure the interpretability and accuracy of results**: Particularly in fields such as finance, healthcare, and government intelligence, users need accurate, interpretable answers that avoid the hallucinations LLMs can produce.

To address these problems, the authors designed a system named Aryn, which uses large language models (LLMs) to analyze unstructured data. The main components of Aryn are:

- **Sycamore**: A declarative document processing engine for analyzing, enriching, and transforming complex unstructured documents.
- **Luna**: A query planner that converts natural-language queries into Sycamore scripts.
- **Aryn Partitioner**: Converts raw PDFs and document images into DocSets for downstream processing.

Together, these components let Aryn perform complex semantic analysis over large collections of unstructured data and return interpretable results.
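The "sweep and harvest" pattern described above can be illustrated with a minimal, self-contained Python sketch. This is not Sycamore's API: the `Document` class, the `sweep_and_harvest` function, and the three helper functions are hypothetical names invented for this illustration, and simple string rules stand in for the LLM calls a real system would make at the filter and extract steps. The sample report texts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for one element of a document collection."""
    text: str
    properties: dict = field(default_factory=dict)


def sweep_and_harvest(docs, keep, extract, synthesize):
    """Sweep every document, keep the relevant ones, enrich them, then reduce."""
    kept = [d for d in docs if keep(d)]        # semantic filter
    for d in kept:
        d.properties.update(extract(d))        # semantic extraction
    return synthesize(kept)                    # final synthesis


# Assumption: keyword and string rules replace LLM-backed operations.
def mentions_wind(doc):
    return "wind" in doc.text.lower()


def extract_state(doc):
    # Toy rule: the state is whatever follows the last comma.
    return {"state": doc.text.rsplit(",", 1)[-1].strip()}


def count_by_state(docs):
    counts = {}
    for d in docs:
        state = d.properties["state"]
        counts[state] = counts.get(state, 0) + 1
    return counts


reports = [
    Document("Loss of control in gusting wind during landing, Alaska"),
    Document("Engine failure after takeoff, Texas"),
    Document("Wind shear encountered on final approach, Alaska"),
]

answer = sweep_and_harvest(reports, mentions_wind, extract_state, count_by_state)
print(answer)  # {'Alaska': 2}
```

Because each stage is an ordinary function over documents, every intermediate result (which documents survived the filter, which properties were extracted) can be inspected, which is one plausible way such a pipeline supports the interpretability requirement listed above.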

The Design of an LLM-powered Unstructured Analytics System
