DataLab: A Unified Platform for LLM-Powered Business Intelligence

Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Peng Chen, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, Haonan Wang, Canshi Wei, Xiaofeng Yang, Yuhui Zhang, Yifeng Zheng, Xiuqi Huang, Minfeng Zhu, Yuxin Ma, Bin Cui, Wei Chen
2024-12-05
Abstract: Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS. The fragmentation of tasks across different data roles and tools leads to inefficiencies and potential errors due to the iterative and collaborative nature of BI. In this paper, we introduce DataLab, a unified BI platform that integrates a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab supports a wide range of BI tasks for different data roles by seamlessly combining LLM assistance with user customization within a single environment. To achieve this unification, we design a domain knowledge incorporation module tailored for enterprise-specific BI tasks, an inter-agent communication mechanism to facilitate information sharing across the BI workflow, and a cell-based context management strategy to enhance context utilization efficiency in BI notebooks. Extensive experiments demonstrate that DataLab achieves state-of-the-art performance on various BI tasks across popular research benchmarks. Moreover, DataLab maintains high effectiveness and efficiency on real-world datasets from Tencent, achieving up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.
Databases, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is that existing large language model (LLM) agents for business intelligence (BI) focus on a single task or stage and do not consider the BI workflow as a whole. This fragmentation of tasks and tools leads to poor information flow, low collaboration efficiency, and potential errors. Specifically, the paper targets three key challenges:

1. **Lack of Domain Knowledge Integration**: Existing research usually builds and evaluates agents on clean, synthetic research benchmarks, whereas actual BI tasks involve large and complex real-world datasets that contain many ambiguities. For example, column names in business data tables may have unclear semantics, and user queries often contain enterprise-specific terms. Agents therefore need extensive domain knowledge to understand the input data and perform well on actual BI tasks.
2. **Insufficient Information Sharing across Tasks**: Different tasks are usually handled by dedicated LLM agents to achieve the best performance on each. For complex BI queries, however, these agents must share information. For example, the data retrieved by the SQL-writing agent must be accurately conveyed to the chart-generation agent. This requires a structured communication mechanism that transfers information between agents efficiently.
3. **The Need for Adaptive LLM Context Management**: LLM agents rely on context windows (i.e., limited input tokens) to complete tasks. In a unified BI platform, multimodal notebooks contain a large amount of context information, such as code snippets and their execution results, or charts and their specifications. To improve system efficiency and cost-effectiveness, an adaptive context management strategy is needed that selectively provides relevant context based on previous states and current user requirements.

To solve these problems, the paper proposes DataLab, a unified BI platform that combines a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab addresses the above challenges through three key modules:

- **Domain Knowledge Integration Module**: Improves the performance of LLM agents on enterprise-specific BI tasks through automated knowledge generation, organization, and utilization.
- **Inter-Agent Communication Module**: Provides a structured communication mechanism that goes beyond pure natural language and uses finite-state machines (FSMs) to control and optimize the information flow between agents.
- **Cell-Based Context Management Module**: Represents the dependencies between notebook cells as a directed acyclic graph (DAG), updates this graph dynamically as users modify cells, and uses it to selectively provide relevant context, which improves context utilization.

With these modules, DataLab improves the accuracy and efficiency of BI tasks on research benchmarks and on real-world enterprise datasets from Tencent, where the paper reports up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.
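The idea of selecting context from a cell dependency graph can be illustrated with a minimal sketch. This is not DataLab's implementation: the cell format (`id`, `defines`, `uses`) and the function names `build_dependency_graph` and `relevant_context` are hypothetical, and dependencies are inferred here simply from which variables each cell reads and writes. Because a cell can only depend on cells that appear earlier in the notebook, the resulting graph is acyclic.

```python
from collections import deque


def build_dependency_graph(cells):
    """Map each cell id to the ids of earlier cells whose variables it reads.

    Assumption: each cell is a dict with hypothetical keys "id", "defines",
    and "uses", and `cells` is given in notebook order.
    """
    last_definer = {}  # variable name -> id of the most recent cell defining it
    parents = {}
    for cell in cells:
        parents[cell["id"]] = {
            last_definer[var] for var in cell["uses"] if var in last_definer
        }
        for var in cell["defines"]:
            last_definer[var] = cell["id"]
    return parents


def relevant_context(cells, target_id):
    """Return the target cell and all of its ancestors, in notebook order."""
    # Rebuilding the graph on every call stands in for dynamic updates
    # after the user edits a cell.
    parents = build_dependency_graph(cells)
    seen = {target_id}
    queue = deque([target_id])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return [cell["id"] for cell in cells if cell["id"] in seen]


cells = [
    {"id": "c1", "defines": {"sales"}, "uses": set()},
    {"id": "c2", "defines": {"users"}, "uses": set()},
    {"id": "c3", "defines": {"monthly"}, "uses": {"sales"}},
    {"id": "c4", "defines": {"chart"}, "uses": {"monthly"}},
]

print(relevant_context(cells, "c4"))  # ['c1', 'c3', 'c4']
```

In this toy notebook, a request about cell `c4` would pass only `c1`, `c3`, and `c4` to the LLM and leave out the unrelated cell `c2`, which is the kind of selective context the module aims for to reduce token cost.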