Abstract: Machine learning (ML) techniques for optimizing data management problems have been extensively studied and widely deployed in the past five years. However, traditional ML methods are limited in generalizability (adapting to different scenarios) and inference ability (understanding the context). Fortunately, large language models (LLMs) have shown high generalizability and human-competitive abilities in understanding context, which makes them promising for data management tasks (e.g., database diagnosis, database tuning). However, existing LLMs have several limitations: hallucination, high cost, and low accuracy on complicated tasks. To address these challenges, we design LLMDB, an LLM-enhanced data management paradigm that offers generalizability and high inference ability while avoiding hallucination, reducing LLM cost, and achieving high accuracy. LLMDB embeds domain-specific knowledge through LLM fine-tuning and prompt engineering to avoid hallucination. LLMDB reduces the high cost of LLMs with vector databases, which provide semantic search and caching abilities. LLMDB improves task accuracy with an LLM agent that provides multi-round inference and pipeline execution. We showcase three real-world scenarios that LLMDB supports well: query rewrite, database diagnosis, and data analytics. We also summarize the open research challenges of LLMDB.
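
The caching idea described in the abstract (answering semantically similar requests from a vector store instead of calling the LLM again) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not LLMDB's implementation: `embed`, `SemanticCache`, `fake_llm`, and the 0.8 similarity threshold are all hypothetical, the bag-of-words embedding stands in for a learned embedding model, and a plain in-memory list stands in for a vector database.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache: reuse an answer when a similar request was seen."""

    def __init__(self, llm, threshold: float = 0.8):
        self.llm = llm                # callable: request text -> answer text
        self.threshold = threshold    # assumed similarity cutoff for a hit
        self.entries = []             # list of (embedding, answer) pairs
        self.llm_calls = 0            # counts how often the LLM was invoked

    def ask(self, request: str) -> str:
        query_vec = embed(request)
        best_score, best_answer = 0.0, None
        # Linear scan; a real vector database would use an index instead.
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return best_answer        # cache hit: no LLM call
        self.llm_calls += 1           # cache miss: pay for one LLM call
        answer = self.llm(request)
        self.entries.append((query_vec, answer))
        return answer


def fake_llm(request: str) -> str:
    """Placeholder for an expensive LLM call."""
    return f"answer to: {request}"


cache = SemanticCache(fake_llm)
first = cache.ask("why is the orders query slow")
second = cache.ask("Why is the orders query slow?")       # near-duplicate
third = cache.ask("how do I tune the buffer pool size")   # unrelated request
```

In this sketch the second request is served from the cache, so only two of the three requests reach the placeholder LLM. The threshold trades cost against the risk of returning an answer to a question that only looks similar.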

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to design a data management framework based on large language models (LLMs), called LLMDB, to overcome the limitations of traditional machine learning methods in data management tasks, particularly in terms of generalization and reasoning capabilities.

#### Main Issues

1. **Generalization and Reasoning Capabilities**: Traditional machine learning methods struggle to adapt to different databases, query workloads, and hardware environments, and they lack the ability to understand context and perform multi-step reasoning.
2. **Limitations of LLMs**: Existing large language models suffer from hallucination, high cost, and low accuracy on complex tasks.

#### Solution

The paper proposes LLMDB, a data management paradigm based on LLMs, which has good generalization and reasoning capabilities while avoiding hallucination, reducing the cost of LLMs, and improving task accuracy.

#### Key Technologies

1. **Embedding Domain Knowledge**: Embedding domain-specific knowledge through fine-tuning and prompt engineering to reduce hallucination.
2. **Vector Database**: Utilizing vector databases to provide semantic search and caching functions, reducing the overhead of LLM calls.
3. **Multi-round Reasoning and Pipeline Execution**: Providing multi-round reasoning and pipeline execution through LLM agents to improve task accuracy.

#### Application Scenarios

1. **Query Rewriting**: Transforming SQL queries into equivalent but more efficient queries.
2. **Database Diagnostics**: Automatically identifying anomalies in database systems and proposing solutions.
3. **Natural Language Data Analysis**: Supporting users in performing data analysis using natural language.

#### Research Challenges

1. How to effectively understand user requests and generate execution pipelines?
2. How to select high-quality execution operations and combine them into high-quality execution pipelines?
3. How to design high-performance execution agents that utilize multiple operations to effectively answer complex tasks?
4. How to choose effective embedding methods to capture domain-specific similarities?
5. How to balance LLM fine-tuning and prompt engineering?

In summary, this paper aims to address the limitations of existing LLMs in data management through LLMDB and demonstrates its applications in query rewriting, database diagnostics, and natural language data analysis.
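
To make the query rewriting scenario concrete, the sketch below shows one classic rewrite (an `IN` subquery turned into a join) and a simple sanity check that both forms return the same rows. The schema, data, and the `result_set` helper are hypothetical and chosen only for illustration; the rewrite is a textbook example, not one taken from the paper. Whether the rewritten form is actually faster depends on the database engine, and matching results on one dataset is a sanity check rather than a proof of equivalence.

```python
import sqlite3

# Hypothetical schema and data, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 75.0), (12, 3, 20.0), (13, 1, 5.0);
""")

# Original query: filter orders with an IN subquery.
original = """
    SELECT o.id FROM orders o
    WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.region = 'EU')
"""

# Candidate rewrite: the same filter expressed as a join.
# This is equivalent here because customers.id is a primary key (unique),
# so the join cannot duplicate order rows.
rewritten = """
    SELECT o.id FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.region = 'EU'
"""


def result_set(sql: str) -> list:
    """Run a query and return its rows in sorted order for comparison."""
    return sorted(conn.execute(sql).fetchall())


# Sanity check on this dataset only; it does not prove general equivalence.
same_results = result_set(original) == result_set(rewritten)
```

A check like this could serve as a guard before accepting an LLM-proposed rewrite, since a hallucinated rewrite that changes the result set would be caught on test data.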

LLM-Enhanced Data Management

DB-GPT: Large Language Model Meets Database

Demystifying Data Management for Large Language Models

LLM As DBA

Trustworthy and Efficient LLMs Meet Databases

Relational Database Augmented Large Language Model

Making LLMs Work for Enterprise Data Tasks

Applications and Challenges for Large Language Models: from Data Management Perspective

The Unreasonable Effectiveness of LLMs for Query Optimization

New Solutions on LLM Acceleration, Optimization, and Application

A Unified Transferable Model for ML-Enhanced DBMS

Data Management for Machine Learning: A Survey

LLMs as On-demand Customizable Service

MLog: towards declarative in-database machine learning

A Survey on Human-Centric LLMs

XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare

Understanding LLMs: A Comprehensive Overview from Training to Inference

LawLLM: Law Large Language Model for the US Legal System

Machine Learning for Data Management: A System View

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Enhancing Text-to-SQL Capabilities of Large Language Models via Domain Database Knowledge Injection