PandasAI Introduction and How It Works_back
Tags: LLM, Data Analysis, Pandas, ChatGPT Prompt
蔡恒兴
Published 2023-09-24
Recommended image: Basic Image:ubuntu:22.04-py3.10-pytorch2.0
Recommended machine type: c2_m4_cpu
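Part 1: Feature Overview
Environment Setup
Data Preparation
Using PandasAI
Instantiating an LLM
Querying
Charts
Multiple DataFrames
Part 2: Source Code Walkthrough
1. Building the Prompt
Example Prompt
2. Generating Python Code
Case 1: Find the 5 happiest countries
Case 2: Sum the GDPs of the 2 unhappiest countries
Case 3: Plot a bar chart of GDP by country
Case 4: Find the GDP of the country with the highest GDP
3. Executing the Python Code
Error Handling and Retries
4. Formatting the Output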

PandasAI is a Python library that adds large language model (LLM) capabilities to pandas, helping you analyze and process data more effectively.

This tutorial has two parts:

  • Part 1 introduces what PandasAI can do.
  • Part 2 walks through the source code to show how PandasAI implements these features.

Part 1: Feature Overview

Environment Setup

First, install the PandasAI library in your Python environment. The commands below clone the repository and install it from source:

!git clone https://github.com/gventuri/pandas-ai.git
Cloning into 'pandas-ai'...
remote: Enumerating objects: 3897, done.
remote: Counting objects: 100% (536/536), done.
remote: Compressing objects: 100% (257/257), done.
remote: Total 3897 (delta 344), reused 416 (delta 259), pack-reused 3361
Receiving objects: 100% (3897/3897), 2.17 MiB | 310.00 KiB/s, done.
Resolving deltas: 100% (2608/2608), done.
Updating files: 100% (167/167), done.
ls
logs/  pandas-ai/
!pip install ./pandas-ai
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing ./pandas-ai
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting astor<0.9.0,>=0.8.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c3/88/97eef84f48fa04fbd6750e62dcceafba6c63c81b7ac1420856c8dcc0a3f9/astor-0.8.1-py2.py3-none-any.whl (27 kB)
Requirement already satisfied: scipy<2.0.0,>=1.9.0 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (1.10.1)
Collecting duckdb<0.9.0,>=0.8.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/0c/a3/4e52ef89606292b26864bcc3be3e36e1345ba4bb8a6df5b2fa36dfc01fd7/duckdb-0.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.9/15.9 MB 55.3 MB/s eta 0:00:0000:0100:01
Collecting pydantic<2,>=1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/bc/e0/0371e9b6c910afe502e5fe18cc94562bfd9399617c7b4f5b6e13c29115b3/pydantic-1.10.12-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 35.6 MB/s eta 0:00:0000:01
Collecting matplotlib<4.0.0,>=3.7.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b5/24/aaccf324ce862bb82277e8814d2aebbb2a2c160d04e95aa2b8c9dc3137a9/matplotlib-3.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.6/11.6 MB 58.5 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: pandas==1.5.3 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (1.5.3)
Collecting sqlalchemy<2.0.0,>=1.4.49
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/77/68/4ce3f0677a4c5f51a91624a7c41921ea39aac1e39502d252ff339ec6cd3b/SQLAlchemy-1.4.49-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 26.8 MB/s eta 0:00:00:00:01
Collecting openai<0.28.0,>=0.27.5
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f1/1f/3a0cb7d172f451b2ca8bf65d9196aa3b6878c010d461257c621e4bd48cad/openai-0.27.10-py3-none-any.whl (76 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.5/76.5 kB 1.7 MB/s eta 0:00:00ta 0:00:01
Collecting python-dotenv<2.0.0,>=1.0.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/44/2f/62ea1c8b593f4e093cc1a7768f0d46112107e790c3e478532329e434f00b/python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting ipython<9.0.0,>=8.13.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/7f/d0/c3eb7b17b013da59925aed7b2e7c55f8f1c9209249316812fe8cb758b337/ipython-8.15.0-py3-none-any.whl (806 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 806.6/806.6 kB 15.9 MB/s eta 0:00:00a 0:00:01
Requirement already satisfied: numpy>=1.21.0 in /opt/mamba/lib/python3.10/site-packages (from pandas==1.5.3->pandasai==1.2.6) (1.24.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/mamba/lib/python3.10/site-packages (from pandas==1.5.3->pandasai==1.2.6) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/mamba/lib/python3.10/site-packages (from pandas==1.5.3->pandasai==1.2.6) (2022.7.1)
Requirement already satisfied: stack-data in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.6.2)
Requirement already satisfied: matplotlib-inline in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.1.6)
Requirement already satisfied: pygments>=2.4.0 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (2.14.0)
Requirement already satisfied: backcall in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.2.0)
Requirement already satisfied: traitlets>=5 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (5.9.0)
Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (3.0.38)
Collecting exceptiongroup
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ad/83/b71e58666f156a39fb29417e4c8ca4bc7400c0dd4ed9e8842ab54dc8c344/exceptiongroup-1.1.3-py3-none-any.whl (14 kB)
Requirement already satisfied: pexpect>4.3 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (4.8.0)
Requirement already satisfied: decorator in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (5.1.1)
Requirement already satisfied: pickleshare in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.7.5)
Requirement already satisfied: jedi>=0.16 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.18.2)
Collecting cycler>=0.10
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/5c/f9/695d6bedebd747e5eb0fe8fad57b72fdf25411273a39791cde838d5a8f51/cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting kiwisolver>=1.0.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/6f/40/4ab1fdb57fced80ce5903f04ae1aed7c1d5939dda4fd0c0aa526c12fe28a/kiwisolver-1.4.5-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 71.0 MB/s eta 0:00:00
Collecting pillow>=6.2.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/7a/07/e896b096a77375e78e02ce222ae4fd6014928cd76c691d312060a1645dfa/Pillow-10.0.1-cp310-cp310-manylinux_2_28_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 44.3 MB/s eta 0:00:0000:01m
Collecting pyparsing>=2.3.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/39/92/8486ede85fcc088f1b3dba4ce92dd29d126fd96b0008ea213167940a2475/pyparsing-3.1.1-py3-none-any.whl (103 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 103.1/103.1 kB 33.8 MB/s eta 0:00:00
Collecting contourpy>=1.0.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f1/6b/e4b0f8708f22dd7c321f87eadbb98708975e115ac6582eb46d1f32197ce6/contourpy-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (301 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.7/301.7 kB 6.9 MB/s eta 0:00:00a 0:00:01
Requirement already satisfied: packaging>=20.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib<4.0.0,>=3.7.1->pandasai==1.2.6) (23.0)
Collecting fonttools>=4.22.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2b/e8/61b8525acf26ec222518bdff127ae502bfa3408981fb5e5493f2b037d7fb/fonttools-4.42.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 71.8 MB/s eta 0:00:00ta 0:00:01
Requirement already satisfied: requests>=2.20 in /opt/mamba/lib/python3.10/site-packages (from openai<0.28.0,>=0.27.5->pandasai==1.2.6) (2.28.1)
Requirement already satisfied: tqdm in /opt/mamba/lib/python3.10/site-packages (from openai<0.28.0,>=0.27.5->pandasai==1.2.6) (4.64.1)
Collecting aiohttp
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/3e/f6/fcda07dd1e72260989f0b22dde999ecfe80daa744f23ca167083683399bc/aiohttp-3.8.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 19.5 MB/s eta 0:00:00ta 0:00:01
Requirement already satisfied: typing-extensions>=4.2.0 in /opt/mamba/lib/python3.10/site-packages (from pydantic<2,>=1->pandasai==1.2.6) (4.5.0)
Collecting greenlet!=0.4.17
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/6e/11/a1f1af20b6a1a8069bc75012569d030acb89fd7ef70f888b6af2f85accc6/greenlet-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (613 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 613.7/613.7 kB 75.5 MB/s eta 0:00:00
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /opt/mamba/lib/python3.10/site-packages (from jedi>=0.16->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.8.3)
Requirement already satisfied: ptyprocess>=0.5 in /opt/mamba/lib/python3.10/site-packages (from pexpect>4.3->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.7.0)
Requirement already satisfied: wcwidth in /opt/mamba/lib/python3.10/site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.2.6)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas==1.5.3->pandasai==1.2.6) (1.16.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/mamba/lib/python3.10/site-packages (from requests>=2.20->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (1.26.11)
Requirement already satisfied: idna<4,>=2.5 in /opt/mamba/lib/python3.10/site-packages (from requests>=2.20->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (3.4)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/mamba/lib/python3.10/site-packages (from requests>=2.20->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (2.1.1)
Requirement already satisfied: certifi>=2017.4.17 in /opt/mamba/lib/python3.10/site-packages (from requests>=2.20->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (2022.9.24)
Collecting frozenlist>=1.1.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1e/28/74b8b6451c89c070d34e753d8b65a1e4ce508a6808b18529f36e8c0e2184/frozenlist-1.4.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (225 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 225.7/225.7 kB 5.2 MB/s eta 0:00:00a 0:00:01
Collecting async-timeout<5.0,>=4.0.0a3
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a7/fa/e01228c2938de91d47b307831c62ab9e4001e747789d0b05baf779a6488c/async_timeout-4.0.3-py3-none-any.whl (5.7 kB)
Collecting aiosignal>=1.1.2
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/76/ac/a7305707cb852b7e16ff80eaf5692309bde30e2b1100a1fcacdc8f731d97/aiosignal-1.3.1-py3-none-any.whl (7.6 kB)
Requirement already satisfied: attrs>=17.3.0 in /opt/mamba/lib/python3.10/site-packages (from aiohttp->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (22.2.0)
Collecting yarl<2.0,>=1.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c9/d4/a5280faa1b8e9ad3a52ddc4c9aea94dd718f9c55f1e10cfb14580f5ebb45/yarl-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (268 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 268.8/268.8 kB 49.6 MB/s eta 0:00:00
Collecting multidict<7.0,>=4.5
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/56/b5/ac112889bfc68e6cf4eda1e4325789b166c51c6cd29d5633e28fb2c2f966/multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114.5/114.5 kB 28.9 MB/s eta 0:00:00
Requirement already satisfied: executing>=1.2.0 in /opt/mamba/lib/python3.10/site-packages (from stack-data->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (1.2.0)
Requirement already satisfied: asttokens>=2.1.0 in /opt/mamba/lib/python3.10/site-packages (from stack-data->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (2.2.1)
Requirement already satisfied: pure-eval in /opt/mamba/lib/python3.10/site-packages (from stack-data->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.2.2)
Building wheels for collected packages: pandasai
  Building wheel for pandasai (pyproject.toml) ... done
  Created wheel for pandasai: filename=pandasai-1.2.6-py3-none-any.whl size=73182 sha256=a6977e8d2e7b742fd392d2ab6f64df39db9a10206ad4fd42b2e9ef1a9573e247
  Stored in directory: /root/.cache/pip/wheels/fd/80/0e/fe37825be53681dfb13795dbaf9c50a667382e40dcfb858388
Successfully built pandasai
Installing collected packages: duckdb, python-dotenv, pyparsing, pydantic, pillow, multidict, kiwisolver, greenlet, frozenlist, fonttools, exceptiongroup, cycler, contourpy, async-timeout, astor, yarl, sqlalchemy, matplotlib, aiosignal, ipython, aiohttp, openai, pandasai
  Attempting uninstall: ipython
    Found existing installation: ipython 8.11.0
    Uninstalling ipython-8.11.0:
      Successfully uninstalled ipython-8.11.0
Successfully installed aiohttp-3.8.5 aiosignal-1.3.1 astor-0.8.1 async-timeout-4.0.3 contourpy-1.1.1 cycler-0.11.0 duckdb-0.8.1 exceptiongroup-1.1.3 fonttools-4.42.1 frozenlist-1.4.0 greenlet-2.0.2 ipython-8.15.0 kiwisolver-1.4.5 matplotlib-3.8.0 multidict-6.0.4 openai-0.27.10 pandasai-1.2.6 pillow-10.0.1 pydantic-1.10.12 pyparsing-3.1.1 python-dotenv-1.0.0 sqlalchemy-1.4.49 yarl-1.9.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Data Preparation

We will use some sample data to demonstrate PandasAI. First, we import the PandasAI library and create a pandas DataFrame.

import pandas as pd
from pandasai import SmartDataframe

# Sample DataFrame: country, GDP, and happiness index
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

Using PandasAI

Instantiating an LLM

from pandasai.llm import OpenAI

llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")  # Replace with your own OpenAI API key; never publish a real key.

Querying

PandasAI lets you query data in natural language. In the example below, we ask for the 5 countries with the highest happiness index in the DataFrame:

df = SmartDataframe(df, config={"llm": llm})
df.chat('Which are the 5 happiest countries?')

Charts

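You can also ask PandasAI to draw charts. For example, the prompt below asks for a plot of GDP per country (the prompt says "histogram", although one bar per country is strictly a bar chart):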

df.chat("Plot the histogram of countries showing for each the gdp, using different colors for each bar")

Multiple DataFrames

PandasAI also lets you relate multiple DataFrames and query across them:

import pandas as pd
from pandasai import SmartDatalake
from pandasai.llm import OpenAI

employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance']
}

salaries_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Salary': [5000, 6000, 4500, 7000, 5500]
}

employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

dl = SmartDatalake([employees_df, salaries_df], config={"llm": llm})
dl.chat("Who gets paid the most?")

Part 2: Source Code Walkthrough

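The project source code is available on GitHub: https://github.com/gventuri/pandas-ai

In this walkthrough we focus on the code that implements the core logic, which consists of four steps: building the prompt, generating Python code, executing the code, and formatting the output.

This flow can be seen in the chat function of the SmartDatalake class, in pandasai/smart_datalake/__init__.py of the cloned repository. Below we go through this chat function step by step, since it is the core of the whole pipeline.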

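1. Building the Prompt

Inside chat, the _get_prompt method first builds a prompt, which guides the model to generate Python code. The prompt is assembled from three main parts:

  1. The user's conversation context
  2. The current query
  3. The metadata of the DataFrames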
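
As a rough illustration of how these three parts could be combined, here is a minimal sketch. It is not PandasAI's actual _get_prompt implementation: the function names, the metadata wording, and the shortened template are all assumptions made for this example, and it uses only the standard library.

```python
# Illustrative sketch only: every name here is hypothetical, not PandasAI's API.
PROMPT_TEMPLATE = """You are provided with the following pandas DataFrames with the following metadata:

{dataframes}

Using the provided dataframes (`dfs`), update the python code based on the last user question:
{conversation}

Updated code:
"""


def describe_dataframe(index, columns, sample_rows, total_rows):
    """Render one DataFrame's metadata as text: its shape plus a few sample rows as CSV."""
    header = f"Dataframe dfs[{index}], with {total_rows} rows and {len(columns)} columns."
    lines = [",".join(columns)]
    lines += [",".join(str(value) for value in row) for row in sample_rows]
    return header + "\n" + "\n".join(lines)


def build_prompt(frames, history, query):
    """Combine the prior conversation, the current query, and the DataFrame metadata."""
    dataframes = "\n\n".join(
        describe_dataframe(i, f["columns"], f["sample_rows"], f["total_rows"])
        for i, f in enumerate(frames)
    )
    conversation = "\n".join(history + [f"User: {query}"])
    return PROMPT_TEMPLATE.format(dataframes=dataframes, conversation=conversation)


# Metadata for the sample DataFrame from Part 1 (first two rows only).
frames = [{
    "columns": ["country", "gdp", "happiness_index"],
    "sample_rows": [
        ["United States", 19294482071552, 6.94],
        ["United Kingdom", 2891615567872, 7.16],
    ],
    "total_rows": 10,
}]
prompt = build_prompt(frames, [], "Which are the 5 happiest countries?")
print(prompt)
```

Sending only the column names and a few sample rows, rather than the full data, keeps the prompt short while still giving the model enough information to write code against the right columns.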

Example Prompt

You are provided with the following pandas DataFrames with the following metadata:

{dataframes}

This is the initial python code to be updated:
# python
# TODO import all the dependencies required
{default_import}

# Analyze the data
# 1. Prepare: Preprocessing and cleaning data if necessary
# 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
# 3. Analyze: Conducting the actual analysis (if the user asks to create a chart save it to an image in exports/charts/temp_chart.png and do not show the chart.)
# 4. Output: return a dictionary of:
# - type (possible values "text", "number", "dataframe", "plot")
# - value (can be a string, a dataframe or the path of the plot, NOT a dictionary)
# Example output: {{ "type": "text", "value": "The average loan amount is $15,000." }}
def analyze_data(dfs: list[{engine_df_name}]) -> dict:
    # Code goes here (do not add comments)


# Declare a result variable
result = analyze_data(dfs)

Using the provided dataframes (`dfs`), update the python code based on the last user question:
{conversation}

Updated code:

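2. Generating Python Code

Once the prompt is built, the _llm.generate_code method generates Python code from it. This step also parses and post-processes the code returned by the model.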
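
To make the "parse and post-process" step more concrete, the sketch below shows one plausible way to pull Python source out of a model reply and check that it is syntactically valid. The helper names (extract_code, defined_functions) are hypothetical, and the real generate_code method may work differently.

```python
import ast
import re

# A Markdown code fence, built indirectly so this example can be nested in a fenced block.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the Python source in a model reply, dropping Markdown fences if present."""
    match = FENCED_BLOCK.search(response)
    code = match.group(1) if match else response
    return code.strip()


def defined_functions(code: str) -> list[str]:
    """Parse the code (raises SyntaxError if invalid) and list top-level function names."""
    tree = ast.parse(code)
    return [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]


# A made-up model reply that wraps the code in a fenced block.
reply = (
    "Here is the updated code:\n"
    + FENCE + "python\n"
    + "def analyze_data(dfs):\n"
    + "    return {'type': 'number', 'value': 42}\n"
    + "\n"
    + "result = analyze_data(dfs)\n"
    + FENCE
)
code = extract_code(reply)
functions = defined_functions(code)
print(functions)
```

Parsing with ast before execution lets a caller reject malformed replies early, and confirm that the expected analyze_data function is present, without running any model-generated code.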

The following examples show the Python code generated for several queries:

Case 1: Find the 5 happiest countries

Query:

Which are the 5 happiest countries?

Generated code:

def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    # Code goes here (do not add comments)
    ...
    # Select the top 5 happiest countries
    top_5_happiest_countries = df_sorted.head(5)
    ...
    return {"type": "dataframe", "value": top_5_happiest_countries}

# Declare a result variable
result = analyze_data(dfs)

Case 2: Sum the GDPs of the 2 unhappiest countries

Query:

What is the sum of the GDPs of the 2 unhappiest countries?

Generated code:

def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    # Get the sum of the GDPs of the 2 unhappiest countries
    sum_gdp = df_sorted.head(2)['gdp'].sum()
    ...
    return {"type": "number", "value": sum_gdp}

# Declare a result variable
result = analyze_data(dfs)

Case 3: Plot a bar chart of GDP by country

Query:

Plot the histogram of countries showing for each the gdp, using different colors for each bar

Generated code:

def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    df.plot(kind='bar', x='country', y='gdp', color='gdp', legend=False)
    ...
    return {"type": "plot", "value": "exports/charts/temp_chart.png"}

# Declare a result variable
result = analyze_data(dfs)

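Case 4: Find the GDP of the country with the highest GDP

Query (entered in Chinese, reproduced as typed, including the "gpd" typo):

gdp最高的国家对应的gpd是多少

In English: "What is the GDP of the country with the highest GDP?"

Generated code: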

def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    # Return the result
    return {"type": "text", "value": f"The GDP of the country with the highest GDP is {max_gdp}."}

# Declare a result variable
result = analyze_data(dfs)

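In this step the model does several key things:

  1. It chooses the output type (text/dataframe/plot/number), which the later formatting step relies on.
  2. It translates the natural-language query into concrete instructions.
  3. It interprets and explains the data.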

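3. Executing the Python Code

The generated Python code is executed inside a loop that has a maximum number of retries. If execution succeeds, the flow moves on to the next step. If it fails, the error is caught, a new prompt is built from the error message, and the model is called again to generate code.

Error Handling and Retries

When an error occurs, the system builds a new prompt that includes the error message and uses it to generate new, corrected code.

Example error-correction prompt: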

You are provided with a pandas dataframe (df) with {num_rows} rows and {num_columns} columns.
This is the metadata of the dataframe:
{df_head}.
The user asked the following question:
{conversation}
You generated this python code:
{code}
It fails with the following error:
{error_returned}
Correct the python code and return a new python code (do not import anything) that fixes the above mentioned error. Do not generate the same code again.
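
The execute-and-retry loop described above can be sketched as follows. This is a simplified illustration rather than PandasAI's code: run_with_retries, the stub model, the shortened error prompt, and the retry limit of 3 are all assumptions for the example.

```python
MAX_RETRIES = 3  # assumed value for this sketch

# Shortened version of the error-correction prompt shown above.
ERROR_PROMPT = """The user asked the following question:
{conversation}
You generated this python code:
{code}
It fails with the following error:
{error_returned}
Correct the python code and return a new python code (do not import anything) that fixes the above mentioned error. Do not generate the same code again.
"""


def run_with_retries(llm, prompt, conversation, dfs, max_retries=MAX_RETRIES):
    """Run model-generated code; on failure, ask the model for a fix and try again."""
    code = llm(prompt)
    for attempt in range(max_retries + 1):
        try:
            namespace = {"dfs": dfs}
            # Caution: this executes model output; real systems should restrict what it can do.
            exec(code, namespace)
            return namespace["result"]
        except Exception as exc:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            fix_prompt = ERROR_PROMPT.format(
                conversation=conversation, code=code, error_returned=repr(exc)
            )
            code = llm(fix_prompt)


def make_stub_llm():
    """Stand-in for a real model: the first reply is buggy, the second is corrected."""
    replies = iter([
        "result = {'type': 'number', 'value': sum(dfs[0]['gdp']) / 0}",
        "result = {'type': 'number', 'value': sum(dfs[0]['gdp'])}",
    ])
    return lambda prompt: next(replies)


dfs = [{"gdp": [1, 2, 3]}]  # plain dict standing in for a DataFrame
result = run_with_retries(make_stub_llm(), "initial prompt", "User: total gdp?", dfs)
print(result)  # {'type': 'number', 'value': 6}
```

Feeding the failing code and the error text back to the model gives it the context needed to produce a different attempt, and the retry cap keeps a persistently failing query from looping forever.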

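4. Formatting the Output

After the Python code has run, the result is formatted according to its type:

  1. If the type is dataframe, the result is converted with Polars to reduce memory usage, and the dataframe is returned.
  2. If the type is plot, the chart is displayed.
  3. For any other type, result["value"] is returned.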
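
The dispatch on the result type can be sketched as below. The format_result function and its two hook parameters are hypothetical names for this example. The dataframe conversion is left as a pass-through placeholder rather than a real Polars call.

```python
def format_result(result, display_plot=print, convert_dataframe=lambda df: df):
    """Dispatch on result["type"], following the three rules listed above."""
    kind = result["type"]
    value = result["value"]
    if kind == "dataframe":
        # Placeholder hook for the conversion step described in the text.
        return convert_dataframe(value)
    if kind == "plot":
        # For plots, value is the path of the saved chart image.
        display_plot(value)
        return None
    # "text", "number", and anything else: hand back the raw value.
    return value


shown = []  # collects plot paths instead of opening an image viewer
text_out = format_result({"type": "text", "value": "The GDP is 42."})
number_out = format_result({"type": "number", "value": 6})
plot_out = format_result(
    {"type": "plot", "value": "exports/charts/temp_chart.png"},
    display_plot=shown.append,
)
print(text_out, number_out, shown)
```

Because the generated code always returns a dictionary with type and value keys, the formatting step needs no knowledge of how the result was computed.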