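PandasAI is a Python library that adds large language model (LLM) capabilities to pandas, which can strengthen your data analysis and processing workflow.
This tutorial has two parts:
- Part 1 introduces the features of PandasAI
- Part 2 walks through the source code to see how PandasAI actually implements these features
Part 1: Feature Overview
Environment Setup
First, install the PandasAI library in your Python environment. In this tutorial it is installed from a local clone of the repository: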
Installation output (abridged): the pandas-ai repository was cloned locally and installed from source with pip, using the https://pypi.tuna.tsinghua.edu.cn/simple index. This built and installed pandasai 1.2.6 on Python 3.10, together with dependencies such as pandas 1.5.3, openai 0.27.10, duckdb 0.8.1, matplotlib 3.8.0, pydantic 1.10.12, and sqlalchemy 1.4.49.
Data Preparation
We will use some sample data to demonstrate how to use PandasAI. First, we need to create a pandas DataFrame and import the PandasAI library.
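As a minimal sketch of this step, the snippet below defines a small column-oriented dataset and a helper that turns it into a DataFrame. The country names and numbers are hypothetical placeholders, not real statistics, and the names `SAMPLE_DATA` and `build_dataframe` are introduced here only for illustration. The column names `country`, `gdp`, and `happiness_index` are chosen to match the queries used later in the tutorial.

```python
# Hypothetical sample data: fictional countries and made-up values.
SAMPLE_DATA = {
    "country": ["Ruritania", "Borduria", "Genovia", "Elbonia", "Freedonia", "Sylvania"],
    "gdp": [1200, 850, 4300, 640, 2950, 1780],
    "happiness_index": [6.9, 5.4, 7.3, 4.8, 7.1, 6.2],
}


def build_dataframe(data=SAMPLE_DATA):
    """Turn the column-oriented dict into a pandas DataFrame."""
    # Imported lazily so the raw data can be inspected without pandas installed.
    import pandas as pd

    return pd.DataFrame(data)
```

Calling `df = build_dataframe()` then gives a DataFrame with one row per country.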
Using PandasAI
Instantiate an LLM
Querying
PandasAI lets you query data in natural language. In this example, we ask which 5 countries in the DataFrame have the highest happiness index.
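The two steps above (instantiating an LLM, then querying) might look roughly like the sketch below. It assumes the PandasAI 1.x interface, namely `SmartDataframe` and `pandasai.llm.OpenAI` with an `api_token` argument; check these against the version you installed. The helpers `load_api_token` and `ask`, and the `OPENAI_API_KEY` variable name, are illustrative choices rather than part of the library.

```python
import os


def load_api_token(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the LLM API token from the environment instead of hard-coding it."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token


def ask(df, question: str):
    """Wrap a pandas DataFrame and ask a natural-language question about it."""
    # Assumed PandasAI 1.x API; imported lazily so this module loads without pandasai.
    from pandasai import SmartDataframe
    from pandasai.llm import OpenAI

    llm = OpenAI(api_token=load_api_token())
    smart_df = SmartDataframe(df, config={"llm": llm})
    return smart_df.chat(question)


QUESTION = "Which are the 5 happiest countries?"
```

With a DataFrame `df` in hand, `ask(df, QUESTION)` would send the question to the model and return whatever PandasAI produces for it.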
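Charts
You can also ask PandasAI to draw charts for you. For example, you can ask it to plot a histogram.
Multiple DataFrames
PandasAI also lets you relate multiple DataFrames and query across them.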
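A rough sketch of a multi-DataFrame query follows. It relies on the `SmartDatalake` class and its `chat` method, which Part 2 of this tutorial examines; the constructor arguments shown here are an assumption based on the PandasAI 1.x interface. The two tables, their `EmployeeID` key, and the helper `ask_across` are hypothetical.

```python
def ask_across(dataframes, question: str, llm):
    """Query several related DataFrames at once."""
    # Assumed PandasAI 1.x API; imported lazily so this module loads without pandasai.
    from pandasai import SmartDatalake

    lake = SmartDatalake(list(dataframes), config={"llm": llm})
    return lake.chat(question)


# Hypothetical tables that share an "EmployeeID" key.
EMPLOYEES = {
    "EmployeeID": [1, 2, 3],
    "Name": ["Ana", "Bo", "Cy"],
    "Department": ["HR", "Sales", "IT"],
}
SALARIES = {
    "EmployeeID": [1, 2, 3],
    "Salary": [5000, 6000, 4500],
}
```

After converting both dicts with `pd.DataFrame`, a call such as `ask_across([employees_df, salaries_df], "Who gets paid the most?", llm)` leaves it to the model to work out how the tables relate.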
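Part 2: Source Code Walkthrough
The project source code is available in the GitHub source repository.
In this part we focus on how the core logic is implemented. The core logic consists of four steps: constructing the prompt, generating Python code, executing the code, and formatting the output.
This flow can be seen in the chat function of the SmartDatalake class, in /Users/dp/learn/pandas-ai/pandasai/smart_datalake/__init__.py. Below we analyze the chat function step by step, since it is the core of the flow above.
1. Constructing the Prompt
In the chat function, a prompt is first built by the _get_prompt method. This prompt guides the model to generate Python code. It is assembled from the following main parts:
- The user's conversation context
- The current query
- Metadata of the DataFrames
Example prompt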
You are provided with the following pandas DataFrames with the following metadata:
{dataframes}
This is the initial python code to be updated:
# python
# TODO import all the dependencies required
{default_import}
# Analyze the data
# 1. Prepare: Preprocessing and cleaning data if necessary
# 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
# 3. Analyze: Conducting the actual analysis (if the user asks to create a chart save it to an image in exports/charts/temp_chart.png and do not show the chart.)
# 4. Output: return a dictionary of:
# - type (possible values "text", "number", "dataframe", "plot")
# - value (can be a string, a dataframe or the path of the plot, NOT a dictionary)
# Example output: {{ "type": "text", "value": "The average loan amount is $15,000." }}
def analyze_data(dfs: list[{engine_df_name}]) -> dict:
    # Code goes here (do not add comments)

# Declare a result variable
result = analyze_data(dfs)
Using the provided dataframes (`dfs`), update the python code based on the last user question:
{conversation}
Updated code:
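To make the prompt construction concrete, here is a minimal sketch that fills an abridged version of the template above using `str.format`. It is not the library's `_get_prompt` implementation: the helper names and the exact way each DataFrame is described (a summary line plus a few CSV-style sample rows) are assumptions made for illustration.

```python
# Abridged from the example prompt; only the two placeholders used here are kept.
PROMPT_TEMPLATE = """You are provided with the following pandas DataFrames with the following metadata:

{dataframes}

Using the provided dataframes (`dfs`), update the python code based on the last user question:
{conversation}

Updated code:
"""


def describe_dataframe(index: int, columns: list, rows: list, sample_size: int = 2) -> str:
    """Summarize one table: its shape, its header, and a few sample rows."""
    header = ",".join(columns)
    sample = "\n".join(",".join(str(value) for value in row) for row in rows[:sample_size])
    summary = f"Dataframe dfs[{index}], with {len(rows)} rows and {len(columns)} columns."
    return f"{summary}\n{header}\n{sample}"


def build_prompt(tables: list, conversation: str) -> str:
    """Combine table metadata and the conversation into a single prompt string."""
    dataframes = "\n\n".join(
        describe_dataframe(i, columns, rows) for i, (columns, rows) in enumerate(tables)
    )
    return PROMPT_TEMPLATE.format(dataframes=dataframes, conversation=conversation)
```

Sending only the header and a couple of rows keeps the prompt short while still telling the model which columns exist.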
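2. Generating Python Code
Once the prompt is built, the _llm.generate_code method generates Python code from it. This step also includes parsing and processing the generated code.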
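The parsing part of this step can be pictured with the sketch below. It assumes the model may wrap its answer in a Markdown code fence, pulls out the Python source, and checks that it is syntactically valid with `ast.parse`. The function `extract_code` is a hypothetical helper, not the library's own parser.

```python
import ast
import re

# Matches a Markdown code fence, optionally tagged "python".
# The pattern `{3} stands for three consecutive backticks.
_FENCE_RE = re.compile(r"`{3}(?:python)?\s*\n(.*?)`{3}", re.DOTALL)


def extract_code(response: str) -> str:
    """Pull Python source out of an LLM reply and check that it parses."""
    match = _FENCE_RE.search(response)
    # Fall back to the whole reply when no fence is present.
    code = match.group(1) if match else response
    code = code.strip()
    ast.parse(code)  # raises SyntaxError if the reply is not valid Python
    return code
```

Validating the syntax before execution lets an obviously malformed reply be rejected early instead of failing halfway through a run.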
Let's look at the Python code generated for a few examples:
Case 1: Query the 5 happiest countries
Query:
Which are the 5 happiest countries?
Generated code:
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    # Code goes here (do not add comments)
    ...
    # Select the top 5 happiest countries
    top_5_happiest_countries = df_sorted.head(5)
    ...
    return {"type": "dataframe", "value": top_5_happiest_countries}

# Declare a result variable
result = analyze_data(dfs)
Case 2: Query the sum of the GDPs of the 2 unhappiest countries
Query:
What is the sum of the GDPs of the 2 unhappiest countries?
Generated code:
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    # Get the sum of the GDPs of the 2 unhappiest countries
    sum_gdp = df_sorted.head(2)['gdp'].sum()
    ...
    return {"type": "number", "value": sum_gdp}

# Declare a result variable
result = analyze_data(dfs)
Case 3: Plot a bar chart of country GDPs
Query:
Plot the histogram of countries showing for each the gdp, using different colors for each bar
Generated code:
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    df.plot(kind='bar', x='country', y='gdp', color='gdp', legend=False)
    ...
    return {"type": "plot", "value": "exports/charts/temp_chart.png"}

# Declare a result variable
result = analyze_data(dfs)
Case 4: Query the country with the highest GDP and its GDP
Query:
gdp最高的国家对应的gpd是多少 (in English: "What is the GDP of the country with the highest GDP?")
Generated code:
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    # Return the result
    return {"type": "text", "value": f"The GDP of the country with the highest GDP is {max_gdp}."}

# Declare a result variable
result = analyze_data(dfs)
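In this step the model does several key things:
- It distinguishes the output format (text/dataframe/plot/number), which makes the later formatting step straightforward.
- It translates the query into concrete instructions.
- It interprets and explains the data.
3. Executing the Python Code
The generated Python code is executed inside a loop with a maximum number of retries. If execution succeeds, the flow moves on to the next step. If it fails, the error is caught, a new prompt is built from the error message, and the model is called again to generate code.
Error handling and retries
When an error occurs, the system builds a new prompt that includes the error message and uses it to generate new, corrected code.
Example error-correction prompt: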
You are provided with a pandas dataframe (df) with {num_rows} rows and {num_columns} columns.
This is the metadata of the dataframe:
{df_head}.
The user asked the following question:
{conversation}
You generated this python code:
{code}
It fails with the following error:
{error_returned}
Correct the python code and return a new python code (do not import anything) that fixes the above mentioned error. Do not generate the same code again.
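The execute-and-retry loop described in this section can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `run_with_retries`, `fake_llm`, and the abridged `ERROR_PROMPT` are hypothetical, and the model call is replaced by a plain callable so the loop can be exercised offline.

```python
from typing import Callable

# Abridged version of the error-correction prompt shown above.
ERROR_PROMPT = (
    "You generated this python code:\n{code}\n"
    "It fails with the following error:\n{error_returned}\n"
    "Correct the python code and return a new python code "
    "that fixes the above mentioned error."
)


def run_with_retries(
    code: str,
    dfs: list,
    generate_code: Callable[[str], str],
    max_retries: int = 3,
):
    """Execute generated code; on failure, ask the model for a fix and try again."""
    for attempt in range(max_retries + 1):
        namespace = {"dfs": dfs}
        try:
            exec(code, namespace)  # the generated code is expected to define `result`
            return namespace["result"]
        except Exception as error:
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            prompt = ERROR_PROMPT.format(code=code, error_returned=error)
            code = generate_code(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for the model: always returns a working snippet."""
    return 'result = {"type": "number", "value": len(dfs)}'
```

For example, `run_with_retries("result = undefined_name", [1, 2], fake_llm)` fails once with a `NameError`, feeds that error back through the prompt, and then succeeds with the replacement code.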
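4. Formatting the Output
After the Python code has run, the result is formatted. The formatting depends on the result type:
- If the result type is dataframe, the value is converted with Polars to optimize memory, and the dataframe is returned
- If the result type is plot, the chart is displayed
- For any other type, result["value"] is returned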
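The dispatch on the result type can be sketched as below. The function `format_result` and its two callback parameters are hypothetical: `to_dataframe` stands in for the Polars-based conversion, `show_plot` stands in for whatever displays the saved chart, and returning `None` for plots is an assumption made for this sketch.

```python
def format_result(result: dict, show_plot=print, to_dataframe=lambda value: value):
    """Format an analyze_data() result according to its "type" field."""
    kind = result["type"]
    if kind == "dataframe":
        # Placeholder for the memory-optimizing conversion step.
        return to_dataframe(result["value"])
    if kind == "plot":
        # The value is the path of the saved chart image.
        show_plot(result["value"])
        return None
    # "text", "number", and anything else: hand back the raw value.
    return result["value"]
```

Because every generated function returns the same `{"type": ..., "value": ...}` shape, this single dispatcher covers all four cases shown in section 2.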