PandasAI: Introduction and How It Works
蔡恒兴
Published 2023-09-24
Recommended image: Basic Image:ubuntu:22.04-py3.10-pytorch2.0
Recommended machine type: c2_m4_cpu

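PandasAI is a Python library that adds large language model (LLM) capabilities to pandas, so that you can analyze and process data by asking questions in natural language.

This tutorial has two parts:

  • Part 1 introduces what PandasAI can do.
  • Part 2 walks through the source code to show how PandasAI implements these features.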

Part 1: Features

Environment setup

First, install the PandasAI library in your Python environment. The following commands clone the repository and install it from source:

!git clone https://github.com/gventuri/pandas-ai.git
!pip install ./pandas-ai
Cloning into 'pandas-ai'...
remote: Enumerating objects: 3903, done.
remote: Counting objects: 100% (542/542), done.
remote: Compressing objects: 100% (265/265), done.
remote: Total 3903 (delta 349), reused 415 (delta 257), pack-reused 3361
Receiving objects: 100% (3903/3903), 2.17 MiB | 301.00 KiB/s, done.
Resolving deltas: 100% (2613/2613), done.
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing /pandas-ai
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: duckdb<0.9.0,>=0.8.1 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (0.8.1)
Requirement already satisfied: scipy<2.0.0,>=1.9.0 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (1.10.1)
Requirement already satisfied: pandas==1.5.3 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (1.5.3)
Requirement already satisfied: matplotlib<4.0.0,>=3.7.1 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (3.8.0)
Requirement already satisfied: openai<0.28.0,>=0.27.5 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (0.27.10)
Requirement already satisfied: python-dotenv<2.0.0,>=1.0.0 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (1.0.0)
Requirement already satisfied: pydantic<2,>=1 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (1.10.12)
Requirement already satisfied: ipython<9.0.0,>=8.13.1 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (8.15.0)
Requirement already satisfied: sqlalchemy<2.0.0,>=1.4.49 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (1.4.49)
Requirement already satisfied: astor<0.9.0,>=0.8.1 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (0.8.1)
Requirement already satisfied: numpy>=1.21.0 in /opt/mamba/lib/python3.10/site-packages (from pandas==1.5.3->pandasai==1.2.6) (1.24.2)
Requirement already satisfied: pytz>=2020.1 in /opt/mamba/lib/python3.10/site-packages (from pandas==1.5.3->pandasai==1.2.6) (2022.7.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/mamba/lib/python3.10/site-packages (from pandas==1.5.3->pandasai==1.2.6) (2.8.2)
Requirement already satisfied: decorator in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (5.1.1)
Requirement already satisfied: backcall in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.2.0)
Requirement already satisfied: pexpect>4.3 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (4.8.0)
Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (3.0.38)
Requirement already satisfied: exceptiongroup in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (1.1.3)
Requirement already satisfied: pickleshare in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.7.5)
Requirement already satisfied: stack-data in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.6.2)
Requirement already satisfied: traitlets>=5 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (5.9.0)
Requirement already satisfied: jedi>=0.16 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.18.2)
Requirement already satisfied: matplotlib-inline in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.1.6)
Requirement already satisfied: pygments>=2.4.0 in /opt/mamba/lib/python3.10/site-packages (from ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (2.14.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib<4.0.0,>=3.7.1->pandasai==1.2.6) (4.42.1)
Requirement already satisfied: cycler>=0.10 in /opt/mamba/lib/python3.10/site-packages (from matplotlib<4.0.0,>=3.7.1->pandasai==1.2.6) (0.11.0)
Requirement already satisfied: packaging>=20.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib<4.0.0,>=3.7.1->pandasai==1.2.6) (23.0)
Requirement already satisfied: pillow>=6.2.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib<4.0.0,>=3.7.1->pandasai==1.2.6) (10.0.1)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib<4.0.0,>=3.7.1->pandasai==1.2.6) (3.1.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib<4.0.0,>=3.7.1->pandasai==1.2.6) (1.4.5)
Requirement already satisfied: contourpy>=1.0.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib<4.0.0,>=3.7.1->pandasai==1.2.6) (1.1.1)
Requirement already satisfied: requests>=2.20 in /opt/mamba/lib/python3.10/site-packages (from openai<0.28.0,>=0.27.5->pandasai==1.2.6) (2.28.1)
Requirement already satisfied: aiohttp in /opt/mamba/lib/python3.10/site-packages (from openai<0.28.0,>=0.27.5->pandasai==1.2.6) (3.8.5)
Requirement already satisfied: tqdm in /opt/mamba/lib/python3.10/site-packages (from openai<0.28.0,>=0.27.5->pandasai==1.2.6) (4.64.1)
Requirement already satisfied: typing-extensions>=4.2.0 in /opt/mamba/lib/python3.10/site-packages (from pydantic<2,>=1->pandasai==1.2.6) (4.5.0)
Requirement already satisfied: greenlet!=0.4.17 in /opt/mamba/lib/python3.10/site-packages (from sqlalchemy<2.0.0,>=1.4.49->pandasai==1.2.6) (2.0.2)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /opt/mamba/lib/python3.10/site-packages (from jedi>=0.16->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.8.3)
Requirement already satisfied: ptyprocess>=0.5 in /opt/mamba/lib/python3.10/site-packages (from pexpect>4.3->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.7.0)
Requirement already satisfied: wcwidth in /opt/mamba/lib/python3.10/site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.2.6)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas==1.5.3->pandasai==1.2.6) (1.16.0)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/mamba/lib/python3.10/site-packages (from requests>=2.20->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (2.1.1)
Requirement already satisfied: certifi>=2017.4.17 in /opt/mamba/lib/python3.10/site-packages (from requests>=2.20->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (2022.9.24)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/mamba/lib/python3.10/site-packages (from requests>=2.20->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (1.26.11)
Requirement already satisfied: idna<4,>=2.5 in /opt/mamba/lib/python3.10/site-packages (from requests>=2.20->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (3.4)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/mamba/lib/python3.10/site-packages (from aiohttp->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /opt/mamba/lib/python3.10/site-packages (from aiohttp->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (22.2.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/mamba/lib/python3.10/site-packages (from aiohttp->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (6.0.4)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/mamba/lib/python3.10/site-packages (from aiohttp->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (4.0.3)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/mamba/lib/python3.10/site-packages (from aiohttp->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/mamba/lib/python3.10/site-packages (from aiohttp->openai<0.28.0,>=0.27.5->pandasai==1.2.6) (1.4.0)
Requirement already satisfied: pure-eval in /opt/mamba/lib/python3.10/site-packages (from stack-data->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (0.2.2)
Requirement already satisfied: asttokens>=2.1.0 in /opt/mamba/lib/python3.10/site-packages (from stack-data->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (2.2.1)
Requirement already satisfied: executing>=1.2.0 in /opt/mamba/lib/python3.10/site-packages (from stack-data->ipython<9.0.0,>=8.13.1->pandasai==1.2.6) (1.2.0)
Building wheels for collected packages: pandasai
  Building wheel for pandasai (pyproject.toml) ... done
  Created wheel for pandasai: filename=pandasai-1.2.6-py3-none-any.whl size=73182 sha256=a6977e8d2e7b742fd392d2ab6f64df39db9a10206ad4fd42b2e9ef1a9573e247
  Stored in directory: /root/.cache/pip/wheels/64/ad/76/cb845f7dfc4a8a5dd20bb3c92d40d0f9f024e2727bc6ac887f
Successfully built pandasai
Installing collected packages: pandasai
  Attempting uninstall: pandasai
    Found existing installation: pandasai 1.2.6
    Uninstalling pandasai-1.2.6:
      Successfully uninstalled pandasai-1.2.6
Successfully installed pandasai-1.2.6
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Data preparation

We will use a small sample dataset to demonstrate PandasAI. First, import pandas and PandasAI and create a pandas DataFrame.

import pandas as pd
from pandasai import SmartDataframe

# Sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

Using PandasAI

Instantiating SmartDataframe

from pandasai.llm import OpenAI

llm = OpenAI(api_token="sk-xxx")
df = SmartDataframe(df, config={"llm": llm})

Queries

PandasAI lets you query data in natural language. In the example below, we ask for the five countries with the highest happiness index in the DataFrame:

df.chat('Which are the 5 happiest countries?')
'The 5 happiest countries are: Canada, Australia, United Kingdom, Germany, United States'
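
For comparison, the same question can be answered by hand in plain pandas. This is only a sketch of an equivalent query, not the code PandasAI generates; the names `countries` and `top5` are made up for illustration.

```python
import pandas as pd

# Same sample data as in the tutorial (only the columns needed here).
countries = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Pick the five rows with the largest happiness_index, highest first.
top5 = countries.nlargest(5, "happiness_index")["country"].tolist()
print(top5)
```

The order of the list matches the order of the countries in the answer PandasAI returned above.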
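
Charts

You can also ask PandasAI to draw charts. For example, the following request produces a bar chart of GDP by country (the query says "histogram", but one bar per country is a bar chart):
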
df.chat("Plot the histogram of countries showing for each the gdp, using different colors for each bar")

Working with multiple DataFrames

PandasAI can also relate several DataFrames to one another and answer queries across them:

import pandas as pd
from pandasai import SmartDatalake
from pandasai.llm import OpenAI

employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance']
}

salaries_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Salary': [5000, 6000, 4500, 7000, 5500]
}

employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

dl = SmartDatalake([employees_df, salaries_df], config={"llm": llm})
dl.chat("Who gets paid the most?")
'Olivia gets paid the most.'
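
To answer this question, the two DataFrames have to be joined on their shared EmployeeID column. A hand-written pandas sketch of that join is shown here for comparison. It is an assumption about what an equivalent query looks like, not PandasAI's generated code, and `merged` and `top_earner` are illustrative names.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
})

# Join the two frames on their shared key, then locate the highest salary.
merged = employees_df.merge(salaries_df, on="EmployeeID")
top_earner = merged.loc[merged["Salary"].idxmax(), "Name"]
print(top_earner)
```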
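
Part 2: Reading the source code

The project source is available on GitHub: https://github.com/gventuri/pandas-ai

In this part we focus on how the core logic is implemented. It consists of four steps: constructing the prompt, generating Python code, executing the code, and formatting the output.

All four steps are visible in the chat function of the SmartDatalake class, defined in pandasai/smart_datalake/__init__.py. The sections below walk through that function step by step.
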
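1. Constructing the prompt

The chat function first calls the _get_prompt method to build a prompt, which guides the model in generating Python code. The prompt is assembled from three main parts:

  1. The context of the conversation so far
  2. The current query
  3. Metadata of the DataFrames

Example prompt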

You are provided with the following pandas DataFrames with the following metadata:

{dataframes}

This is the initial python code to be updated:
# python
# TODO import all the dependencies required
{default_import}

# Analyze the data
# 1. Prepare: Preprocessing and cleaning data if necessary
# 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
# 3. Analyze: Conducting the actual analysis (if the user asks to create a chart save it to an image in exports/charts/temp_chart.png and do not show the chart.)
# 4. Output: return a dictionary of:
# - type (possible values "text", "number", "dataframe", "plot")
# - value (can be a string, a dataframe or the path of the plot, NOT a dictionary)
# Example output: {{ "type": "text", "value": "The average loan amount is $15,000." }}
def analyze_data(dfs: list[{engine_df_name}]) -> dict:
    # Code goes here (do not add comments)


# Declare a result variable
result = analyze_data(dfs)

Using the provided dataframes (`dfs`), update the python code based on the last user question:
{conversation}

Updated code:
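
The placeholders in the template above ({dataframes}, {conversation} and so on) and the doubled braces in the example output line suggest that it is filled in with Python string formatting. The sketch below illustrates that idea under this assumption. It uses a shortened stand-in for the template, and `describe_dataframe` and `build_prompt` are hypothetical helpers, not PandasAI functions.

```python
# Shortened stand-in for the template above; the real one is longer.
PROMPT_TEMPLATE = """You are provided with the following pandas DataFrames with the following metadata:

{dataframes}

# Example output: {{ "type": "text", "value": "..." }}

Using the provided dataframes (`dfs`), update the python code based on the last user question:
{conversation}

Updated code:"""


def describe_dataframe(index, columns, rows):
    """Hypothetical helper: render one table's shape, header and first rows as text."""
    lines = [f"Dataframe dfs[{index}], with {len(rows)} rows and {len(columns)} columns."]
    lines.append(",".join(columns))
    for row in rows[:5]:
        lines.append(",".join(str(cell) for cell in row))
    return "\n".join(lines)


def build_prompt(tables, conversation):
    """Fill the template; doubled braces in the template come out as literal braces."""
    dataframes = "\n\n".join(
        describe_dataframe(i, columns, rows) for i, (columns, rows) in enumerate(tables)
    )
    return PROMPT_TEMPLATE.format(dataframes=dataframes, conversation=conversation)


tables = [(["country", "gdp"], [["France", 2411255037952], ["Spain", 1181205135360]])]
prompt = build_prompt(tables, "User: Which country has the higher GDP?")
print(prompt)
```

Only a few rows of each table are included, which keeps the prompt short while still showing the model the column names and value formats.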
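
2. Generating Python code

Once the prompt is built, the _llm.generate_code method sends it to the model to generate Python code. This step also parses and post-processes the returned code.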
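
As an illustration of what parsing the returned code can involve, the sketch below pulls the code out of a model response that wraps it in a fenced block and checks that it is syntactically valid Python. It is a minimal example of the general technique, not PandasAI's actual parser; `extract_code` and `is_valid_python` are hypothetical names.

```python
import ast
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block


def extract_code(response: str) -> str:
    """Return the first fenced block in the response, or the whole response if none."""
    pattern = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    code = match.group(1) if match else response
    return code.strip()


def is_valid_python(code: str) -> bool:
    """Check syntax only; the code is parsed but never executed here."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


response = (
    "Here is the updated code:\n"
    + FENCE + "python\n"
    + "def analyze_data(dfs):\n"
    + "    return {'type': 'number', 'value': len(dfs)}\n"
    + "\n"
    + "result = analyze_data(dfs)\n"
    + FENCE
)
code = extract_code(response)
print(code)
print(is_valid_python(code))
```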

The following cases show the Python code generated for several queries:

Case 1: The 5 happiest countries

Query:

Which are the 5 happiest countries?

Generated code:

def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    # Code goes here (do not add comments)
    ...
    # Select the top 5 happiest countries
    top_5_happiest_countries = df_sorted.head(5)
    ...
    return {"type": "dataframe", "value": top_5_happiest_countries}

# Declare a result variable
result = analyze_data(dfs)

Case 2: Sum of the GDPs of the 2 unhappiest countries

Query:

What is the sum of the GDPs of the 2 unhappiest countries?

Generated code:

def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    # Get the sum of the GDPs of the 2 unhappiest countries
    sum_gdp = df_sorted.head(2)['gdp'].sum()
    ...
    return {"type": "number", "value": sum_gdp}

# Declare a result variable
result = analyze_data(dfs)
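
The generated code above is shown with parts elided. A complete, runnable version could look like the sketch below. The sorting step is a reconstruction based on the `df_sorted` name and is an assumption, not the model's verbatim output.

```python
import pandas as pd


def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    df = dfs[0]
    # Assumed step: an ascending sort puts the unhappiest countries first.
    df_sorted = df.sort_values("happiness_index", ascending=True)
    # Sum the GDP of the first two rows, i.e. the two unhappiest countries.
    sum_gdp = df_sorted.head(2)["gdp"].sum()
    return {"type": "number", "value": sum_gdp}


dfs = [pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416,
            1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})]

result = analyze_data(dfs)
print(result)
```

With the sample data, the two unhappiest countries are China (5.12) and Japan (5.87), so the value is the sum of their two GDP figures.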

Case 3: Bar chart of GDP by country

Query:

Plot the histogram of countries showing for each the gdp, using different colors for each bar

Generated code:

def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    df.plot(kind='bar', x='country', y='gdp', color='gdp', legend=False)
    ...
    return {"type": "plot", "value": "exports/charts/temp_chart.png"}

# Declare a result variable
result = analyze_data(dfs)

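Case 4: The country with the highest GDP and its GDP

Query (asked in Chinese; in English: "What is the GDP of the country with the highest GDP?"):

gdp最高的国家对应的gpd是多少

Generated code: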

def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    # Return the result
    return {"type": "text", "value": f"The GDP of the country with the highest GDP is {max_gdp}."}

# Declare a result variable
result = analyze_data(dfs)

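In this step the model does several key things:

  1. It chooses the output type (text/dataframe/plot/number), which later drives output formatting.
  2. It translates the natural-language query into concrete instructions.
  3. It interprets the data and explains the result.

3. Executing the Python code

The generated Python code is executed inside a loop with a maximum number of retries. If execution succeeds, the flow moves on to the next step. If it fails, the error message is captured and used to build a new prompt, and the model is called again to generate code.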
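
The loop described above can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: `run_with_retries` is a hypothetical function, and `fix_code` is a hypothetical callback standing in for the second call to the language model.

```python
def run_with_retries(code, fix_code, dfs, max_retries=3):
    """Execute generated code; on failure ask fix_code for a corrected version.

    fix_code is a hypothetical callback (code, error_message) -> new code.
    """
    attempts = 0
    while True:
        namespace = {"dfs": dfs}
        try:
            exec(code, namespace)  # the generated script is expected to set `result`
            return namespace["result"]
        except Exception as exc:
            if attempts >= max_retries:
                raise  # give up once the retry budget is used
            attempts += 1
            code = fix_code(code, f"{type(exc).__name__}: {exc}")


# Stub "model": repairs a misspelled variable name once it sees the error.
def fake_fix(code, error):
    return code.replace("totl", "total")


broken = "total = sum(dfs[0])\nresult = {'type': 'number', 'value': totl}"
result = run_with_retries(broken, fake_fix, dfs=[[1, 2, 3]])
print(result)
```

The first attempt fails with a NameError, the stub returns corrected code, and the second attempt succeeds.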

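Error handling and retries

When an error occurs, the system builds a new prompt that includes the error message and uses it to generate new, corrected code.

Example error-correction prompt: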

You are provided with a pandas dataframe (df) with {num_rows} rows and {num_columns} columns.
This is the metadata of the dataframe:
{df_head}.
The user asked the following question:
{conversation}
You generated this python code:
{code}
It fails with the following error:
{error_returned}
Correct the python code and return a new python code (do not import anything) that fixes the above mentioned error. Do not generate the same code again.
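
A small sketch of how such a prompt could be filled in after a failed execution is shown here. It uses only the last part of the template above, and `build_error_prompt` is a hypothetical helper introduced for illustration.

```python
# Last part of the error-correction template above.
ERROR_TEMPLATE = """You generated this python code:
{code}
It fails with the following error:
{error_returned}
Correct the python code and return a new python code (do not import anything) that fixes the above mentioned error. Do not generate the same code again."""


def build_error_prompt(code, exc):
    """Insert the failing code and a one-line error description into the template."""
    return ERROR_TEMPLATE.format(code=code, error_returned=f"{type(exc).__name__}: {exc}")


code = "result = {'type': 'number', 'value': 1 / 0}"
try:
    exec(code, {})
    error_prompt = None  # no error, so no correction prompt is needed
except Exception as exc:
    error_prompt = build_error_prompt(code, exc)

print(error_prompt)
```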
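
4. Formatting the output

After the Python code has run, the result is formatted according to its type:

  1. If the type is dataframe, the result is converted with Polars to optimize memory use and then returned as a dataframe.
  2. If the type is plot, the chart is displayed.
  3. For any other type, result["value"] is returned.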
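
The dispatch on the result type can be sketched as below. This is a simplified stand-in: `format_result` is a hypothetical name, the Polars conversion is omitted, and the chart is not displayed, only its path is printed.

```python
def format_result(result: dict):
    """Dispatch on result["type"], mirroring the three cases listed above."""
    kind = result["type"]
    value = result["value"]
    if kind == "dataframe":
        # The tutorial describes a Polars conversion here; this sketch returns the value unchanged.
        return value
    if kind == "plot":
        # A real implementation would open and display the image stored at this path.
        print(f"[chart saved at {value}]")
        return None
    # "text", "number" and any other type: return the raw value.
    return value


print(format_result({"type": "text", "value": "Olivia gets paid the most."}))
print(format_result({"type": "number", "value": 42}))
format_result({"type": "plot", "value": "exports/charts/temp_chart.png"})
```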