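PandasAI is a Python library that adds large language model capabilities to pandas, so you can analyze and process data using natural language.
This tutorial has two parts:
- Part 1 introduces the features of PandasAI
- Part 2 walks through the source code to show how PandasAI implements these features
Part 1: Feature overview
Environment setup
First, install the PandasAI library in your Python environment. The output below comes from cloning the pandas-ai repository and installing it from the local checkout with pip: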
Cloning into 'pandas-ai'...
...
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing /pandas-ai
...
Requirement already satisfied: pandas==1.5.3 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (1.5.3)
Requirement already satisfied: openai<0.28.0,>=0.27.5 in /opt/mamba/lib/python3.10/site-packages (from pandasai==1.2.6) (0.27.10)
...
Successfully built pandasai
...
Successfully installed pandasai-1.2.6
Data preparation
We will use some sample data to demonstrate PandasAI. First, we create a pandas DataFrame and import the PandasAI library.
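The code for this step is not shown here. A minimal sketch of the DataFrame, assuming illustrative data modeled on the PandasAI README example: the column names `country`, `gdp`, and `happiness_index` and all values are assumptions, chosen so that they are consistent with the answers quoted later in this tutorial.

```python
import pandas as pd

# Assumed sample data (illustrative values, not taken from this tutorial).
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Quick sanity check of what was loaded.
print(df.head())
```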
Using PandasAI
Instantiating a SmartDataframe
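A sketch of this step, assuming the pandasai 1.x interface (`SmartDataframe`, `pandasai.llm.OpenAI`) and an OpenAI API key. The helper name `make_smart_dataframe` is hypothetical, and the imports are deferred into the function so the snippet still loads where pandasai is not installed.

```python
import pandas as pd


def make_smart_dataframe(df: pd.DataFrame, api_token: str):
    """Wrap a pandas DataFrame so it can be queried in natural language."""
    # Deferred imports: only needed when the helper is actually called.
    from pandasai import SmartDataframe
    from pandasai.llm import OpenAI

    llm = OpenAI(api_token=api_token)
    return SmartDataframe(df, config={"llm": llm})


# Usage (requires pandasai and a valid API key):
# sdf = make_smart_dataframe(df, api_token="YOUR_API_KEY")
# sdf.chat("Which are the 5 happiest countries?")
```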
Querying
PandasAI lets you query data in natural language. In the example below, we ask for the 5 countries in the DataFrame with the highest happiness index:
'The 5 happiest countries are: Canada, Australia, United Kingdom, Germany, United States'
Charts
You can also ask PandasAI to draw charts for you. For example, you can request a histogram like this:
Multiple DataFrames
PandasAI also lets you relate several DataFrames to each other and query across them:
'Olivia gets paid the most.'
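The code that produced this answer is not shown. The sketch below assumes two small frames joined on a shared `EmployeeID` column (all names and values are illustrative assumptions). It first computes the answer in plain pandas, then shows how the same frames might be handed to `SmartDatalake` from pandasai 1.x; the helper `ask_datalake` is hypothetical.

```python
import pandas as pd

# Assumed sample data: employees and their salaries share an EmployeeID key.
employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
})

# Plain-pandas equivalent of asking who is paid the most.
merged = employees_df.merge(salaries_df, on="EmployeeID")
top_earner = merged.loc[merged["Salary"].idxmax(), "Name"]
print(top_earner)


def ask_datalake(question: str, api_token: str):
    """Ask a natural-language question across both frames (needs pandasai)."""
    from pandasai import SmartDatalake
    from pandasai.llm import OpenAI

    lake = SmartDatalake([employees_df, salaries_df],
                         config={"llm": OpenAI(api_token=api_token)})
    return lake.chat(question)
```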
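Part 2: Source code walkthrough
The project source is available at: GitHub source repository
The walkthrough focuses on the core logic, which consists of four steps: constructing the prompt, generating Python code, executing the code, and formatting the output.
This flow is visible in the chat function of the SmartDatalake class, in /Users/dp/learn/pandas-ai/pandasai/smart_datalake/__init__.py. The sections below go through chat step by step, since it is the function that implements the flow above.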
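1. Constructing the prompt
In chat, a prompt is first built by the _get_prompt method. This prompt guides the model to generate Python code. It is assembled from the following main parts:
- The user's conversation context
- The current query
- Metadata of the DataFrames
Example prompt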
You are provided with the following pandas DataFrames with the following metadata:
{dataframes}
This is the initial python code to be updated:
# python
# TODO import all the dependencies required
{default_import}
# Analyze the data
# 1. Prepare: Preprocessing and cleaning data if necessary
# 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
# 3. Analyze: Conducting the actual analysis (if the user asks to create a chart save it to an image in exports/charts/temp_chart.png and do not show the chart.)
# 4. Output: return a dictionary of:
# - type (possible values "text", "number", "dataframe", "plot")
# - value (can be a string, a dataframe or the path of the plot, NOT a dictionary)
# Example output: {{ "type": "text", "value": "The average loan amount is $15,000." }}
def analyze_data(dfs: list[{engine_df_name}]) -> dict:
    # Code goes here (do not add comments)


# Declare a result variable
result = analyze_data(dfs)
Using the provided dataframes (`dfs`), update the python code based on the last user question:
{conversation}
Updated code:
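How the placeholders in such a template get filled can be sketched with `str.format`. This is an illustration only: `TEMPLATE` is a shortened stand-in for the prompt above, and `describe` and `build_prompt` are hypothetical helpers, not PandasAI functions. Note that the doubled braces `{{ }}` survive formatting as literal braces, which is why the example output line in the template is written that way.

```python
import pandas as pd

# Shortened stand-in for the template above (not the library's actual string).
TEMPLATE = """You are provided with the following pandas DataFrames with the following metadata:

{dataframes}

# Example output: {{ "type": "text", "value": "..." }}
def analyze_data(dfs: list[{engine_df_name}]) -> dict:

Using the provided dataframes (`dfs`), update the python code based on the last user question:
{conversation}

Updated code:
"""


def describe(index: int, df: pd.DataFrame) -> str:
    """Summarize one DataFrame as its shape plus its first rows in CSV form."""
    rows, cols = df.shape
    head_csv = df.head().to_csv(index=False)
    return f"Dataframe dfs[{index}], with {rows} rows and {cols} columns.\n{head_csv}"


def build_prompt(dfs: list, conversation: str) -> str:
    """Fill the template with DataFrame metadata and the conversation."""
    metadata = "\n".join(describe(i, df) for i, df in enumerate(dfs))
    return TEMPLATE.format(
        dataframes=metadata,
        engine_df_name="pd.DataFrame",
        conversation=conversation,
    )


sample = pd.DataFrame({"country": ["Canada", "Japan"],
                       "gdp": [1607402389504, 4380756541440]})
prompt = build_prompt([sample], "User: Which country has the highest gdp?")
print(prompt)
```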
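2. Generating Python code
Once the prompt is built, the _llm.generate_code method generates Python code from it. This step also includes parsing and post-processing the generated code.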
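The parsing part of this step can be pictured as follows. This is a sketch under assumptions, not PandasAI's actual implementation: the hypothetical `extract_code` helper pulls the source out of a fenced block if the model returned one, then uses `ast.parse` to reject replies that are not valid Python.

```python
import ast
import re

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this snippet readable


def extract_code(response: str) -> str:
    """Pull Python source out of an LLM reply and check that it parses."""
    # Prefer the contents of a fenced block if the model used one.
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    code = (match.group(1) if match else response).strip()
    ast.parse(code)  # raises SyntaxError if the reply is not valid Python
    return code


reply = "Here is the code:\n" + FENCE + "python\nresult = 1 + 1\n" + FENCE
print(extract_code(reply))
```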
The following cases show the Python code that was generated:
Case 1: The 5 happiest countries
Query:
Which are the 5 happiest countries?
Generated code:
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    # Code goes here (do not add comments)
    ...
    # Select the top 5 happiest countries
    top_5_happiest_countries = df_sorted.head(5)
    ...
    return {"type": "dataframe", "value": top_5_happiest_countries}


# Declare a result variable
result = analyze_data(dfs)
Case 2: Sum of the GDPs of the 2 unhappiest countries
Query:
What is the sum of the GDPs of the 2 unhappiest countries?
Generated code:
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    # Get the sum of the GDPs of the 2 unhappiest countries
    sum_gdp = df_sorted.head(2)['gdp'].sum()
    ...
    return {"type": "number", "value": sum_gdp}


# Declare a result variable
result = analyze_data(dfs)
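For reference, one way the elided parts of such a function could look is sketched here. This is an assumed completion, not the code the model actually produced; the ascending sort and the four-row sample frame (with illustrative values) are assumptions.

```python
import pandas as pd


def analyze_data(dfs: list) -> dict:
    """Sum the GDPs of the two countries with the lowest happiness index."""
    df = dfs[0]
    # Ascending sort puts the unhappiest countries first.
    df_sorted = df.sort_values("happiness_index", ascending=True)
    sum_gdp = int(df_sorted.head(2)["gdp"].sum())
    return {"type": "number", "value": sum_gdp}


# Assumed sample data for illustration.
dfs = [pd.DataFrame({
    "country": ["Canada", "Spain", "Japan", "China"],
    "gdp": [1607402389504, 1181205135360, 4380756541440, 14631844184064],
    "happiness_index": [7.23, 6.4, 5.87, 5.12],
})]

result = analyze_data(dfs)
print(result)
```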
Case 3: Bar chart of GDP by country
Query:
Plot the histogram of countries showing for each the gdp, using different colors for each bar
Generated code:
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    df.plot(kind='bar', x='country', y='gdp', color='gdp', legend=False)
    ...
    return {"type": "plot", "value": "exports/charts/temp_chart.png"}


# Declare a result variable
result = analyze_data(dfs)
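Case 4: The country with the highest GDP and its GDP
Query:
gdp最高的国家对应的gpd是多少 (Chinese for "What is the GDP of the country with the highest GDP?"; "gpd" is a typo in the original query)
Generated code:
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    ...
    # Return the result
    return {"type": "text", "value": f"The GDP of the country with the highest GDP is {max_gdp}."}


# Declare a result variable
result = analyze_data(dfs)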
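In this step the model does several key things:
- It distinguishes the type of the output (text/dataframe/plot/number), which makes the later formatting step possible.
- It translates the natural-language query into concrete instructions.
- It interprets and explains the data.
3. Executing the Python code
The generated Python code is executed inside a loop with a maximum number of retries. If execution succeeds, the flow moves on to the next step. If it fails, the error is caught, a new prompt is built from the error message, and the model is called again to generate code.
Error handling and retries
When an error occurs, the system builds a new prompt that includes the error message and uses it to generate new, corrected code.
Example error-correction prompt: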
You are provided with a pandas dataframe (df) with {num_rows} rows and {num_columns} columns.
This is the metadata of the dataframe:
{df_head}.
The user asked the following question:
{conversation}
You generated this python code:
{code}
It fails with the following error:
{error_returned}
Correct the python code and return a new python code (do not import anything) that fixes the above mentioned error. Do not generate the same code again.
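The execute-and-retry loop described in this section can be sketched as below. It is a simplified illustration under assumptions, not PandasAI's code: `run_with_retries`, `fake_llm`, and the `MAX_RETRIES` value are hypothetical, and the model is replaced by a stub that returns broken code first and a fix once it sees an error message in the prompt.

```python
from typing import Callable

MAX_RETRIES = 3  # assumed value for illustration


def run_with_retries(generate: Callable[[str], str], prompt: str) -> dict:
    """Generate code, execute it, and re-prompt with the error on failure."""
    code = generate(prompt)
    for attempt in range(MAX_RETRIES + 1):
        namespace: dict = {}
        try:
            exec(code, namespace)  # only acceptable for trusted code
            return namespace["result"]
        except Exception as error:
            if attempt == MAX_RETRIES:
                raise  # out of retries: surface the last error
            # Build a correction prompt that carries the code and the error.
            retry_prompt = (
                f"You generated this python code:\n{code}\n"
                f"It fails with the following error:\n{error}\n"
                "Correct the python code."
            )
            code = generate(retry_prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in model: broken code first, a fix once it sees an error."""
    if "fails with the following error" in prompt:
        return 'result = {"type": "number", "value": 1 + 1}'
    return 'result = {"type": "number", "value": 1 / 0}'


print(run_with_retries(fake_llm, "What is 1 + 1?"))
```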
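4. Formatting the output
After the Python code has run, the result is formatted. The formatting depends on the result type:
- If the result type is dataframe, it is converted with Polars to optimize memory, and the dataframe is then returned
- If the result type is plot, the chart is displayed
- For all other types, result["value"] is returned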
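The dispatch on the result type can be sketched as follows. This is a simplified illustration with a hypothetical `format_result` helper: the dataframe branch returns the frame unchanged instead of converting it, and the plot branch only reports the file path instead of displaying the image.

```python
import pandas as pd


def format_result(result: dict):
    """Turn the dict returned by analyze_data into a user-facing value."""
    kind, value = result["type"], result["value"]
    if kind == "dataframe":
        # The library converts the frame at this point; the sketch does not.
        return value
    if kind == "plot":
        # The library displays the chart; here we only report where it is.
        print(f"chart saved at {value}")
        return None
    # "text", "number", and anything else: hand back the raw value.
    return value


frame = pd.DataFrame({"country": ["Canada"], "happiness_index": [7.23]})
print(format_result({"type": "text", "value": "Olivia gets paid the most."}))
```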