空间站广场

论文

Notebooks

比赛

课程

Apps

我的主页

我的Notebooks

我的知识库

我的足迹

我的工作空间

任务

节点

文件

数据集

镜像

项目

数据库

公开

官能团的红外光谱特征吸收频率的预测

分子光谱

红外光谱

Machine Learning

分子光谱红外光谱Machine Learning

CLiu

更新于 2024-11-26

推荐镜像 :Basic Image:ubuntu22.04-py3.10-irkernel-r4.4.1

推荐机型 :c2_m4_cpu

数据集

ir-diazo(v1)

背景概述

红外光谱特征吸收频率的预测关键在于理解官能团的振动模式与其分子结构之间的复杂关系。科学问题本质上是如何通过定量方法构建映射关系，以便在不同环境条件下预测吸收峰。这涉及到官能团的电子环境、空间构型及其相互作用，从而影响其红外光谱特征的表现。

代码

文本

数据探索与预处理

代码

文本

[1]

#安装所需的库

!pip install rdkit==2023.9.5 scikit-learn==1.4.0 xgboost==2.0.3

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting rdkit==2023.9.5
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b1/5d/570131a2ccf83e42efe3d601cf7054cb8b0ec0b390d775813fca76af9d7b/rdkit-2023.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 4.4 MB/s eta 0:00:0000:0100:01
Collecting scikit-learn==1.4.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/3f/61/047b353f0ad550226ef962da182b4a09b689eb6df6bd84a03e44f9ee95bb/scikit_learn-1.4.0-1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 3.5 MB/s eta 0:00:0000:0100:01
Collecting xgboost==2.0.3
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c3/eb/496aa2f5d356af4185f770bc76055307f8d1870e11016b10fd779b21769c/xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl (297.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 297.1/297.1 MB 2.9 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: Pillow in /opt/mamba/lib/python3.10/site-packages (from rdkit==2023.9.5) (10.4.0)
Requirement already satisfied: numpy in /opt/mamba/lib/python3.10/site-packages (from rdkit==2023.9.5) (1.24.2)
Collecting threadpoolctl>=2.0.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/4b/2c/ffbf7a134b9ab11a67b0cf0726453cedd9c5043a4fe7a35d1cefa9a1bcfb/threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Collecting joblib>=1.2.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/91/29/df4b9b42f2be0b623cbd5e2140cafcaa2bef0759a00b7b70104dcfe2fb51/joblib-1.4.2-py3-none-any.whl (301 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.8/301.8 kB 848.6 kB/s eta 0:00:00a 0:00:01
Requirement already satisfied: scipy>=1.6.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn==1.4.0) (1.10.1)
Installing collected packages: threadpoolctl, rdkit, joblib, xgboost, scikit-learn
Successfully installed joblib-1.4.2 rdkit-2023.9.5 scikit-learn-1.4.0 threadpoolctl-3.5.0 xgboost-2.0.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

代码

文本

数据的初步探索

代码

文本

[2]

import json

import pandas as pd

代码

文本

[3]

# 文件路径

file_path = '/bohr/ir-diazo-cyyr/v1/diazo_ir_data.json'

# 读取JSON文件

with open(file_path, 'r') as file:

data = json.load(file)

# 将JSON数据转换为DataFrame

df = pd.DataFrame(list(data.items()), columns=['SMILES', 'IR_Characteristic_Peak'])

# 显示DataFrame前5行

print(df[:5])

                                              SMILES  IR_Characteristic_Peak
0   [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC=C(C=C1)C                  2102.0
1  [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC=C(C=C1...                  2107.0
2  [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC(=CC=C1)OC                  2106.0
3  [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC(=CC=C1...                  2107.0
4      [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC=NC=C1                  2114.0

代码

文本

使用 RDKit 对 SMILES 进行去重

代码

文本

在化学数据处理中，SMILES（Simplified Molecular Input Line Entry System）是一种常用的分子结构表示方法。然而，由于分子可能有不同的 SMILES 表达方式（即同一分子可能对应多个不同的 SMILES 字符串），因此直接使用字符串比较的方法可能无法准确地识别重复的分子。为了解决这个问题，我们可以使用 RDKit 提供的分子标准化和哈希方法来进行查重。

代码

文本

核心库介绍

Chem：RDKit 的核心模块，用于化学分子的读写和操作。
MolToSmiles：将分子对象转换为规范化的 SMILES 字符串。

代码

文本

[4]

from rdkit import Chem

from rdkit.Chem import MolToSmiles

代码

文本

使用 MolFromSmiles 将 SMILES 字符串转换为 RDKit 的分子对象。如果转换失败，返回 None。
使用 MolToSmiles 生成规范化的 SMILES 字符串，确保同一分子总是得到相同的 SMILES 表示。

代码

文本

[5]

def canonicalize_smiles(smiles):

try:

mol = Chem.MolFromSmiles(smiles) # 将 SMILES 转换为分子对象

if mol:

canon_smiles = Chem.MolToSmiles(mol, canonical=True) # 生成规范化的 SMILES

return canon_smiles

else:

return None

except:

return None

代码

文本

[6]

# 创建新列存储规范化的 SMILES

df['Canonical_SMILES'] = df['SMILES'].apply(canonicalize_smiles)

# 显示前5行数据

print(df.head())

                                              SMILES  IR_Characteristic_Peak  \
0   [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC=C(C=C1)C                  2102.0   
1  [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC=C(C=C1...                  2107.0   
2  [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC(=CC=C1)OC                  2106.0   
3  [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC(=CC=C1...                  2107.0   
4      [N+](=[N-])=C(C(=O)ON1C(CCC1=O)=O)C1=CC=NC=C1                  2114.0   

                                    Canonical_SMILES  
0        Cc1ccc(C(=[N+]=[N-])C(=O)ON2C(=O)CCC2=O)cc1  
1  [N-]=[N+]=C(C(=O)ON1C(=O)CCC1=O)c1ccc(C(F)(F)F...  
2       COc1cccc(C(=[N+]=[N-])C(=O)ON2C(=O)CCC2=O)c1  
3  [N-]=[N+]=C(C(=O)ON1C(=O)CCC1=O)c1cccc(C(F)(F)...  
4           [N-]=[N+]=C(C(=O)ON1C(=O)CCC1=O)c1ccncc1

代码

文本

[7]

# 删除规范化 SMILES 为空的行

df = df[df['Canonical_SMILES'].notnull()]

# 根据规范化的 SMILES 进行去重

df_unique = df.drop_duplicates(subset='Canonical_SMILES', keep='first')

# 重置索引

df_unique.reset_index(drop=True, inplace=True)

# 显示去重后的数据集信息

print(f"原始数据集大小：{len(df)}")

print(f"去重后数据集大小：{len(df_unique)}")

原始数据集大小：1827
去重后数据集大小：1827

代码

文本

为什么要进行 SMILES 查重？

数据质量提升：重复的分子可能会导致模型过拟合或偏差，去重可以提高数据质量。
准确性：由于 SMILES 表达方式的多样性，直接的字符串比较可能无法识别重复分子。使用 RDKit 规范化 SMILES，可以确保同一分子有一致的表示。

代码

文本

IR 特征峰值的分布与可视化

代码

文本

[8]

import pandas as pd

import matplotlib.pyplot as plt

# 检查并清洗数据

df = df.dropna(subset=['IR_Characteristic_Peak'])

df['IR_Characteristic_Peak'] = pd.to_numeric(df['IR_Characteristic_Peak'], errors='coerce')

# 设置图形大小

plt.figure(figsize=(10, 6))

# 定义 bin

bin_width = 10

min_value = df['IR_Characteristic_Peak'].min()

max_value = df['IR_Characteristic_Peak'].max()

bins = range(int(min_value), int(max_value) + bin_width, bin_width)

# 绘制直方图

plt.hist(df['IR_Characteristic_Peak'], bins=bins, edgecolor='black', color='skyblue')

# 添加标题和标签

plt.title('Distribution of IR Characteristic Peaks', fontsize=16)

plt.xlabel('IR Characteristic Peak (in 10 unit intervals)', fontsize=14)

plt.ylabel('Frequency', fontsize=14)

# 添加网格线

plt.grid(axis='y', alpha=0.75)

# 显示图形

plt.tight_layout()

plt.show()

代码

文本

特征工程

代码

文本

[9]

from rdkit.Chem import AllChem

import warnings

warnings.filterwarnings("ignore", category=UserWarning, module='rdkit')

warnings.filterwarnings('ignore')

from rdkit import rdBase

rdBase.DisableLog('rdApp.warning')

import matplotlib.pyplot as plts

import numpy as np

import math

from sklearn.pipeline import make_pipeline

代码

文本

[10]

def Diazo_process_smiles(smiles_str):

try:

mol = Chem.MolFromSmiles(smiles_str)

mol = Chem.AddHs(mol)

# Find the positively charged nitrogen atom (N+)

nplus_atoms = [atom.GetIdx() for atom in mol.GetAtoms() if atom.GetFormalCharge() == 1]

if not nplus_atoms:

return None, None, None, None

nplus = mol.GetAtomWithIdx(nplus_atoms[0])

# Find the carbon atom connected to N+

carbon_neighbors = [neighbor for neighbor in nplus.GetNeighbors() if neighbor.GetSymbol() == "C"]

if not carbon_neighbors:

return "NoCarbonFound", None, None, None

carbon = carbon_neighbors[0]

# Find other atoms connected to the carbon, excluding N+

connected_atoms = [neighbor for neighbor in carbon.GetNeighbors() if neighbor.GetIdx() != nplus.GetIdx()]

# Priority order for connected atoms

priority_order = [

('Cl', 'SINGLE'), ('S', 'AROMATIC'), ('S', 'SINGLE'), ('F', 'SINGLE'),

('O', 'AROMATIC'), ('O', 'DOUBLE'), ('O', 'SINGLE'),

('N', 'TRIPLE'), ('N', 'AROMATIC'), ('N', 'DOUBLE'), ('N', 'SINGLE'),

('C', 'TRIPLE'), ('C', 'AROMATIC'), ('C', 'DOUBLE'), ('C', 'SINGLE'),

('H', 'SINGLE')

]

# Create a list to store atom atomic number and connections

connections_list = []

for atom in connected_atoms:

atomic_num = atom.GetAtomicNum()

connected_atom_symbol = atom.GetSymbol()

neighbors_info = []

for neighbor in atom.GetNeighbors():

if neighbor.GetIdx() != carbon.GetIdx():

neighbor_symbol = neighbor.GetSymbol()

bond = mol.GetBondBetweenAtoms(atom.GetIdx(), neighbor.GetIdx())

bond_type_str = str(bond.GetBondType())

neighbors_info.append((neighbor_symbol, bond_type_str))

connections_dict = {'atomic_num': atomic_num, 'connections': {connected_atom_symbol: neighbors_info}}

connections_list.append(connections_dict)

# Determine R1 and R2 based on atomic number and connection priority

connections_list.sort(key=lambda x: x['atomic_num'], reverse=True)

if len(connections_list) > 1 and connections_list[0]['atomic_num'] == connections_list[1]['atomic_num']:

# If atomic numbers are the same, sort by connection priority

connections_list.sort(key=lambda x: min(priority_order.index(y) if y in priority_order else len(priority_order)

for y in [item for sublist in x['connections'].values() for item in sublist]))

R1, R2 = connections_list[0], connections_list[1] if len(connections_list) > 1 else None

atomic_number_R1 = R1['atomic_num']

atomic_number_R2 = R2['atomic_num'] if R2 else None

return R1, R2, atomic_number_R1, atomic_number_R2

except Exception as e:

print(f"An error occurred: {e}")

return None, None, None, None

代码

文本

[11]

def calculate_morgan_fingerprint(smiles, n_bits=2048):

mol = Chem.MolFromSmiles(smiles)

fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)

return list(fp)

代码

文本

[12]

def Feature_Engineering():

possible_columns = [

'Cl_SINGLE_R1', 'Cl_SINGLE_R2', 'S_AROMATIC_R1', 'S_AROMATIC_R2', 'S_SINGLE_R1', 'S_SINGLE_R2',

'F_SINGLE_R1', 'F_SINGLE_R2', 'O_AROMATIC_R1', 'O_AROMATIC_R2', 'O_DOUBLE_R1', 'O_DOUBLE_R2', 'O_SINGLE_R1', 'O_SINGLE_R2',

'N_TRIPLE_R2', 'N_TRIPLE_R1', 'N_AROMATIC_R1', 'N_AROMATIC_R2', 'N_DOUBLE_R1', 'N_DOUBLE_R2', 'N_SINGLE_R1', 'N_SINGLE_R2',

'C_TRIPLE_R1', 'C_TRIPLE_R2', 'C_AROMATIC_R2', 'C_AROMATIC_R1', 'C_DOUBLE_R1', 'C_DOUBLE_R2', 'C_SINGLE_R1', 'C_SINGLE_R2',

'H_SINGLE_R1', 'H_SINGLE_R2'

]

results = []

for index, row in df.iterrows():

smiles_str = row['SMILES']

R1, R2, atomic_number_R1, atomic_number_R2 = Diazo_process_smiles(smiles_str)#here to process target FG

# 初始化计数字典

counts = {'R1' : R1, 'R2' : R2, 'atomic_number_R1': atomic_number_R1, 'atomic_number_R2': atomic_number_R2, 'SMILES': smiles_str, 'IR_Characteristic_Peak': row['IR_Characteristic_Peak']}

counts.update({col: 0 for col in possible_columns})

if R1 == "NoCarbonFound":

continue

fingerprint = calculate_morgan_fingerprint(smiles_str)

for i, bit in enumerate(fingerprint):

counts[f'Fingerprint_{i}'] = bit

for suffix, connections in [('R1', R1), ('R2', R2)]:

if connections:

for connection in connections['connections'].values():

for bond in connection:

bond_type_str = f'{bond[0]}_{bond[1]}'

counts[f'{bond_type_str}_{suffix}'] = counts.get(f'{bond_type_str}_{suffix}', 0) + 1

results.append(counts)

results_df = pd.DataFrame(results)

results_df_filled = results_df.fillna(0)

electronegativity_dict = {

1: 2.20, 5: 2.04, 6: 2.55, 7: 3.04, 8: 3.44, 9: 3.98, 14: 1.90, 15: 2.19, 16: 2.58, 17: 3.16, 35: 2.96, 53: 2.66

}

# Define the covalent radius dictionary

covalent_radius_dict = {

1: 37, 6: 77, 7: 75, 8: 73, 9: 71, 14: 111, 15: 106, 16: 102, 17: 99, 35: 114, 53: 133

}

# Add electronegativity and covalent radius columns

results_df_filled['electronegativity_R1'] = results_df_filled['atomic_number_R1'].map(electronegativity_dict)

results_df_filled['electronegativity_R2'] = results_df_filled['atomic_number_R2'].map(electronegativity_dict)

results_df_filled['covalent_radius_R1'] = results_df_filled['atomic_number_R1'].map(covalent_radius_dict)

results_df_filled['covalent_radius_R2'] = results_df_filled['atomic_number_R2'].map(covalent_radius_dict)

# Replace NaN values if necessary (optional)

results_df_filled['electronegativity_R1'].fillna(0, inplace=True)

results_df_filled['electronegativity_R2'].fillna(0, inplace=True)

results_df_filled['covalent_radius_R1'].fillna(0, inplace=True)

results_df_filled['covalent_radius_R2'].fillna(0, inplace=True)

print("FE Processing complete.")

return results_df_filled

代码

文本

数据集划分

代码

文本

在特征工程完成后，我们需要将数据集划分为训练集和测试集。

代码

文本

定义特征和目标变量

代码

文本

目标变量 y：我们要预测的值，即红外特征峰值。特征列表 feature：手工提取的化学特征，基于领域知识。摩根指纹特征 Morgan_features：用于描述分子结构的二进制向量。特征矩阵 X：模型的输入数据。

代码

文本

[13]

# 调用特征工程函数，获取处理后的数据框

df = Feature_Engineering()

# 定义目标变量

y = df['IR_Characteristic_Peak'].values # 目标变量：红外特征峰值

# 特征列表

feature = [

'electronegativity_R1', 'electronegativity_R2', 'covalent_radius_R1', 'covalent_radius_R2',

'Cl_SINGLE_R1', 'Cl_SINGLE_R2', 'S_AROMATIC_R1', 'S_AROMATIC_R2',

'S_SINGLE_R1', 'S_SINGLE_R2', 'F_SINGLE_R1', 'F_SINGLE_R2',

'O_AROMATIC_R1', 'O_AROMATIC_R2', 'O_DOUBLE_R1', 'O_DOUBLE_R2',

'O_SINGLE_R1', 'O_SINGLE_R2', 'N_TRIPLE_R2', 'N_TRIPLE_R1',

'N_AROMATIC_R1', 'N_AROMATIC_R2', 'N_DOUBLE_R1', 'N_DOUBLE_R2',

'N_SINGLE_R1', 'N_SINGLE_R2', 'C_TRIPLE_R1', 'C_TRIPLE_R2',

'C_AROMATIC_R2', 'C_AROMATIC_R1', 'C_DOUBLE_R1', 'C_DOUBLE_R2',

'C_SINGLE_R1', 'C_SINGLE_R2', 'H_SINGLE_R1', 'H_SINGLE_R2'

]

# 定义摩根指纹特征列表

Morgan_features = [f'Fingerprint_{i}' for i in range(2048)]

# 合并所有特征

all_features = feature + Morgan_features

# 提取特征矩阵

X = df[all_features].values

FE Processing complete.

代码

文本

划分训练集和测试集

代码

文本

基于数据的随机划分

代码

文本

使用train_test_split方法，将数据集随机划分为训练集和测试集。

代码

文本

[14]

from sklearn.model_selection import train_test_split

# 将数据集划分为训练集和测试集

X_train_val, X_test, y_train_val, y_test = train_test_split(

X, y, test_size=0.2, random_state=42

)

代码

文本

基于分子骨架的随机划分

代码

文本

数据标准化

代码

文本

为了提高模型的训练效果和稳定性，我们需要对数据进行标准化处理。

代码

文本

初始化标准化器

代码

文本

StandardScaler()：将特征缩放为均值为 0，标准差为 1。在训练集上拟合标准化器：确保标准化的均值和标准差基于训练集，避免数据泄漏。对测试集进行转换：使用在训练集上计算的参数，对测试集进行相同的转换。

代码

文本

[15]

from sklearn.preprocessing import StandardScaler

# 初始化标准化器

scaler = StandardScaler()

代码

文本

对训练集进行标准化（先拟合后转换）

代码

文本

拟合（fit）：scaler.fit(X_train_val) 计算训练集的均值和标准差。这些参数将用于后续的数据标准化。转换（transform）：scaler.transform(X_train_val) 使用计算得到的均值和标准差，将训练集数据标准化。 fit_transform：是 fit 和 transform 的组合方法，简化了代码。

代码

文本

[16]

X_train_val = scaler.fit_transform(X_train_val)

代码

文本

对测试集进行标准化（直接转换）

代码

文本

直接转换（transform）：在测试集上仅使用 transform 方法，而不再调用 fit。这是因为测试集的标准化必须基于训练集的均值和标准差，以避免信息泄漏。通过这种方式，确保模型在预测时只依赖于训练数据的信息，从而保持模型的泛化能力。

代码

文本

[17]

# 在标准化之前保存原始的 X_test

X_test_original = X_test.copy()

X_test = scaler.transform(X_test)

代码

文本

模型训练与评估

代码

文本

在本教程中，我们将使用 XGBoost 回归模型XGBRegressor进行超参数调优和模型评估。我们将逐步讲解每个步骤，以便更好地理解整个过程。

代码

文本

定义模型和需要调优的超参数

代码

文本

参数说明：

n_estimators：要训练的决策树的数量。更多的树可能提高模型的性能，但也会增加计算时间。 max_depth：每棵树的最大深度。较大的深度可能捕获更多的模式，但也可能导致过拟合。 learning_rate：学习率，控制每棵树对最终模型的贡献。较小的学习率通常需要更多的树。 subsample：训练每棵树时使用的样本比例。通过取小于 1.0 的值，可以防止过拟合。

代码

文本

[18]

from xgboost import XGBRegressor

# 定义模型

model = XGBRegressor(random_state=42)

# 定义需要调优的超参数

# params = {

# 'n_estimators': [100, 200], # 树的数量

# 'max_depth': [3, 5, 7], # 树的最大深度

# 'learning_rate': [0.01, 0.1, 0.2], # 学习率

# 'subsample': [0.6, 0.8, 1.0], # 子样本比例

# }

params = {

'n_estimators': [500], # 树的数量

'max_depth': [7], # 树的最大深度

'learning_rate': [0.1], # 学习率

'subsample': [0.8], # 子样本比例

}

代码

文本

设置交叉验证策略

代码

文本

n_splits=10：将数据分成 10 份，每次使用 9 份训练，1 份验证。 shuffle=True：在分割之前打乱数据，以确保每个折的数据分布尽可能一致。 random_state：设置随机种子，保证结果的可重复性。

代码

文本

[19]

from sklearn.model_selection import KFold

# 10折交叉验证

kf = KFold(n_splits=10, shuffle=True, random_state=42)

代码

文本

使用 GridSearchCV 进行超参数调优

代码

文本

步骤解析：

创建 GridSearchCV 对象：

estimator：指定要训练的模型。 param_grid：提供要调优的超参数及其取值范围。 scoring：指定评估指标。由于 GridSearchCV 默认寻找评分最高的模型，因此使用负的均方误差。 cv：指定交叉验证策略，这里使用之前定义的 kf。 n_jobs=-1：使用所有可用的 CPU 核心，加速计算。进行网格搜索：

fit 方法：在训练数据上搜索最佳超参数组合。GridSearchCV 会尝试每一种参数组合，并使用交叉验证评估性能。

代码

文本

[20]

from sklearn.model_selection import GridSearchCV

# 使用 GridSearchCV 进行超参数调优

grid_search = GridSearchCV(

estimator=model,

param_grid=params,

scoring='neg_mean_squared_error', # 使用负的均方误差作为评分标准

cv=kf,

n_jobs=-1 # 使用所有可用的 CPU 核心进行并行计算

)

# 在训练集上进行超参数调优

grid_search.fit(X_train_val, y_train_val)

GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, device=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    feature_types=None, gamma=None,
                                    grow_policy=None, importance_type=None,
                                    int...
                                    max_cat_to_onehot=None, max_delta_step=None,
                                    max_depth=None, max_leaves=None,
                                    min_child_weight=None, missing=nan,
                                    monotone_constraints=None,
                                    multi_strategy=None, n_estimators=None,
                                    n_jobs=None, num_parallel_tree=None,
                                    random_state=42, ...),
             n_jobs=-1,
             param_grid={'learning_rate': [0.1], 'max_depth': [7],
                         'n_estimators': [500], 'subsample': [0.8]},
             scoring='neg_mean_squared_error')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

GridSearchCV?Documentation for GridSearchCViFitted

GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, device=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    feature_types=None, gamma=None,
                                    grow_policy=None, importance_type=None,
                                    int...
                                    max_cat_to_onehot=None, max_delta_step=None,
                                    max_depth=None, max_leaves=None,
                                    min_child_weight=None, missing=nan,
                                    monotone_constraints=None,
                                    multi_strategy=None, n_estimators=None,
                                    n_jobs=None, num_parallel_tree=None,
                                    random_state=42, ...),
             n_jobs=-1,
             param_grid={'learning_rate': [0.1], 'max_depth': [7],
                         'n_estimators': [500], 'subsample': [0.8]},
             scoring='neg_mean_squared_error')

estimator: XGBRegressor

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)

XGBRegressor

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)

代码

文本

查看最佳超参数组合

代码

文本

找到最佳超参数组合后，我们可以查看结果。

grid_search.best_params_：返回具有最佳性能的超参数组合。通过打印该结果，我们可以了解模型的最佳配置。

代码

文本

[21]

# 输出最佳超参数组合

print('最佳超参数组合:', grid_search.best_params_)

最佳超参数组合: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 500, 'subsample': 0.8}

代码

文本

使用最佳超参数在整个训练集上训练模型

代码

文本

使用找到的最佳超参数组合，在整个训练集上训练模型。

best_estimator_：获取使用最佳超参数组合的模型实例。再次调用 fit 方法：在整个训练集上训练模型，以充分利用所有可用的数据，提高模型的泛化能力。

代码

文本

[22]

# 使用最佳超参数组合和整个训练集训练模型

best_model = grid_search.best_estimator_

best_model.fit(X_train_val, y_train_val)

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=7, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=500, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)

代码

文本

在测试集上评估模型性能

代码

文本

决定系数（R²）：反映模型解释变量变异的程度，取值范围为 0 到 1。值越接近 1，模型的解释能力越强。 $R^{2} = 1 - \frac{\sum _{i = 1}^{n} ( y _{i} - y ^ _{i} ) ^{2}}{\sum _{i = 1}^{n} ( y _{i} - y ˉ ) ^{2}}$ $其中， y_{i} 是实际观测值。 \overset{y}{^}_{i} 是模型预测值。 \overset{y}{ˉ} 是实际观测值的均值。分子是残差平方和，分母是总平方和。$
均方根误差（RMSE）：均方误差（MSE）的平方根，表示预测值与真实值之间的平均偏差。值越小，模型性能越好。 $RMSE = \frac{1}{n} i = 1 \sum n (y_{i} - \overset{y}{^}_{i})^{2}$ $其中， n 是样本大小。 y_{i} 和 \overset{y}{^}_{i} 同样是实际观测值和模型预测值。计算方法是先求出每个预测值的误差平方和，取其平均值后开方。$

代码

文本

[23]

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# 在测试集上评估模型性能

y_test_pred = best_model.predict(X_test)

rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

r2_test = r2_score(y_test, y_test_pred)

print(f'测试集 RMSE: {rmse_test:.3f}')

print(f'测试集 R²: {r2_test:.3f}')

测试集 RMSE: 4.144
测试集 R²: 0.959

代码

文本

[62]

import matplotlib.pyplot as plt

import numpy as np

from sklearn.metrics import mean_squared_error, r2_score

# Calculate additional metrics

rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

r2_test = r2_score(y_test, y_test_pred)

# Create scatter plot with additional customizations

plt.figure(figsize=(10, 8))

# Scatter plot with color map based on prediction values

scatter = plt.scatter(y_test, y_test_pred, c=np.abs(y_test_pred - y_test), cmap='viridis', alpha=0.6, edgecolors='w', s=200)

# Add a colorbar to show the magnitude of error

cbar = plt.colorbar(scatter, label='Absolute Error (cm$^{-1}$)')

# Adjust the font size of the colorbar label

cbar.set_label('Absolute Error (cm$^{-1}$)', fontsize=14)

# Adjust the font size of the colorbar ticks

cbar.ax.tick_params(labelsize=14) # Set tick font size on the colorbar

# Plot the ideal line (y = x)

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2, label='Ideal Prediction')

# Add text annotation for RMSE and R^2

plt.text(0.07, 0.82, f'R$^2$: {r2_test:.3f}', transform=plt.gca().transAxes, fontsize=16, verticalalignment='top', color='black')

plt.text(0.07, 0.72, f'RMSE: {rmse_test:.3f}', transform=plt.gca().transAxes, fontsize=16, verticalalignment='top', color='black')

# Set labels and title

plt.xlabel('Observed Value (cm$^{-1}$)', fontsize=14)

plt.ylabel('Predicted Value (cm$^{-1}$)', fontsize=14)

# Adjust tick label font size

plt.tick_params(axis='both', which='major', labelsize=14) # Set major ticks font size

plt.grid(True)

# Show the plot

plt.show()

代码

文本

结果分析

代码

文本

基于SHAPLEY VALUE的模型可解释性分析

代码

文本

[24]

# 安装 SHAP 库

!pip install shap

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting shap
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/13/a6/b75760a52664dd82d530f9e232918bb74d1d6c39abcf34523c4f75cd4264/shap-0.46.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (540 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 540.1/540.1 kB 13.7 MB/s eta 0:00:00a 0:00:01
Collecting numba
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/79/58/cb4ac5b8f7ec64200460aef1fed88258fb872ceef504ab1f989d2ff0f684/numba-0.60.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.7/3.7 MB 36.4 MB/s eta 0:00:00a 0:00:01
Requirement already satisfied: numpy in /opt/mamba/lib/python3.10/site-packages (from shap) (1.24.2)
Requirement already satisfied: scikit-learn in /opt/mamba/lib/python3.10/site-packages (from shap) (1.4.0)
Requirement already satisfied: tqdm>=4.27.0 in /opt/mamba/lib/python3.10/site-packages (from shap) (4.64.1)
Requirement already satisfied: pandas in /opt/mamba/lib/python3.10/site-packages (from shap) (1.5.3)
Collecting cloudpickle
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/48/41/e1d85ca3cab0b674e277c8c4f678cf66a91cd2cecf93df94353a606fe0db/cloudpickle-3.1.0-py3-none-any.whl (22 kB)
Collecting slicer==0.0.8
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/63/81/9ef641ff4e12cbcca30e54e72fb0951a2ba195d0cda0ba4100e532d929db/slicer-0.0.8-py3-none-any.whl (15 kB)
Requirement already satisfied: scipy in /opt/mamba/lib/python3.10/site-packages (from shap) (1.10.1)
Requirement already satisfied: packaging>20.9 in /opt/mamba/lib/python3.10/site-packages (from shap) (23.0)
Collecting llvmlite<0.44,>=0.43.0dev0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c6/21/2ffbab5714e72f2483207b4a1de79b2eecd9debbf666ff4e7067bcc5c134/llvmlite-0.43.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (43.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.9/43.9 MB 14.1 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: pytz>=2020.1 in /opt/mamba/lib/python3.10/site-packages (from pandas->shap) (2022.7.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/mamba/lib/python3.10/site-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: joblib>=1.2.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn->shap) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn->shap) (3.5.0)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas->shap) (1.16.0)
Installing collected packages: slicer, llvmlite, cloudpickle, numba, shap
Successfully installed cloudpickle-3.1.0 llvmlite-0.43.0 numba-0.60.0 shap-0.46.0 slicer-0.0.8
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

代码

文本

[25]

import shap

# 初始化 SHAP 解释器

explainer = shap.TreeExplainer(best_model)

代码

文本

[26]

# 选择要解释的数据集，通常使用测试集

X_explain = X_test # 测试集的特征数据

# 计算 SHAP 值

shap_values = explainer.shap_values(X_explain)

代码

文本

[27]

# 获取选定特征的索引

indices = [all_features.index(feat) for feat in feature]

# 创建仅包含选定特征的 SHAP 值和特征数据

shap_value = shap_values[:, indices]

X_explained = X_explain[:, indices]

代码

文本

全局性解释

代码

文本

[28]

# 绘制 SHAP Summary Plot

plt.figure(figsize=(12, 8))

shap.summary_plot(shap_value, X_explained, feature_names=feature, show=False)

plt.show()

代码

文本

局部性解释

代码

文本

[29]

# 选择一个样本进行解释

sample_index = 0 # 可以选择其他索引

# 提取 SHAP 值和原始特征值

sample_shap_values = shap_values[sample_index] # SHAP 值

sample_features_original = X_test_original[sample_index] # 原始的特征值

# 绘制 SHAP Force Plot，显示原始特征值

shap.force_plot(explainer.expected_value, sample_shap_values, sample_features_original, feature_names=feature, matplotlib=True, show=False)

plt.show()

代码

文本

分子光谱

红外光谱

Machine Learning

分子光谱红外光谱Machine Learning

点个赞吧

本文被以下合集收录

分子光谱建模和预测

CLiu

更新于 2024-09-23

3 篇1 人关注