Principal Component Analysis (PCA) of Polymer PE Pyrolysis ---*******
Polymer Pyrolysis (v4)

©️ Copyright 2024 @ Authors
Date: 2024-06-17
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the Connect button at the top, select the deepmd-kit:2024Q1-d23cf3e image and the c2_m4_cpu node configuration, and wait a moment for the environment to start.


This notebook uses numerical simulation to study the thermal decomposition mechanism of polyethylene (PE, degree of polymerization 30, single-chain system) in a high-temperature, oxygen-free environment. The pyrolysis of PE was simulated on the LAMMPS platform with the ReaxFF force field. PCA is then used to reduce the dimensionality of the simulation data and to identify the trends of key species and the important intermediates formed during pyrolysis. Cluster analysis further reveals similarities among some of the intermediate products, providing theoretical support for follow-up studies of the reaction mechanism.


Principal Component Analysis (PCA)

PCA is a widely used linear dimensionality-reduction technique. Its core idea is to apply a linear transformation that converts the original features into a new set of orthogonal features, called principal components, each of which is a linear combination of the original features. The steps of PCA are: compute the covariance matrix of the data, compute its eigenvalues and eigenvectors, select the leading components, and project the data onto them. In a PCA of chemical species, the principal components represent the dominant directions of variance in the data, i.e. the most significant directions of change. In this experiment, which processes a large number of chemical species, each principal component orders the species by the absolute value of their loadings (eigenvector entries).
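To make these steps concrete, here is a minimal NumPy sketch of the procedure on toy data (random counts, not the pe.csv species data), cross-checked against scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

# Toy data: rows are snapshots, columns are (hypothetical) species counts.
rng = np.random.default_rng(0)
X = rng.poisson(lam=[5, 20, 3], size=(200, 3)).astype(float)

# Step 1: center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Step 2: eigen-decompose the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)      # returned in ascending order
order = np.argsort(eigvals)[::-1]           # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: project onto the first two principal components.
scores = Xc @ eigvecs[:, :2]

# Cross-check against scikit-learn (component signs may differ, so compare magnitudes).
skl = PCA(n_components=2).fit(X)
print(np.allclose(np.abs(skl.components_.T), np.abs(eigvecs[:, :2]), atol=1e-8))
print(eigvals[:2] / eigvals.sum())          # explained variance ratio of PC1 and PC2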


1. Data preparation

Before the molecular-dynamics simulation, a number of input files must be prepared: the lmp_control file, the param.qeq file, the ffield.reax.cho force-field file, the data.CHO file, and the input.files file.


We have prepared the files needed to run the LAMMPS calculation (the lmp_control file, the param.qeq file, the ffield.reax.cho force-field file, the data.CHO file, and the input.files file), together with the initial data, and placed them in the folder pca. You can click Datasets on the left to view them:

[13]
# We do not have write access to the dataset folder, so we copy it under `/personal/`:
! cp -nr /bohr/ /personal/

# Define some paths and switch to the working directory for later use:
import os
bohr_dataset_url = "/bohr/lammps-qepa/v4/pca/"  # the path can be copied from the dataset panel on the left
work_path = os.path.join("/personal", bohr_dataset_url[1:])  # slice off the leading "/" so os.path.join works as expected
os.chdir(work_path)
print(f"Current working directory: {os.getcwd()}")
Current working directory: /personal/bohr/lammps-qepa/v4/pca

Let us take a look at the downloaded pca folder:

[14]
! tree /personal/bohr/lammps-qepa/v4/pca/ -L 1
/personal/bohr/lammps-qepa/v4/pca/
├── bond.CHO
├── data.CHO
├── ffield.reax.cho
├── input.pe.files
├── lmp_control
├── param.qeq
├── pe.csv
└── pe.txt

0 directories, 8 files

The pca folder contains the lmp_control file, the param.qeq file, the ffield.reax.cho force-field file, the data.CHO file, the input.pe.files file, and the bond.CHO file, along with the raw pe.txt file (before cleaning) and the cleaned pe.csv file.


lmp_control is the input file used in LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) to control simulation parameters and settings. It is typically used together with a ReaxFF force-field file to set parameters such as temperature, pressure, time step, and simulation time. The file used here was taken from the official LAMMPS test cases.

param.qeq describes the parameters of the charge-equilibration (QEq) algorithm, which computes the charge distribution over the atoms in the system so that intra- and inter-molecular interactions are modeled more accurately. These parameters govern charge transfer and charge distribution during the simulation, which is particularly important for systems involving complex reaction mechanisms and charge transfer, such as redox and catalytic reactions. This file was also taken from the official LAMMPS test cases.

ffield.reax.cho is the ReaxFF force field, which describes interactions among elements such as C, H, O, and N. ReaxFF contains more than 30 parameters describing bond energies, angle potentials, electron transfer, three-body interactions, and other interaction terms, and it has been widely applied to thermochemical and interfacial reactions. This file was selected from the literature.

input.files is the core of this experiment. It usually refers to a file that lists the multiple input files required to run a LAMMPS simulation; in some simulation frameworks and scripts, especially automated tools and advanced workflows, it is used to manage and organize the set of input files that must be used together. In this experiment, the parameters to be computed are set by editing this file.
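For orientation only, the sketch below (written from general LAMMPS usage, not taken from this notebook's actual input.files) shows roughly how a ReaxFF run ties these files together. The exact command names (reax/c vs. reaxff, qeq/reax vs. qeq/reaxff), the element order in pair_coeff, and the temperature, timestep and run length are placeholders that depend on the LAMMPS version and on the atom types defined in data.CHO:

# Write an illustrative (hypothetical) LAMMPS ReaxFF input script that references the files above.
lammps_input = """
units           real
atom_style      charge
read_data       data.CHO

pair_style      reax/c lmp_control
pair_coeff      * * ffield.reax.cho C H O            # element order must match the atom types in data.CHO

fix             qeq   all qeq/reax 1 0.0 10.0 1.0e-6 param.qeq
fix             md    all nvt temp 2500.0 2500.0 100.0    # placeholder temperature and damping
fix             spec  all reax/c/species 10 10 10000 species.out   # species counts, later renamed pe.txt

timestep        0.1
run             2000000
"""

with open("in.pe_pyrolysis_example", "w") as f:      # hypothetical file name, for illustration only
    f.write(lammps_input)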


We will analyze the pe.csv file in this folder.


2. Data cleaning

Before pe.txt can be analyzed, the data must be cleaned: it has to be regularized, missing values have to be filled in, and a data dictionary has to be built.


First, in species.out (renamed pe.txt in this experiment for convenience), the C and H labels are swapped by mistake. They can be exchanged with the following code:

[15]
# -*- coding: utf-8 -*-
# Read the file contents
with open('pe.txt', 'r') as file:
    content = file.read()

# Swap the capital letters C and H
content = content.replace('C', 'X').replace('H', 'C').replace('X', 'H')

# Write the modified contents back to the file
with open('pe.txt', 'w') as file:
    file.write(content)


Next, clean the data and build the data dictionary at the same time.

[16]
import re
import pandas as pd

# File path (current working directory)
path = ''
name = []

# Read the file
with open(path + 'pe.txt', 'r') as f:
    csv = re.split('\n', f.read())

# Overall result: a list of dicts such as {'H2O': 1}
list_result = []
try:
    for x, i in enumerate(csv):
        # Even lines hold the chemical-formula header
        if x % 2 == 0:
            result = re.split(r'\s', i)
            # Extract the formulas from the header fields
            result = [m for m in result if m != '' and m != '#']
            # Collect all formulas seen so far; they become the table header
            name += [m for m in result if m not in name]
            number = re.split(r'\s', csv[x + 1])
            # Extract the counts from the following line
            number = [m for m in number if m.isdigit()]
            # Build the formula -> count dictionary for this frame
            list_result.append(dict(zip(result, number)))
except Exception as e:
    # Reaching the end of the file (a header line with no counts line after it)
    # raises IndexError, which is caught and reported here.
    print(f"An error occurred: {e}")

# A zero-filled list with the same length as name
nn = [0 for _ in range(len(name))]
ttt = []
for i in list_result:
    result = dict(zip(name, nn))
    for x in i.keys():
        # Overwrite the default 0 with the recorded count
        result[x] = i[x]
    ttt.append(result)

# Create the DataFrame
df = pd.DataFrame(ttt, columns=name)

# Write the cleaned file
df.to_csv(path + 'pe.csv', index=False)
print(df)

An error occurred: list index out of range
    Timestep No_Moles No_Specs H242C120 H189C94 H4C2 H49C24 H92C45 H89C45  \
0      10000        1        1        1       0    0      0      0      0   
1      20000        3        3        0       1    1      1      0      0   
2      30000        7        6        0       0    2      0      1      1   
3      40000       12        8        0       0    4      0      0      0   
4      50000       15        9        0       0    7      0      0      0   
..       ...      ...      ...      ...     ...  ...    ...    ...    ...   
195  1960000       69       12        0       0   28      0      0      0   
196  1970000       69       12        0       0   28      0      0      0   
197  1980000       69       12        0       0   28      0      0      0   
198  1990000       68       12        0       0   28      0      0      0   
199  2000000       68       12        0       0   28      0      0      0   

    H8C4  ... H14C8 H3C2 H8C5 H4C H2C H6C4 H5C4 H7C5 H6C5 H5C  
0      0  ...     0    0    0   0   0    0    0    0    0   0  
1      0  ...     0    0    0   0   0    0    0    0    0   0  
2      1  ...     0    0    0   0   0    0    0    0    0   0  
3      0  ...     0    0    0   0   0    0    0    0    0   0  
4      0  ...     0    0    0   0   0    0    0    0    0   0  
..   ...  ...   ...  ...  ...  ..  ..  ...  ...  ...  ...  ..  
195    1  ...     0    5    0   3   0    2    0    0    1   0  
196    1  ...     0    5    0   3   0    2    0    0    1   0  
197    1  ...     0    5    0   3   0    2    0    0    1   0  
198    1  ...     0    5    0   3   0    2    0    0    1   0  
199    1  ...     0    5    0   3   0    2    0    0    1   0  

[200 rows x 56 columns]

We have thus obtained the regularized PE data file, pe.csv, and can proceed to the PCA step.
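Before the PCA it can be useful to look at a few species directly. Here is a minimal sketch that plots the counts of some small products against Timestep; the column names are taken from the DataFrame printed above:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('pe.csv')

# Plot the counts of a few small pyrolysis products over time
# (these column names appear in the DataFrame printed above).
for species in ['H4C2', 'H2', 'H4C']:
    plt.plot(df['Timestep'], df[species], label=species)

plt.xlabel('Timestep')
plt.ylabel('Molecule count')
plt.legend()
plt.title('Selected species counts during PE pyrolysis')
plt.show()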


3. Data analysis

Next, we apply PCA dimensionality reduction and K-means clustering to pe.csv.

[19]
# -*- coding: utf-8 -*-
!pip install scikit-learn
!pip install matplotlib

from sklearn.cluster import KMeans
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# Read the data
data = pd.read_csv('pe.csv')

# Inspect the column names and check whether a 'Time' column exists
print("Columns in the DataFrame:")
print(data.columns)

# Check which columns exist before dropping them
columns_to_drop = ['Time', 'Timestep', 'No_Moles', 'No_Specs']
columns_in_data = data.columns

for col in columns_to_drop:
    if col in columns_in_data:
        print(f"Dropping column: {col}")
    else:
        print(f"Column {col} does not exist in DataFrame")

# Drop the columns, ignoring any that do not exist
features = data.drop(columns_to_drop, axis=1, errors='ignore')

# Confirm the remaining columns
print("Columns after dropping specified columns:")
print(features.columns)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)

# Create the PCA object (keep enough components to explain 95% of the variance)
pca = PCA(n_components=0.95)

# Fit the PCA model
pca.fit(scaled_data)

# Print the loadings of the first two principal components
print("Principal Component 1:")
print(pca.components_[0])
print("\nPrincipal Component 2:")
print(pca.components_[1])

# Print the chemical formulas of the selected substances
substances = features.columns  # feature (species) names
print("\nChemical Formulas for Selected Substances:")
for i, component in enumerate(pca.components_[:2]):
    print("Principal Component", i + 1, ":")
    component_formula = [substances[j] for j, coeff in enumerate(component) if abs(coeff) > 0]
    print(component_formula)

# Project the data onto the principal components
reduced_data = pca.transform(scaled_data)

# Names used for annotation (the first feature names, one per retained component)
component_names = substances[:reduced_data.shape[1]]

# Print the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_
print("\nExplained Variance Ratio:", explained_variance_ratio)

# Visualization
plt.figure(figsize=(8, 6))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], alpha=0.5)

# Annotate the first five points with substance names
for i, txt in enumerate(component_names[:5]):
    plt.annotate(txt, (reduced_data[i, 0], reduced_data[i, 1]))

# Axis labels
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

plt.title('PCA of PE Data')

# K-means clustering
kmeans = KMeans(n_clusters=6, random_state=42)  # fixed random seed
cluster_labels = kmeans.fit_predict(reduced_data)  # cluster the PCA-reduced data
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red', s=100)  # mark the cluster centers

# For each cluster, print the substances corresponding to the 5 closest points
for cluster in range(6):  # 6 clusters, as set above
    cluster_center = kmeans.cluster_centers_[cluster]  # center of the current cluster
    distances = []  # distance of every data point to this center
    for i, data_point in enumerate(reduced_data):
        distance = np.linalg.norm(data_point - cluster_center)  # Euclidean distance to the center
        distances.append((i, distance))  # store (index, distance)
    distances.sort(key=lambda x: x[1])  # sort by distance
    top_5_indices = [idx for idx, _ in distances[:5]]  # indices of the 5 closest points
    # Print the substance names for these indices
    print(f"Cluster {cluster + 1} top 5 substances:")
    for idx in top_5_indices:
        if idx < len(substances):  # guard against index overflow
            print(substances[idx])
    print()

plt.show()

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: scikit-learn in /opt/mamba/lib/python3.10/site-packages (1.5.0)
Requirement already satisfied: numpy>=1.19.5 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.2.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (3.5.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: matplotlib in /opt/mamba/lib/python3.10/site-packages (3.9.0)
Requirement already satisfied: contourpy>=1.0.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (4.53.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: numpy>=1.23 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Columns in the DataFrame:
Index(['Timestep', 'No_Moles', 'No_Specs', 'H242C120', 'H189C94', 'H4C2',
       'H49C24', 'H92C45', 'H89C45', 'H8C4', 'H44C22', 'H', 'H19C9', 'H73C36',
       'H5C3', 'H16C8', 'H67C34', 'H15C7', 'H55C28', 'H5C2', 'H11C5', 'H43C21',
       'H13C7', 'H7C3', 'H69C34', 'H10C5', 'H59C29', 'H6C3', 'H38C19',
       'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13', 'H34C17', 'H26C13', 'H9C5',
       'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4', 'H4C3', 'H2C2', 'H15C8',
       'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C', 'H6C4', 'H5C4', 'H7C5',
       'H6C5', 'H5C'],
      dtype='object')
Column Time does not exist in DataFrame
Dropping column: Timestep
Dropping column: No_Moles
Dropping column: No_Specs
Columns after dropping specified columns:
Index(['H242C120', 'H189C94', 'H4C2', 'H49C24', 'H92C45', 'H89C45', 'H8C4',
       'H44C22', 'H', 'H19C9', 'H73C36', 'H5C3', 'H16C8', 'H67C34', 'H15C7',
       'H55C28', 'H5C2', 'H11C5', 'H43C21', 'H13C7', 'H7C3', 'H69C34', 'H10C5',
       'H59C29', 'H6C3', 'H38C19', 'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13',
       'H34C17', 'H26C13', 'H9C5', 'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4',
       'H4C3', 'H2C2', 'H15C8', 'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C',
       'H6C4', 'H5C4', 'H7C5', 'H6C5', 'H5C'],
      dtype='object')
Principal Component 1:
[-0.01360365 -0.01527127  0.09468784 -0.01527127 -0.01926378 -0.01926378
 -0.14348788 -0.05876032  0.053678   -0.02499621 -0.04752367 -0.08011324
 -0.0558229  -0.02499621 -0.04710345 -0.02726246 -0.11497388 -0.02964005
 -0.04150012 -0.08514119 -0.06644979 -0.02890233 -0.12139768 -0.03678431
 -0.22408498 -0.07204899 -0.10996609 -0.19883239  0.11574646 -0.06417193
 -0.05227374 -0.03768472 -0.10140824 -0.20337462 -0.1147425  -0.11645706
 -0.04013328 -0.22498154 -0.072616    0.30427574  0.2691073  -0.05556565
  0.32071994 -0.16348593  0.2930887  -0.09559369  0.30186379  0.13926064
  0.31351538  0.06627044  0.02215976  0.17463813  0.02869012]

Principal Component 2:
[-4.54994720e-02 -5.44399225e-02  3.24256036e-01 -5.44399225e-02
 -8.05108092e-02 -8.05108092e-02  2.56692841e-01 -2.87389361e-01
 -1.16099232e-01 -1.44353286e-01 -2.46550290e-01  1.90659109e-01
 -2.79928043e-01 -1.44353286e-01 -9.96689139e-02 -1.24441570e-01
  1.64646010e-01 -1.56091436e-01 -2.03327549e-01 -2.41970321e-01
 -2.68370885e-02 -1.30733746e-01 -1.51229454e-01 -7.46464272e-02
  1.71878552e-01 -1.45829373e-01 -8.63830192e-02 -4.76818695e-02
  2.77168873e-01 -1.29829722e-01 -1.08953560e-01 -2.98137197e-02
 -6.81109683e-02  5.55139025e-02 -5.67168719e-02 -2.12631362e-02
 -3.01305610e-02  1.68727101e-01  1.11645593e-01 -2.03389274e-02
  4.32810343e-03  3.02055407e-02  8.84757424e-04  1.56591547e-01
 -2.00546033e-02  1.45801072e-01 -3.72144237e-02  1.90582323e-02
 -3.21480292e-02 -3.11333977e-04 -8.69317284e-03 -3.27043738e-02
 -7.42709255e-03]

Chemical Formulas for Selected Substances:
Principal Component 1 :
['H242C120', 'H189C94', 'H4C2', 'H49C24', 'H92C45', 'H89C45', 'H8C4', 'H44C22', 'H', 'H19C9', 'H73C36', 'H5C3', 'H16C8', 'H67C34', 'H15C7', 'H55C28', 'H5C2', 'H11C5', 'H43C21', 'H13C7', 'H7C3', 'H69C34', 'H10C5', 'H59C29', 'H6C3', 'H38C19', 'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13', 'H34C17', 'H26C13', 'H9C5', 'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4', 'H4C3', 'H2C2', 'H15C8', 'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C', 'H6C4', 'H5C4', 'H7C5', 'H6C5', 'H5C']
Principal Component 2 :
['H242C120', 'H189C94', 'H4C2', 'H49C24', 'H92C45', 'H89C45', 'H8C4', 'H44C22', 'H', 'H19C9', 'H73C36', 'H5C3', 'H16C8', 'H67C34', 'H15C7', 'H55C28', 'H5C2', 'H11C5', 'H43C21', 'H13C7', 'H7C3', 'H69C34', 'H10C5', 'H59C29', 'H6C3', 'H38C19', 'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13', 'H34C17', 'H26C13', 'H9C5', 'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4', 'H4C3', 'H2C2', 'H15C8', 'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C', 'H6C4', 'H5C4', 'H7C5', 'H6C5', 'H5C']

Explained Variance Ratio: [0.17819562 0.12522012 0.09090798 0.06825519 0.05077275 0.04599622
 0.04262152 0.04199834 0.03961585 0.03526824 0.02835994 0.02737296
 0.02371402 0.02214529 0.02105633 0.02000413 0.01865851 0.01794748
 0.01469968 0.01314307 0.012208   0.01014002 0.00922751]
Cluster 1 top 5 substances:

Cluster 2 top 5 substances:

Cluster 3 top 5 substances:
H49C24

Cluster 4 top 5 substances:
H6C2
H25C13
H34C17
H26C13
H9C5

Cluster 5 top 5 substances:
H242C120

Cluster 6 top 5 substances:
H89C45
H8C4


The complete code of this experiment is shown above; it produces the figure below. The code also prints the loadings of each substance on PC1 and PC2, as well as the substances assigned to each cluster by the clustering analysis.


The code output is as follows. First, the PCA scatter plot combined with the clustering results:

[Figure: PCA of PE Data, scatter plot of the first two principal components]

The five substances with the largest influence on the experiment are labeled in the plot.

Next comes the clustering step. The red crosses are the cluster centers produced by K-means, which we configured to produce 6 clusters. By default, KMeans picks its initial centers at random, so the result can differ from run to run; this experiment therefore fixes a random seed, which you can change before rerunning the analysis. The code also prints the substances belonging to each cluster, sorted by distance to the center; a cluster with nothing listed is empty.

[Figure: K-means cluster centers marked on the PCA scatter plot]
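The choice of 6 clusters is a modeling decision. One informal way to check it is the elbow method; here is a minimal sketch, assuming reduced_data from the PCA cell above is still in memory:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Inertia (within-cluster sum of squares) for a range of cluster counts.
ks = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(reduced_data).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow check for the K-means cluster count')
plt.show()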


Next, to help analyze how the species change during the reaction, we examine how strongly each species contributes to PC1 and PC2. This is shown more intuitively by a loading-matrix heatmap; the code to generate the loading matrix is as follows.

[18]
# -*- coding: utf-8 -*-
!pip install seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Load the cleaned data
data = pd.read_csv('pe.csv')

# Select the features (columns from the 5th onward; note that this skips
# Timestep, No_Moles, No_Specs and also the first species column, H242C120)
X = data.iloc[:, 4:].values
features = data.columns[4:]

# Run PCA on all components
pca = PCA()
X_pca = pca.fit_transform(X)

# Explained variance ratio of each component
explained_variance_ratio = pca.explained_variance_ratio_

# Keep the two components with the largest explained variance
n_components = 2
pca_selected = PCA(n_components=n_components)
X_pca_selected = pca_selected.fit_transform(X)

# Compute the loading matrix (components scaled by the square root of their variance)
loadings = pca_selected.components_.T * np.sqrt(pca_selected.explained_variance_)

# Put the loadings into a DataFrame
loading_matrix = pd.DataFrame(loadings, columns=[f'PC{i+1}' for i in range(n_components)], index=features)

# Plot the loading matrix as a heatmap
plt.figure(figsize=(14, 10))  # larger figure so the cells are easier to read
sns.heatmap(loading_matrix, annot=False, cmap='coolwarm', center=0, cbar_kws={"shrink": .8})
plt.title('PCA: Loading Matrix')
plt.xlabel('Principal Components')
plt.ylabel('Features')
plt.xticks(rotation=45)  # rotate the X-axis labels to avoid overlap
plt.yticks(rotation=0)   # keep the Y-axis labels horizontal
plt.show()

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: seaborn in /opt/mamba/lib/python3.10/site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /opt/mamba/lib/python3.10/site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in /opt/mamba/lib/python3.10/site-packages (from seaborn) (2.1.4)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /opt/mamba/lib/python3.10/site-packages (from seaborn) (3.9.0)
Requirement already satisfied: contourpy>=1.0.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.53.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (23.2)
Requirement already satisfied: pillow>=8 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/mamba/lib/python3.10/site-packages (from pandas>=1.2->seaborn) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /opt/mamba/lib/python3.10/site-packages (from pandas>=1.2->seaborn) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

The resulting figure is shown below.

[Figure: PCA loading-matrix heatmap ('PCA: Loading Matrix')]

To read the loading-matrix heatmap, note that the color intensity encodes the magnitude of the loading, with blue for negative loadings and red for positive ones. Species can be picked out for further analysis from several starting points (see the sketch after this list):

  1. species with large loadings on both PC1 and PC2;
  2. species whose PC1 and PC2 loadings have opposite signs;
  3. species with a large loading on only PC1 or only PC2.
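Here is a minimal sketch of these three selections, assuming the loading_matrix DataFrame from the cell above is still in memory; the 0.2 cutoff for a "large" loading is an arbitrary illustration:

import numpy as np

threshold = 0.2  # arbitrary cutoff for "large" loadings, for illustration only
abs_load = loading_matrix.abs()

# 1. Species with large loadings on both PC1 and PC2
both_large = loading_matrix[(abs_load['PC1'] > threshold) & (abs_load['PC2'] > threshold)]

# 2. Species whose PC1 and PC2 loadings have opposite signs
opposite_sign = loading_matrix[np.sign(loading_matrix['PC1']) != np.sign(loading_matrix['PC2'])]

# 3. Species with a large loading on only one of PC1 / PC2
only_one = loading_matrix[(abs_load['PC1'] > threshold) ^ (abs_load['PC2'] > threshold)]

print("Large on both PC1 and PC2:\n", both_large, "\n")
print("Opposite signs on PC1 and PC2:\n", opposite_sign.head(), "\n")
print("Large on only one component:\n", only_one)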

4. Uniform Manifold Approximation and Projection (UMAP)

In PCA, every principal component is a linear combination of the original features, and the components are ordered by the amount of variance they explain: components explaining more variance come first, those explaining less come later. The leading components therefore carry the most significant variation in the data, and within PC1 and PC2 the species with the largest loadings carry most of that variation; in this project, this shows up as larger swings in their time-concentration curves plotted in Origin. The main method of this study is PCA-based dimensionality reduction, with UMAP as an auxiliary method: we analyze the chemical species that dominate the PCA loadings, apply clustering to species that appear to be related, and check whether the UMAP results agree with the PCA results as a consistency check.

UMAP is a nonlinear dimensionality-reduction technique that maps high-dimensional data into a low-dimensional space while preserving the local distance relationships between data points. Unlike linear methods such as PCA, UMAP can capture nonlinear structure and complex relationships in the data, so it often performs better on high-dimensional data sets. Its basic idea is to optimize a loss function that embeds the high-dimensional data in a low-dimensional manifold while preserving as much of the local and global structure as possible; the computation involves computing local distances, building a neighborhood graph in the high-dimensional space, and optimizing the loss function for the low-dimensional embedding.

[12]

# -*- coding: utf-8 -*-
!pip install scikit-learn
!pip install matplotlib umap-learn

import pandas as pd
import matplotlib.pyplot as plt
import umap.umap_ as umap
from sklearn.cluster import KMeans

# Read the data
data = pd.read_csv('pe.csv')

# Inspect the column names and check whether a 'Time' column exists
print("Columns in the DataFrame:")
print(data.columns)

# Check which columns exist before dropping them
columns_to_drop = ['Time', 'Timestep', 'No_Moles', 'No_Specs']
columns_in_data = data.columns

for col in columns_to_drop:
    if col in columns_in_data:
        print(f"Dropping column: {col}")
    else:
        print(f"Column {col} does not exist in DataFrame")

# Drop the columns, ignoring any that do not exist
features = data.drop(columns_to_drop, axis=1, errors='ignore')

# Confirm the remaining columns
print("Columns after dropping specified columns:")
print(features.columns)

# Species names, used below for annotation (defined here so this cell is self-contained)
substances = features.columns

# Initialize and fit the UMAP model
umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(features)

# Cluster with KMeans
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(features)
cluster_labels = kmeans.labels_

# Plot the results
plt.figure(figsize=(10, 8))

# UMAP embedding colored by cluster label
plt.subplot(1, 2, 1)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=cluster_labels, cmap='viridis', alpha=0.7)
plt.colorbar(label='Cluster Label')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('UMAP Visualization with Clustering')

# Annotate the first five points with substance names
for i, txt in enumerate(substances[:5]):
    plt.annotate(txt, (X_umap[i, 0], X_umap[i, 1]), fontsize=8, xytext=(5, -5),
                 textcoords='offset points')  # offset the annotations slightly

# Plain scatter of the UMAP embedding
plt.subplot(1, 2, 2)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c='blue', alpha=0.3)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('UMAP Visualization of Molecule Data')

plt.show()
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: scikit-learn in /opt/mamba/lib/python3.10/site-packages (1.5.0)
Requirement already satisfied: numpy>=1.19.5 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.2.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (3.5.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: matplotlib in /opt/mamba/lib/python3.10/site-packages (3.9.0)
Requirement already satisfied: umap-learn in /opt/mamba/lib/python3.10/site-packages (0.5.6)
Requirement already satisfied: contourpy>=1.0.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (4.53.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: numpy>=1.23 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: scipy>=1.3.1 in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (1.11.4)
Requirement already satisfied: scikit-learn>=0.22 in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (1.5.0)
Requirement already satisfied: numba>=0.51.2 in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (0.60.0)
Requirement already satisfied: pynndescent>=0.5 in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (0.5.12)
Requirement already satisfied: tqdm in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (4.64.1)
Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /opt/mamba/lib/python3.10/site-packages (from numba>=0.51.2->umap-learn) (0.43.0)
Requirement already satisfied: joblib>=0.11 in /opt/mamba/lib/python3.10/site-packages (from pynndescent>=0.5->umap-learn) (1.4.2)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn>=0.22->umap-learn) (3.5.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Columns in the DataFrame:
Index(['Timestep', 'No_Moles', 'No_Specs', 'H242C120', 'H189C94', 'H4C2',
       'H49C24', 'H92C45', 'H89C45', 'H8C4', 'H44C22', 'H', 'H19C9', 'H73C36',
       'H5C3', 'H16C8', 'H67C34', 'H15C7', 'H55C28', 'H5C2', 'H11C5', 'H43C21',
       'H13C7', 'H7C3', 'H69C34', 'H10C5', 'H59C29', 'H6C3', 'H38C19',
       'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13', 'H34C17', 'H26C13', 'H9C5',
       'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4', 'H4C3', 'H2C2', 'H15C8',
       'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C', 'H6C4', 'H5C4', 'H7C5',
       'H6C5', 'H5C'],
      dtype='object')
Column Time does not exist in DataFrame
Dropping column: Timestep
Dropping column: No_Moles
Dropping column: No_Specs
Columns after dropping specified columns:
Index(['H242C120', 'H189C94', 'H4C2', 'H49C24', 'H92C45', 'H89C45', 'H8C4',
       'H44C22', 'H', 'H19C9', 'H73C36', 'H5C3', 'H16C8', 'H67C34', 'H15C7',
       'H55C28', 'H5C2', 'H11C5', 'H43C21', 'H13C7', 'H7C3', 'H69C34', 'H10C5',
       'H59C29', 'H6C3', 'H38C19', 'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13',
       'H34C17', 'H26C13', 'H9C5', 'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4',
       'H4C3', 'H2C2', 'H15C8', 'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C',
       'H6C4', 'H5C4', 'H7C5', 'H6C5', 'H5C'],
      dtype='object')
/opt/mamba/lib/python3.10/site-packages/umap/umap_.py:1945: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")