Principal Component Analysis (PCA) of Polymer PE Pyrolysis ---*******
Polymer Pyrolysis (v4)

©️ Copyright 2024 @ Authors
Date: 2024-06-17
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the Connect button at the top, select the deepmd-kit:2024Q1-d23cf3e image and the c2_m4_cpu node configuration, and wait a moment for the environment to start.


This notebook uses numerical simulation to study the thermal decomposition mechanism of polyethylene (PE, degree of polymerization 30, single-chain system) in a high-temperature, oxygen-free environment. The pyrolysis of PE was simulated on the LAMMPS platform with the ReaxFF force field. PCA is then used to reduce the dimensionality of the simulation data and to identify the trends of key species and the important intermediates formed during pyrolysis. Cluster analysis further reveals similarities among some of the intermediate products, providing theoretical support for follow-up studies of the reaction mechanism.


Principal Component Analysis (PCA)

PCA is a widely used linear dimensionality-reduction technique. Its core idea is to apply a linear transformation that converts the original features into a new set of orthogonal features, called principal components, each of which is a linear combination of the original features. The steps of PCA are: compute the covariance matrix of the data, compute its eigenvalues and eigenvectors, select the leading components, and project the data onto them. In a PCA of chemical species, the principal components represent the dominant directions of variance in the data, i.e. the most significant directions of change. In this experiment, which processes a large number of chemical species, each principal component orders the species by the absolute value of their loadings (eigenvector entries).
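To make these steps concrete, here is a minimal NumPy sketch of the procedure on toy data (random counts, not the pe.csv species data), cross-checked against scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

# Toy data: rows are snapshots, columns are (hypothetical) species counts.
rng = np.random.default_rng(0)
X = rng.poisson(lam=[5, 20, 3], size=(200, 3)).astype(float)

# Step 1: center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Step 2: eigen-decompose the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)      # returned in ascending order
order = np.argsort(eigvals)[::-1]           # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: project onto the first two principal components.
scores = Xc @ eigvecs[:, :2]

# Cross-check against scikit-learn (component signs may differ, so compare magnitudes).
skl = PCA(n_components=2).fit(X)
print(np.allclose(np.abs(skl.components_.T), np.abs(eigvecs[:, :2]), atol=1e-8))
print(eigvals[:2] / eigvals.sum())          # explained variance ratio of PC1 and PC2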


1. Data preparation

Before the molecular-dynamics simulation, a number of input files must be prepared: the lmp_control file, the param.qeq file, the ffield.reax.cho force-field file, the data.CHO file, and the input.files file.


We have prepared the files needed to run the LAMMPS calculation (the lmp_control file, the param.qeq file, the ffield.reax.cho force-field file, the data.CHO file, and the input.files file), together with the initial data, and placed them in the folder pca. You can click Datasets on the left to view them:

[13]
# We do not have write access to the dataset folder, so we copy it under `/personal/`:
! cp -nr /bohr/ /personal/

# Define some paths and switch to the working directory for later use:
import os
bohr_dataset_url = "/bohr/lammps-qepa/v4/pca/"  # the path can be copied from the dataset panel on the left
work_path = os.path.join("/personal", bohr_dataset_url[1:])  # slice off the leading "/" so os.path.join works as expected
os.chdir(work_path)
print(f"Current working directory: {os.getcwd()}")
Current working directory: /personal/bohr/lammps-qepa/v4/pca

Let us take a look at the downloaded pca folder:

[14]
! tree /personal/bohr/lammps-qepa/v4/pca/ -L 1
/personal/bohr/lammps-qepa/v4/pca/
├── bond.CHO
├── data.CHO
├── ffield.reax.cho
├── input.pe.files
├── lmp_control
├── param.qeq
├── pe.csv
└── pe.txt

0 directories, 8 files

The pca folder contains the lmp_control file, the param.qeq file, the ffield.reax.cho force-field file, the data.CHO file, the input.pe.files file, and the bond.CHO file, along with the raw pe.txt file (before cleaning) and the cleaned pe.csv file.


lmp_control is the input file used in LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) to control simulation parameters and settings. It is typically used together with a ReaxFF force-field file to set parameters such as temperature, pressure, time step, and simulation time. The file used here was taken from the official LAMMPS test cases.

param.qeq describes the parameters of the charge-equilibration (QEq) algorithm, which computes the charge distribution over the atoms in the system so that intra- and inter-molecular interactions are modeled more accurately. These parameters govern charge transfer and charge distribution during the simulation, which is particularly important for systems involving complex reaction mechanisms and charge transfer, such as redox and catalytic reactions. This file was also taken from the official LAMMPS test cases.

ffield.reax.cho is the ReaxFF force field, which describes interactions among elements such as C, H, O, and N. ReaxFF contains more than 30 parameters describing bond energies, angle potentials, electron transfer, three-body interactions, and other interaction terms, and it has been widely applied to thermochemical and interfacial reactions. This file was selected from the literature.

input.files is the core of this experiment. It usually refers to a file that lists the multiple input files required to run a LAMMPS simulation; in some simulation frameworks and scripts, especially automated tools and advanced workflows, it is used to manage and organize the set of input files that must be used together. In this experiment, the parameters to be computed are set by editing this file.
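For orientation only, the sketch below (written from general LAMMPS usage, not taken from this notebook's actual input.files) shows roughly how a ReaxFF run ties these files together. The exact command names (reax/c vs. reaxff, qeq/reax vs. qeq/reaxff), the element order in pair_coeff, and the temperature, timestep and run length are placeholders that depend on the LAMMPS version and on the atom types defined in data.CHO:

# Write an illustrative (hypothetical) LAMMPS ReaxFF input script that references the files above.
lammps_input = """
units           real
atom_style      charge
read_data       data.CHO

pair_style      reax/c lmp_control
pair_coeff      * * ffield.reax.cho C H O            # element order must match the atom types in data.CHO

fix             qeq   all qeq/reax 1 0.0 10.0 1.0e-6 param.qeq
fix             md    all nvt temp 2500.0 2500.0 100.0    # placeholder temperature and damping
fix             spec  all reax/c/species 10 10 10000 species.out   # species counts, later renamed pe.txt

timestep        0.1
run             2000000
"""

with open("in.pe_pyrolysis_example", "w") as f:      # hypothetical file name, for illustration only
    f.write(lammps_input)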


We will analyze the pe.csv file in this folder.


2. Data cleaning

Before pe.txt can be analyzed, the data must be cleaned: it has to be regularized, missing values have to be filled in, and a data dictionary has to be built.


First, in species.out (renamed pe.txt in this experiment for convenience), the C and H labels are swapped by mistake. They can be exchanged with the following code:

[15]
# -*- coding: utf-8 -*-
# Read the file contents
with open('pe.txt', 'r') as file:
    content = file.read()

# Swap the capital letters C and H
content = content.replace('C', 'X').replace('H', 'C').replace('X', 'H')

# Write the modified contents back to the file
with open('pe.txt', 'w') as file:
    file.write(content)


Next, clean the data and build the data dictionary at the same time.

[16]
import re
import pandas as pd

# File path (current working directory)
path = ''
name = []

# Read the file
with open(path + 'pe.txt', 'r') as f:
    csv = re.split('\n', f.read())

# Overall result: a list of dicts such as {'H2O': 1}
list_result = []
try:
    for x, i in enumerate(csv):
        # Even lines hold the chemical-formula header
        if x % 2 == 0:
            result = re.split(r'\s', i)
            # Extract the formulas from the header fields
            result = [m for m in result if m != '' and m != '#']
            # Collect all formulas seen so far; they become the table header
            name += [m for m in result if m not in name]
            number = re.split(r'\s', csv[x + 1])
            # Extract the counts from the following line
            number = [m for m in number if m.isdigit()]
            # Build the formula -> count dictionary for this frame
            list_result.append(dict(zip(result, number)))
except Exception as e:
    # Reaching the end of the file (a header line with no counts line after it)
    # raises IndexError, which is caught and reported here.
    print(f"An error occurred: {e}")

# A zero-filled list with the same length as name
nn = [0 for _ in range(len(name))]
ttt = []
for i in list_result:
    result = dict(zip(name, nn))
    for x in i.keys():
        # Overwrite the default 0 with the recorded count
        result[x] = i[x]
    ttt.append(result)

# Create the DataFrame
df = pd.DataFrame(ttt, columns=name)

# Write the cleaned file
df.to_csv(path + 'pe.csv', index=False)
print(df)

An error occurred: list index out of range
    Timestep No_Moles No_Specs H242C120 H189C94 H4C2 H49C24 H92C45 H89C45  \
0      10000        1        1        1       0    0      0      0      0   
1      20000        3        3        0       1    1      1      0      0   
2      30000        7        6        0       0    2      0      1      1   
3      40000       12        8        0       0    4      0      0      0   
4      50000       15        9        0       0    7      0      0      0   
..       ...      ...      ...      ...     ...  ...    ...    ...    ...   
195  1960000       69       12        0       0   28      0      0      0   
196  1970000       69       12        0       0   28      0      0      0   
197  1980000       69       12        0       0   28      0      0      0   
198  1990000       68       12        0       0   28      0      0      0   
199  2000000       68       12        0       0   28      0      0      0   

    H8C4  ... H14C8 H3C2 H8C5 H4C H2C H6C4 H5C4 H7C5 H6C5 H5C  
0      0  ...     0    0    0   0   0    0    0    0    0   0  
1      0  ...     0    0    0   0   0    0    0    0    0   0  
2      1  ...     0    0    0   0   0    0    0    0    0   0  
3      0  ...     0    0    0   0   0    0    0    0    0   0  
4      0  ...     0    0    0   0   0    0    0    0    0   0  
..   ...  ...   ...  ...  ...  ..  ..  ...  ...  ...  ...  ..  
195    1  ...     0    5    0   3   0    2    0    0    1   0  
196    1  ...     0    5    0   3   0    2    0    0    1   0  
197    1  ...     0    5    0   3   0    2    0    0    1   0  
198    1  ...     0    5    0   3   0    2    0    0    1   0  
199    1  ...     0    5    0   3   0    2    0    0    1   0  

[200 rows x 56 columns]

We have thus obtained the regularized PE data file, pe.csv, and can proceed to the PCA step.
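Before the PCA it can be useful to look at a few species directly. Here is a minimal sketch that plots the counts of some small products against Timestep; the column names are taken from the DataFrame printed above:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('pe.csv')

# Plot the counts of a few small pyrolysis products over time
# (these column names appear in the DataFrame printed above).
for species in ['H4C2', 'H2', 'H4C']:
    plt.plot(df['Timestep'], df[species], label=species)

plt.xlabel('Timestep')
plt.ylabel('Molecule count')
plt.legend()
plt.title('Selected species counts during PE pyrolysis')
plt.show()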


3. Data analysis

Next, we apply PCA dimensionality reduction and K-means clustering to pe.csv.

[19]
# -*- coding: utf-8 -*-
!pip install scikit-learn
!pip install matplotlib

from sklearn.cluster import KMeans
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# Read the data
data = pd.read_csv('pe.csv')

# Inspect the column names and check whether a 'Time' column exists
print("Columns in the DataFrame:")
print(data.columns)

# Check which columns exist before dropping them
columns_to_drop = ['Time', 'Timestep', 'No_Moles', 'No_Specs']
columns_in_data = data.columns

for col in columns_to_drop:
    if col in columns_in_data:
        print(f"Dropping column: {col}")
    else:
        print(f"Column {col} does not exist in DataFrame")

# Drop the columns, ignoring any that do not exist
features = data.drop(columns_to_drop, axis=1, errors='ignore')

# Confirm the remaining columns
print("Columns after dropping specified columns:")
print(features.columns)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)

# Create the PCA object (keep enough components to explain 95% of the variance)
pca = PCA(n_components=0.95)

# Fit the PCA model
pca.fit(scaled_data)

# Print the loadings of the first two principal components
print("Principal Component 1:")
print(pca.components_[0])
print("\nPrincipal Component 2:")
print(pca.components_[1])

# Print the chemical formulas of the selected substances
substances = features.columns  # feature (species) names
print("\nChemical Formulas for Selected Substances:")
for i, component in enumerate(pca.components_[:2]):
    print("Principal Component", i + 1, ":")
    component_formula = [substances[j] for j, coeff in enumerate(component) if abs(coeff) > 0]
    print(component_formula)

# Project the data onto the principal components
reduced_data = pca.transform(scaled_data)

# Names used for annotation (the first feature names, one per retained component)
component_names = substances[:reduced_data.shape[1]]

# Print the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_
print("\nExplained Variance Ratio:", explained_variance_ratio)

# Visualization
plt.figure(figsize=(8, 6))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], alpha=0.5)

# Annotate the first five points with substance names
for i, txt in enumerate(component_names[:5]):
    plt.annotate(txt, (reduced_data[i, 0], reduced_data[i, 1]))

# Axis labels
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

plt.title('PCA of PE Data')

# K-means clustering
kmeans = KMeans(n_clusters=6, random_state=42)  # fixed random seed
cluster_labels = kmeans.fit_predict(reduced_data)  # cluster the PCA-reduced data
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red', s=100)  # mark the cluster centers

# For each cluster, print the substances corresponding to the 5 closest points
for cluster in range(6):  # 6 clusters, as set above
    cluster_center = kmeans.cluster_centers_[cluster]  # center of the current cluster
    distances = []  # distance of every data point to this center
    for i, data_point in enumerate(reduced_data):
        distance = np.linalg.norm(data_point - cluster_center)  # Euclidean distance to the center
        distances.append((i, distance))  # store (index, distance)
    distances.sort(key=lambda x: x[1])  # sort by distance
    top_5_indices = [idx for idx, _ in distances[:5]]  # indices of the 5 closest points
    # Print the substance names for these indices
    print(f"Cluster {cluster + 1} top 5 substances:")
    for idx in top_5_indices:
        if idx < len(substances):  # guard against index overflow
            print(substances[idx])
    print()

plt.show()

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: scikit-learn in /opt/mamba/lib/python3.10/site-packages (1.5.0)
Requirement already satisfied: numpy>=1.19.5 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.2.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (3.5.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: matplotlib in /opt/mamba/lib/python3.10/site-packages (3.9.0)
Requirement already satisfied: contourpy>=1.0.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (4.53.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: numpy>=1.23 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Columns in the DataFrame:
Index(['Timestep', 'No_Moles', 'No_Specs', 'H242C120', 'H189C94', 'H4C2',
       'H49C24', 'H92C45', 'H89C45', 'H8C4', 'H44C22', 'H', 'H19C9', 'H73C36',
       'H5C3', 'H16C8', 'H67C34', 'H15C7', 'H55C28', 'H5C2', 'H11C5', 'H43C21',
       'H13C7', 'H7C3', 'H69C34', 'H10C5', 'H59C29', 'H6C3', 'H38C19',
       'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13', 'H34C17', 'H26C13', 'H9C5',
       'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4', 'H4C3', 'H2C2', 'H15C8',
       'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C', 'H6C4', 'H5C4', 'H7C5',
       'H6C5', 'H5C'],
      dtype='object')
Column Time does not exist in DataFrame
Dropping column: Timestep
Dropping column: No_Moles
Dropping column: No_Specs
Columns after dropping specified columns:
Index(['H242C120', 'H189C94', 'H4C2', 'H49C24', 'H92C45', 'H89C45', 'H8C4',
       'H44C22', 'H', 'H19C9', 'H73C36', 'H5C3', 'H16C8', 'H67C34', 'H15C7',
       'H55C28', 'H5C2', 'H11C5', 'H43C21', 'H13C7', 'H7C3', 'H69C34', 'H10C5',
       'H59C29', 'H6C3', 'H38C19', 'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13',
       'H34C17', 'H26C13', 'H9C5', 'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4',
       'H4C3', 'H2C2', 'H15C8', 'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C',
       'H6C4', 'H5C4', 'H7C5', 'H6C5', 'H5C'],
      dtype='object')
Principal Component 1:
[-0.01360365 -0.01527127  0.09468784 -0.01527127 -0.01926378 -0.01926378
 -0.14348788 -0.05876032  0.053678   -0.02499621 -0.04752367 -0.08011324
 -0.0558229  -0.02499621 -0.04710345 -0.02726246 -0.11497388 -0.02964005
 -0.04150012 -0.08514119 -0.06644979 -0.02890233 -0.12139768 -0.03678431
 -0.22408498 -0.07204899 -0.10996609 -0.19883239  0.11574646 -0.06417193
 -0.05227374 -0.03768472 -0.10140824 -0.20337462 -0.1147425  -0.11645706
 -0.04013328 -0.22498154 -0.072616    0.30427574  0.2691073  -0.05556565
  0.32071994 -0.16348593  0.2930887  -0.09559369  0.30186379  0.13926064
  0.31351538  0.06627044  0.02215976  0.17463813  0.02869012]

Principal Component 2:
[-4.54994720e-02 -5.44399225e-02  3.24256036e-01 -5.44399225e-02
 -8.05108092e-02 -8.05108092e-02  2.56692841e-01 -2.87389361e-01
 -1.16099232e-01 -1.44353286e-01 -2.46550290e-01  1.90659109e-01
 -2.79928043e-01 -1.44353286e-01 -9.96689139e-02 -1.24441570e-01
  1.64646010e-01 -1.56091436e-01 -2.03327549e-01 -2.41970321e-01
 -2.68370885e-02 -1.30733746e-01 -1.51229454e-01 -7.46464272e-02
  1.71878552e-01 -1.45829373e-01 -8.63830192e-02 -4.76818695e-02
  2.77168873e-01 -1.29829722e-01 -1.08953560e-01 -2.98137197e-02
 -6.81109683e-02  5.55139025e-02 -5.67168719e-02 -2.12631362e-02
 -3.01305610e-02  1.68727101e-01  1.11645593e-01 -2.03389274e-02
  4.32810343e-03  3.02055407e-02  8.84757424e-04  1.56591547e-01
 -2.00546033e-02  1.45801072e-01 -3.72144237e-02  1.90582323e-02
 -3.21480292e-02 -3.11333977e-04 -8.69317284e-03 -3.27043738e-02
 -7.42709255e-03]

Chemical Formulas for Selected Substances:
Principal Component 1 :
['H242C120', 'H189C94', 'H4C2', 'H49C24', 'H92C45', 'H89C45', 'H8C4', 'H44C22', 'H', 'H19C9', 'H73C36', 'H5C3', 'H16C8', 'H67C34', 'H15C7', 'H55C28', 'H5C2', 'H11C5', 'H43C21', 'H13C7', 'H7C3', 'H69C34', 'H10C5', 'H59C29', 'H6C3', 'H38C19', 'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13', 'H34C17', 'H26C13', 'H9C5', 'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4', 'H4C3', 'H2C2', 'H15C8', 'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C', 'H6C4', 'H5C4', 'H7C5', 'H6C5', 'H5C']
Principal Component 2 :
['H242C120', 'H189C94', 'H4C2', 'H49C24', 'H92C45', 'H89C45', 'H8C4', 'H44C22', 'H', 'H19C9', 'H73C36', 'H5C3', 'H16C8', 'H67C34', 'H15C7', 'H55C28', 'H5C2', 'H11C5', 'H43C21', 'H13C7', 'H7C3', 'H69C34', 'H10C5', 'H59C29', 'H6C3', 'H38C19', 'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13', 'H34C17', 'H26C13', 'H9C5', 'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4', 'H4C3', 'H2C2', 'H15C8', 'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C', 'H6C4', 'H5C4', 'H7C5', 'H6C5', 'H5C']

Explained Variance Ratio: [0.17819562 0.12522012 0.09090798 0.06825519 0.05077275 0.04599622
 0.04262152 0.04199834 0.03961585 0.03526824 0.02835994 0.02737296
 0.02371402 0.02214529 0.02105633 0.02000413 0.01865851 0.01794748
 0.01469968 0.01314307 0.012208   0.01014002 0.00922751]
Cluster 1 top 5 substances:

Cluster 2 top 5 substances:

Cluster 3 top 5 substances:
H49C24

Cluster 4 top 5 substances:
H6C2
H25C13
H34C17
H26C13
H9C5

Cluster 5 top 5 substances:
H242C120

Cluster 6 top 5 substances:
H89C45
H8C4


The complete code of this experiment is shown above; it produces the figure below. The code also prints the loadings of each substance on PC1 and PC2, as well as the substances assigned to each cluster by the clustering analysis.


The code output is as follows. First, the PCA scatter plot combined with the clustering results:

[Figure: PCA of PE Data, scatter plot of the first two principal components]

The five substances with the largest influence on the experiment are labeled in the plot.

Next comes the clustering step. The red crosses are the cluster centers produced by K-means, which we configured to produce 6 clusters. By default, KMeans picks its initial centers at random, so the result can differ from run to run; this experiment therefore fixes a random seed, which you can change before rerunning the analysis. The code also prints the substances belonging to each cluster, sorted by distance to the center; a cluster with nothing listed is empty.

[Figure: K-means cluster centers marked on the PCA scatter plot]
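The choice of 6 clusters is a modeling decision. One informal way to check it is the elbow method; here is a minimal sketch, assuming reduced_data from the PCA cell above is still in memory:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Inertia (within-cluster sum of squares) for a range of cluster counts.
ks = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(reduced_data).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow check for the K-means cluster count')
plt.show()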


Next, to help analyze how the species change during the reaction, we examine how strongly each species contributes to PC1 and PC2. This is shown more intuitively by a loading-matrix heatmap; the code to generate the loading matrix is as follows.

[18]
# -*- coding: utf-8 -*-
!pip install seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Load the cleaned data
data = pd.read_csv('pe.csv')

# Select the features (columns from the 5th onward; note that this skips
# Timestep, No_Moles, No_Specs and also the first species column, H242C120)
X = data.iloc[:, 4:].values
features = data.columns[4:]

# Run PCA on all components
pca = PCA()
X_pca = pca.fit_transform(X)

# Explained variance ratio of each component
explained_variance_ratio = pca.explained_variance_ratio_

# Keep the two components with the largest explained variance
n_components = 2
pca_selected = PCA(n_components=n_components)
X_pca_selected = pca_selected.fit_transform(X)

# Compute the loading matrix (components scaled by the square root of their variance)
loadings = pca_selected.components_.T * np.sqrt(pca_selected.explained_variance_)

# Put the loadings into a DataFrame
loading_matrix = pd.DataFrame(loadings, columns=[f'PC{i+1}' for i in range(n_components)], index=features)

# Plot the loading matrix as a heatmap
plt.figure(figsize=(14, 10))  # larger figure so the cells are easier to read
sns.heatmap(loading_matrix, annot=False, cmap='coolwarm', center=0, cbar_kws={"shrink": .8})
plt.title('PCA: Loading Matrix')
plt.xlabel('Principal Components')
plt.ylabel('Features')
plt.xticks(rotation=45)  # rotate the X-axis labels to avoid overlap
plt.yticks(rotation=0)   # keep the Y-axis labels horizontal
plt.show()

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: seaborn in /opt/mamba/lib/python3.10/site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /opt/mamba/lib/python3.10/site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in /opt/mamba/lib/python3.10/site-packages (from seaborn) (2.1.4)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /opt/mamba/lib/python3.10/site-packages (from seaborn) (3.9.0)
Requirement already satisfied: contourpy>=1.0.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.53.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (23.2)
Requirement already satisfied: pillow>=8 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/mamba/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/mamba/lib/python3.10/site-packages (from pandas>=1.2->seaborn) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /opt/mamba/lib/python3.10/site-packages (from pandas>=1.2->seaborn) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

The resulting figure is shown below.

[Figure: PCA loading-matrix heatmap ('PCA: Loading Matrix')]

To read the loading-matrix heatmap, note that the color intensity encodes the magnitude of the loading, with blue for negative loadings and red for positive ones. Species can be picked out for further analysis from several starting points (see the sketch after this list):

  1. species with large loadings on both PC1 and PC2;
  2. species whose PC1 and PC2 loadings have opposite signs;
  3. species with a large loading on only PC1 or only PC2.
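Here is a minimal sketch of these three selections, assuming the loading_matrix DataFrame from the cell above is still in memory; the 0.2 cutoff for a "large" loading is an arbitrary illustration:

import numpy as np

threshold = 0.2  # arbitrary cutoff for "large" loadings, for illustration only
abs_load = loading_matrix.abs()

# 1. Species with large loadings on both PC1 and PC2
both_large = loading_matrix[(abs_load['PC1'] > threshold) & (abs_load['PC2'] > threshold)]

# 2. Species whose PC1 and PC2 loadings have opposite signs
opposite_sign = loading_matrix[np.sign(loading_matrix['PC1']) != np.sign(loading_matrix['PC2'])]

# 3. Species with a large loading on only one of PC1 / PC2
only_one = loading_matrix[(abs_load['PC1'] > threshold) ^ (abs_load['PC2'] > threshold)]

print("Large on both PC1 and PC2:\n", both_large, "\n")
print("Opposite signs on PC1 and PC2:\n", opposite_sign.head(), "\n")
print("Large on only one component:\n", only_one)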

4. Uniform Manifold Approximation and Projection (UMAP)

In PCA, every principal component is a linear combination of the original features, and the components are ordered by the amount of variance they explain: components explaining more variance come first, those explaining less come later. The leading components therefore carry the most significant variation in the data, and within PC1 and PC2 the species with the largest loadings carry most of that variation; in this project, this shows up as larger swings in their time-concentration curves plotted in Origin. The main method of this study is PCA-based dimensionality reduction, with UMAP as an auxiliary method: we analyze the chemical species that dominate the PCA loadings, apply clustering to species that appear to be related, and check whether the UMAP results agree with the PCA results as a consistency check.

UMAP is a nonlinear dimensionality-reduction technique that maps high-dimensional data into a low-dimensional space while preserving the local distance relationships between data points. Unlike linear methods such as PCA, UMAP can capture nonlinear structure and complex relationships in the data, so it often performs better on high-dimensional data sets. Its basic idea is to optimize a loss function that embeds the high-dimensional data in a low-dimensional manifold while preserving as much of the local and global structure as possible; the computation involves computing local distances, building a neighborhood graph in the high-dimensional space, and optimizing the loss function for the low-dimensional embedding.

[12]

# -*- coding: utf-8 -*-
!pip install scikit-learn
!pip install matplotlib umap-learn

import pandas as pd
import matplotlib.pyplot as plt
import umap.umap_ as umap
from sklearn.cluster import KMeans

# Read the data
data = pd.read_csv('pe.csv')

# Inspect the column names and check whether a 'Time' column exists
print("Columns in the DataFrame:")
print(data.columns)

# Check which columns exist before dropping them
columns_to_drop = ['Time', 'Timestep', 'No_Moles', 'No_Specs']
columns_in_data = data.columns

for col in columns_to_drop:
    if col in columns_in_data:
        print(f"Dropping column: {col}")
    else:
        print(f"Column {col} does not exist in DataFrame")

# Drop the columns, ignoring any that do not exist
features = data.drop(columns_to_drop, axis=1, errors='ignore')

# Confirm the remaining columns
print("Columns after dropping specified columns:")
print(features.columns)

# Species names, used below for annotation (defined here so this cell is self-contained)
substances = features.columns

# Initialize and fit the UMAP model
umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(features)

# Cluster with KMeans
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(features)
cluster_labels = kmeans.labels_

# Plot the results
plt.figure(figsize=(10, 8))

# UMAP embedding colored by cluster label
plt.subplot(1, 2, 1)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=cluster_labels, cmap='viridis', alpha=0.7)
plt.colorbar(label='Cluster Label')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('UMAP Visualization with Clustering')

# Annotate the first five points with substance names
for i, txt in enumerate(substances[:5]):
    plt.annotate(txt, (X_umap[i, 0], X_umap[i, 1]), fontsize=8, xytext=(5, -5),
                 textcoords='offset points')  # offset the annotations slightly

# Plain scatter of the UMAP embedding
plt.subplot(1, 2, 2)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c='blue', alpha=0.3)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('UMAP Visualization of Molecule Data')

plt.show()
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: scikit-learn in /opt/mamba/lib/python3.10/site-packages (1.5.0)
Requirement already satisfied: numpy>=1.19.5 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.2.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn) (3.5.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: matplotlib in /opt/mamba/lib/python3.10/site-packages (3.9.0)
Requirement already satisfied: umap-learn in /opt/mamba/lib/python3.10/site-packages (0.5.6)
Requirement already satisfied: contourpy>=1.0.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (4.53.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: numpy>=1.23 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/mamba/lib/python3.10/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: scipy>=1.3.1 in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (1.11.4)
Requirement already satisfied: scikit-learn>=0.22 in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (1.5.0)
Requirement already satisfied: numba>=0.51.2 in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (0.60.0)
Requirement already satisfied: pynndescent>=0.5 in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (0.5.12)
Requirement already satisfied: tqdm in /opt/mamba/lib/python3.10/site-packages (from umap-learn) (4.64.1)
Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /opt/mamba/lib/python3.10/site-packages (from numba>=0.51.2->umap-learn) (0.43.0)
Requirement already satisfied: joblib>=0.11 in /opt/mamba/lib/python3.10/site-packages (from pynndescent>=0.5->umap-learn) (1.4.2)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/mamba/lib/python3.10/site-packages (from scikit-learn>=0.22->umap-learn) (3.5.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Columns in the DataFrame:
Index(['Timestep', 'No_Moles', 'No_Specs', 'H242C120', 'H189C94', 'H4C2',
       'H49C24', 'H92C45', 'H89C45', 'H8C4', 'H44C22', 'H', 'H19C9', 'H73C36',
       'H5C3', 'H16C8', 'H67C34', 'H15C7', 'H55C28', 'H5C2', 'H11C5', 'H43C21',
       'H13C7', 'H7C3', 'H69C34', 'H10C5', 'H59C29', 'H6C3', 'H38C19',
       'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13', 'H34C17', 'H26C13', 'H9C5',
       'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4', 'H4C3', 'H2C2', 'H15C8',
       'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C', 'H6C4', 'H5C4', 'H7C5',
       'H6C5', 'H5C'],
      dtype='object')
Column Time does not exist in DataFrame
Dropping column: Timestep
Dropping column: No_Moles
Dropping column: No_Specs
Columns after dropping specified columns:
Index(['H242C120', 'H189C94', 'H4C2', 'H49C24', 'H92C45', 'H89C45', 'H8C4',
       'H44C22', 'H', 'H19C9', 'H73C36', 'H5C3', 'H16C8', 'H67C34', 'H15C7',
       'H55C28', 'H5C2', 'H11C5', 'H43C21', 'H13C7', 'H7C3', 'H69C34', 'H10C5',
       'H59C29', 'H6C3', 'H38C19', 'H29C15', 'H9C4', 'H3C', 'H6C2', 'H25C13',
       'H34C17', 'H26C13', 'H9C5', 'H30C15', 'H11C6', 'H18C9', 'H10C4', 'H7C4',
       'H4C3', 'H2C2', 'H15C8', 'H2', 'H14C8', 'H3C2', 'H8C5', 'H4C', 'H2C',
       'H6C4', 'H5C4', 'H7C5', 'H6C5', 'H5C'],
      dtype='object')
/opt/mamba/lib/python3.10/site-packages/umap/umap_.py:1945: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")