Using NistChempy

Tags: NistChempy, python

CLiu

Updated 2024-11-09

Recommended image: Basic Image:ubuntu22.04-py3.10-irkernel-r4.4.1

Recommended machine type: c2_m4_cpu

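Tutorial: using nistchempy with the NIST Chemistry WebBook

Compound properties

Initialization

Properties

Basic properties

Reference properties

Extracted properties

Data extraction example

MOL files

Spectra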

[1]

!pip install nistchempy

!pip install rdkit

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting nistchempy
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/bc/a1/d81af983832114e23a1cf6ff8ceaa345d49904a72a72762c8aac6bb55b8a/NistChemPy-1.0.2-py3-none-any.whl (10.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.6/10.6 MB 27.1 MB/s eta 0:00:00
Requirement already satisfied: requests in /opt/mamba/lib/python3.10/site-packages (from nistchempy) (2.28.1)
Requirement already satisfied: beautifulsoup4 in /opt/mamba/lib/python3.10/site-packages (from nistchempy) (4.11.2)
Requirement already satisfied: pandas in /opt/mamba/lib/python3.10/site-packages (from nistchempy) (1.5.3)
Requirement already satisfied: soupsieve>1.2 in /opt/mamba/lib/python3.10/site-packages (from beautifulsoup4->nistchempy) (2.4)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/mamba/lib/python3.10/site-packages (from pandas->nistchempy) (2.8.2)
Requirement already satisfied: numpy>=1.21.0 in /opt/mamba/lib/python3.10/site-packages (from pandas->nistchempy) (1.24.2)
Requirement already satisfied: pytz>=2020.1 in /opt/mamba/lib/python3.10/site-packages (from pandas->nistchempy) (2022.7.1)
Requirement already satisfied: certifi>=2017.4.17 in /opt/mamba/lib/python3.10/site-packages (from requests->nistchempy) (2022.9.24)
Requirement already satisfied: idna<4,>=2.5 in /opt/mamba/lib/python3.10/site-packages (from requests->nistchempy) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/mamba/lib/python3.10/site-packages (from requests->nistchempy) (1.26.11)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/mamba/lib/python3.10/site-packages (from requests->nistchempy) (2.1.1)
Requirement already satisfied: six>=1.5 in /opt/mamba/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas->nistchempy) (1.16.0)
Installing collected packages: nistchempy
Successfully installed nistchempy-1.0.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting rdkit
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d2/f3/9125802d1403f56fc6d758dbec3a66fae6ad7023d396ecf5a29af27c78aa/rdkit-2024.3.6-cp310-cp310-manylinux_2_28_x86_64.whl (32.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32.8/32.8 MB 6.1 MB/s eta 0:00:00
Requirement already satisfied: Pillow in /opt/mamba/lib/python3.10/site-packages (from rdkit) (10.4.0)
Requirement already satisfied: numpy in /opt/mamba/lib/python3.10/site-packages (from rdkit) (1.24.2)
Installing collected packages: rdkit
Successfully installed rdkit-2024.3.6
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

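Tutorial: using nistchempy with the NIST Chemistry WebBook

This tutorial shows how to use the nistchempy library to retrieve basic compound properties and spectral data. The examples look up compounds by NIST compound ID, CAS registry number, and InChI string.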

Compound properties

Initialization

A NIST Chemistry WebBook compound can be initialized from its NIST compound ID, CAS registry number, or InChI string:

[2]

import nistchempy as nist

X = nist.get_compound('C632053')

NistCompound(ID=C632053)

[3]

X = nist.get_compound('632-05-3')

NistCompound(ID=C632053)

[4]

X = nist.get_compound('InChI=1S/C4H7Br3/c1-3(6)4(7)2-5/h3-4H,2H2,1H3')

NistCompound(ID=C632053)
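
Before sending a CAS registry number to the WebBook, it can be worth validating it locally. The sketch below applies the standard CAS check-digit rule (digits weighted 1, 2, 3, ... from the right, sum taken modulo 10). The helper names `is_valid_cas` and `cas_to_nist_id` are hypothetical and not part of nistchempy, and the "C + CAS digits" mapping is only an assumption inferred from the IDs shown in this tutorial (632-05-3 and C632053); it may not hold for every compound.

```python
import re

# CAS registry numbers have the form: 2-7 digits, 2 digits, 1 check digit
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas_rn: str) -> bool:
    """Return True if cas_rn is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas_rn)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def cas_to_nist_id(cas_rn: str) -> str:
    """Assumed mapping (inferred from this tutorial's examples): 'C' + CAS digits."""
    return "C" + cas_rn.replace("-", "")


print(is_valid_cas("632-05-3"), cas_to_nist_id("632-05-3"))
```

A failed check only means the string is not a valid CAS number; a valid one can still be absent from the WebBook.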

If the NIST Chemistry WebBook has no compound for the given identifier, nist.get_compound returns None. It also returns None when several substances match the given InChI.

For details, see https://webbook.nist.gov/chemistry/
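
Because a failed lookup yields None rather than an exception, batch code should check each result before using it. The following is a minimal sketch of that pattern; `resolve_compounds` is a hypothetical helper, and the lookup function is passed in as a parameter so that the example runs offline with a stand-in dictionary. With the library installed, `nist.get_compound` would be passed instead.

```python
def resolve_compounds(identifiers, lookup):
    """Split identifiers into those that resolve and those that return None."""
    found, missing = {}, []
    for ident in identifiers:
        compound = lookup(ident)
        if compound is None:
            missing.append(ident)  # not in the database, or ambiguous
        else:
            found[ident] = compound
    return found, missing


# Offline stand-in for nist.get_compound (assumption: same None-on-miss behavior)
fake_database = {"C632053": "NistCompound(ID=C632053)"}
found, missing = resolve_compounds(["C632053", "C000000"], fake_database.get)
print("found:", list(found))
print("missing:", missing)

# With nistchempy:
# found, missing = resolve_compounds(["C632053", "632-05-3"], nist.get_compound)
```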

Properties

A nist.compound.NistCompound object holds the information extracted from the compound's NIST Chemistry WebBook page. Its properties fall into three groups:

Basic properties

ID: NIST compound ID;
name: chemical name;
synonyms: synonyms;
formula: chemical formula;
mol_weight: molecular weight;
inchi / inchi_key: InChI / InChIKey strings;
cas_rn: CAS registry number.
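
The basic properties listed above can be gathered into a plain dictionary, for example to build one table row per compound. This is a minimal sketch: `basic_properties` and `BASIC_FIELDS` are hypothetical names, and a `SimpleNamespace` filled with the anthracene values from this tutorial stands in for a real NistCompound object so that the example runs without network access.

```python
from types import SimpleNamespace

# Attribute names taken from the list of basic properties above
BASIC_FIELDS = ("ID", "name", "synonyms", "formula", "mol_weight",
                "inchi", "inchi_key", "cas_rn")


def basic_properties(compound) -> dict:
    """Collect the basic properties of a compound-like object into a dict."""
    # getattr with a default keeps the helper usable if a field is absent
    return {field: getattr(compound, field, None) for field in BASIC_FIELDS}


# Stand-in object; with the library, pass the result of nist.get_compound(...)
demo = SimpleNamespace(ID="C120127", name="Anthracene", formula="C14H10",
                       mol_weight=178.2292, cas_rn="120-12-7")
props = basic_properties(demo)
print(props["name"], props["formula"], props["mol_weight"])
```

A list of such dictionaries can be handed directly to `pandas.DataFrame` when several compounds need to be compared.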

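Reference properties

Reference properties are dictionaries of the form {property name => URL}. There are four subgroups:

mol_refs: molecular properties, including the 2D and 3D MOL files;
data_refs: WebBook properties, stored in the NIST Chemistry WebBook;
nist_public_refs: other properties, stored on public NIST websites;
nist_subscription_refs: other properties, stored on subscription (paid) NIST websites.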
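
Since each reference is an ordinary URL, the standard library is enough to inspect it. The sketch below splits a data_refs URL into its query parameters and page section; the URL is the mass spectrum entry from this tutorial's anthracene example, and `describe_ref` is a hypothetical helper name. What the `Mask` value means internally is not documented here, so the code only reports it.

```python
from urllib.parse import urlparse, parse_qs


def describe_ref(url: str) -> dict:
    """Break a WebBook reference URL into host, query fields, and section."""
    parts = urlparse(url)
    query = parse_qs(parts.query)  # maps each key to a list of values
    return {
        "host": parts.netloc,
        "compound_id": query.get("ID", [None])[0],
        "units": query.get("Units", [None])[0],
        "mask": query.get("Mask", [None])[0],
        "section": parts.fragment or None,
    }


url = "https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=200#Mass-Spec"
info = describe_ref(url)
print(info)
```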

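Extracted properties

Extracted properties are retrieved from the URLs given by the reference properties:

mol2D / mol3D: text blocks of the 2D / 3D MOL files;
ir_specs / thz_specs / ms_specs / uv_specs: JDX-format text blocks of the IR / THz / MS / UV spectra.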

Data extraction example

[5]

s = nist.run_search('anthracene', 'name')

X = s.compounds[0]

X.__dict__

{'ID': 'C120127',
 'name': 'Anthracene',
 'synonyms': ['Anthracin',
  'Green Oil',
  'Paranaphthalene',
  'Tetra Olive N2G',
  'Anthracene oil',
  'p-Naphthalene',
  'Anthracen',
  'Coal tar pitch volatiles:anthracene',
  'Sterilite hop defoliant'],
 'formula': 'C14H10',
 'mol_weight': 178.2292,
 'inchi': 'InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H',
 'inchi_key': 'MWPLVEDNUUSJAV-UHFFFAOYSA-N',
 'cas_rn': '120-12-7',
 'mol_refs': {'mol2D': 'https://webbook.nist.gov/cgi/cbook.cgi?Str2File=C120127',
  'mol3D': 'https://webbook.nist.gov/cgi/cbook.cgi?Str3File=C120127'},
 'data_refs': {'cTG': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=1#Thermo-Gas',
  'cTC': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=2#Thermo-Condensed',
  'cTP': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=4#Thermo-Phase',
  'cTR': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=8#Thermo-React',
  'cSO': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=10#Solubility',
  'cIE': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=20#Ion-Energetics',
  'cIC': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=40#Ion-Cluster',
  'cIR': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=80#IR-Spec',
  'cMS': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=200#Mass-Spec',
  'cUV': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=400#UV-Vis-Spec',
  'cGC': 'https://webbook.nist.gov/cgi/cbook.cgi?ID=C120127&Units=SI&Mask=2000#Gas-Chrom'},
 'nist_public_refs': {'Gas Phase Kinetics Database': 'https://kinetics.nist.gov/kinetics/rpSearch?cas=120127',
  'X-ray Photoelectron Spectroscopy Database, version 5.0': 'https://srdata.nist.gov/xps/SpectralByCompdDd/21197',
  'NIST Polycyclic Aromatic Hydrocarbon Structure Index': 'https://pah.nist.gov/?q=pah015'},
 'nist_subscription_refs': {'NIST / TRC Web Thermo Tables, "lite" edition (thermophysical and thermochemical data)': 'https://wtt-lite.nist.gov/wtt-lite/index.html?cmp=anthracene',
  'NIST / TRC Web Thermo Tables, professional edition (thermophysical and thermochemical data)': 'https://wtt-pro.nist.gov/wtt-pro/index.html?cmp=anthracene'},
 'nist_response': NistResponse(ok=True, content_type='text/html; charset=UTF-8'),
 'mol2D': None,
 'mol3D': None,
 'ir_specs': [],
 'thz_specs': [],
 'ms_specs': [],
 'uv_specs': []}

MOL files

To load MOL files, use the get_mol2D, get_mol3D, or get_molfiles methods:

[6]

X.get_molfiles()

[7]

def format_mol2d(mol2d):
    """Format a V2000 MOL block into labeled sections."""
    # splitlines() handles both '\r\n' and '\n' line endings
    lines = mol2d.strip().splitlines()

    # A V2000 MOL block starts with three header lines and a counts line
    title = lines[0]         # line 1: molecule name
    program_info = lines[1]  # line 2: program name and timestamp
    comment = lines[2]       # line 3: comment (here, the copyright notice)
    counts_line = lines[3]   # line 4: atom/bond counts and the V2000 tag

    # The counts line stores the atom and bond counts in two 3-character fields
    n_atoms = int(counts_line[0:3])
    n_bonds = int(counts_line[3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]

    separator = "\n" + "-" * 30 + "\n"
    formatted_output = [
        "Molecule information:",
        f"- Title: {title}",
        f"- Program/timestamp: {program_info.strip()}",
        f"- Comment: {comment}",
        separator,
        "Counts line (V2000):",
        counts_line,
        separator,
        f"Atom block ({n_atoms} atoms):",
        *atom_lines,
        separator,
        f"Bond block ({n_bonds} bonds):",
        *bond_lines,
        "\nM  END",
    ]
    return "\n".join(formatted_output)

# Format the 2D MOL block loaded by X.get_molfiles() and print the result
formatted_result = format_mol2d(X.mol2D)
print(formatted_result)

Molecule information:
- Title: Anthracene, ID: C120127
- Program/timestamp: NIST    24110905222D 1   1.00000     0.00000
- Comment: Copyright by the U.S. Sec. Commerce on behalf of U.S.A. All rights reserved.

------------------------------

Counts line (V2000):
 14 16  0     0  0              1 V2000

------------------------------

Atom block (14 atoms):
    0.0000    1.4838    0.0000 C   0  0  0  0  0  0           0  0  0
    0.0000    0.5117    0.0000 C   0  0  0  0  0  0           0  0  0
    0.8698    1.9955    0.0000 C   0  0  0  0  0  0           0  0  0
    0.8698    0.0000    0.0000 C   0  0  0  0  0  0           0  0  0
    1.7397    0.5117    0.0000 C   0  0  0  0  0  0           0  0  0
    1.7397    1.4838    0.0000 C   0  0  0  0  0  0           0  0  0
    2.5583    1.9955    0.0000 C   0  0  0  0  0  0           0  0  0
    2.5583    0.0000    0.0000 C   0  0  0  0  0  0           0  0  0
    3.4793    0.5117    0.0000 C   0  0  0  0  0  0           0  0  0
    3.4793    1.4838    0.0000 C   0  0  0  0  0  0           0  0  0
    4.3492    1.9955    0.0000 C   0  0  0  0  0  0           0  0  0
    4.3492    0.0000    0.0000 C   0  0  0  0  0  0           0  0  0
    5.2190    0.5117    0.0000 C   0  0  0  0  0  0           0  0  0
    5.2190    1.4838    0.0000 C   0  0  0  0  0  0           0  0  0

------------------------------

Bond block (16 bonds):
  2  1  2  0     0  0
  1  3  1  0     0  0
  4  2  1  0     0  0
  3  6  2  0     0  0
  5  4  2  0     0  0
  5  6  1  0     0  0
  8  5  1  0     0  0
  6  7  1  0     0  0
  7 10  2  0     0  0
  9  8  2  0     0  0
  9 10  1  0     0  0
 12  9  1  0     0  0
 10 11  1  0     0  0
 11 14  2  0     0  0
 13 12  2  0     0  0
 14 13  1  0     0  0

M  END
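
When the coordinates and connectivity are needed as data rather than as printed text, the fixed-width V2000 layout can be parsed directly. The following is a minimal sketch, assuming the block starts at the title line and follows the V2000 column layout seen in the output above; `parse_molblock` is a hypothetical helper, and the two-atom sample block is made up for illustration so the example runs offline. For real work, RDKit's `Chem.MolFromMolBlock` is the more robust choice.

```python
from collections import Counter


def parse_molblock(molblock: str):
    """Parse the atom and bond blocks of a V2000 MOL block."""
    lines = molblock.splitlines()
    counts = lines[3]  # fourth line: counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three 10-character fields; the symbol is in columns 32-34
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # first atom index, second atom index, bond order (3 characters each)
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


# Made-up sample block (not NIST data), built line by line to keep the spacing exact
sample = "\n".join([
    "Example, ID: hypothetical",
    "  NIST",
    "Hypothetical comment line",
    "  2  1  0     0  0              1 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0           0  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0           0  0  0",
    "  1  2  2  0     0  0",
    "M  END",
])
atoms, bonds = parse_molblock(sample)
element_counts = Counter(symbol for symbol, *_ in atoms)
print(element_counts, bonds)
```

Note that NIST 2D MOL files list heavy atoms only, so hydrogen counts do not appear in the tally.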

[8]

from rdkit import Chem

mol = Chem.MolFromMolBlock(X.mol2D)

mol

Spectra

To load spectra, use the get_ir_spectra, get_thz_spectra, get_ms_spectra, get_uv_spectra, and get_all_spectra methods:

[9]

X.ir_specs, X.thz_specs, X.ms_specs, X.uv_specs

([], [], [], [])

[10]

X.get_ms_spectra()

X.ir_specs, X.thz_specs, X.ms_specs, X.uv_specs

([], [], [Spectrum(C120127, Mass spectrum #0)], [])

A Spectrum object holds the spectrum's JDX-format text block, which includes both the metadata and the spectral data:

[11]

ms = X.ms_specs[0]

print(ms.jdx_text)

##TITLE=Anthracene
##JCAMP-DX=4.24
##DATA TYPE=MASS SPECTRUM
##ORIGIN=Japan AIST/NIMC Database- Spectrum MS-NW- 132
##OWNER=NIST Mass Spectrometry Data Center
Collection (C) 2014 copyright by the U.S. Secretary of Commerce
on behalf of the United States of America. All rights reserved.
##CAS REGISTRY NO=120-12-7
##$NIST MASS SPEC NO=228201
##MOLFORM=C14 H10
##MW=178
##$NIST SOURCE=MSDC
##XUNITS=M/Z
##YUNITS=RELATIVE INTENSITY
##XFACTOR=1
##YFACTOR=1
##FIRSTX=27
##LASTX=181
##FIRSTY=20
##MAXX=181
##MINX=27
##MAXY=9999
##MINY=10
##NPOINTS=62
##PEAK TABLE=(XY..XY)
27,20 28,10 38,30 39,109
50,129 51,129 52,30 61,40
62,129 63,289 64,20 65,20
69,20 73,10 74,219 75,299
76,619 77,80 78,10 83,50
85,30 86,99 87,169 88,439
89,759 90,10 98,119 99,90
100,50 101,50 102,60 110,40
111,50 113,60 114,20 115,50
122,40 123,20 124,20 125,50
126,149 127,60 128,80 137,30
138,30 139,209 140,80 149,70
150,419 151,629 152,689 153,80
163,50 164,20 174,129 175,199
176,1409 177,799 178,9999 179,1569
180,149 181,30
##END=
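
The metadata in a JDX block is stored as `##LABEL=value` records, so it can be pulled into a dictionary with a few lines of standard Python. This is a minimal sketch rather than a full JCAMP-DX parser: `parse_jdx_headers` is a hypothetical helper, continuation lines that do not start with `##` (such as the copyright text after `##OWNER`) are skipped, and the sample is an excerpt of the anthracene output above.

```python
def parse_jdx_headers(jdx_text: str) -> dict:
    """Collect the '##LABEL=value' records of a JDX text block into a dict."""
    headers = {}
    for line in jdx_text.splitlines():
        if not line.startswith("##"):
            continue  # data rows and continuation lines are ignored here
        key, sep, value = line[2:].partition("=")
        if sep:
            headers[key.strip()] = value.strip()
    return headers


# Excerpt of the JDX text printed above
sample_jdx = """##TITLE=Anthracene
##JCAMP-DX=4.24
##DATA TYPE=MASS SPECTRUM
##CAS REGISTRY NO=120-12-7
##MW=178
##NPOINTS=62
##PEAK TABLE=(XY..XY)
27,20 28,10 38,30 39,109
##END=
"""
headers = parse_jdx_headers(sample_jdx)
print(headers["TITLE"], headers["DATA TYPE"], int(headers["NPOINTS"]))
```

With the library, the same function could be applied to `ms.jdx_text`.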

[29]

import matplotlib.pyplot as plt
from matplotlib import rcParams

# Use 'DejaVu Sans', a font bundled with Matplotlib
rcParams['font.family'] = 'DejaVu Sans'

# ms.jdx_text holds the mass spectrum in JCAMP-DX format; split it into lines
jdx_text = ms.jdx_text.splitlines()

# Find the line after "##PEAK TABLE", where the peak data starts
try:
    start_index = next(i for i, line in enumerate(jdx_text) if line.startswith("##PEAK TABLE")) + 1
    peak_data = jdx_text[start_index:]
except StopIteration:
    print("No peak table data found. Please check the input file format.")
    peak_data = []

# Convert the peak table into m/z and relative intensity values
mz_values = []
intensity_values = []

if peak_data:
    for line in peak_data:
        for peak in line.split():  # each line holds several "m/z,intensity" pairs
            try:
                mz, intensity = map(float, peak.split(","))
                mz_values.append(mz)
                intensity_values.append(intensity)
            except ValueError:
                if peak.strip() == "##END=":
                    continue  # ignore the END marker
                print(f"Unable to parse peak data: {peak}")

    if intensity_values:
        # Report the maximum and minimum intensity values
        max_intensity = max(intensity_values)
        min_intensity = min(intensity_values)
        print(f"Max Intensity: {max_intensity}")
        print(f"Min Intensity: {min_intensity}")

        # Plot the mass spectrum as a bar chart
        plt.figure(figsize=(10, 6))
        plt.bar(mz_values, intensity_values, width=0.5, color='blue', edgecolor='black')
        plt.title("Mass Spectrum of Anthracene")
        plt.xlabel("m/z (Mass-to-Charge Ratio)")
        plt.ylabel("Relative Intensity")
        plt.show()
    else:
        print("No valid intensity data found.")
else:
    print("Peak table data is empty; cannot plot mass spectrum.")

Max Intensity: 9999.0
Min Intensity: 10.0
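
The raw intensities run from 10 to 9999, so it is often easier to read them as percentages of the base peak (the most intense peak). The sketch below rescales the values and returns the strongest peaks; `top_peaks` is a hypothetical helper, and the six sample peaks are copied from the end of the anthracene peak table above so that the example runs on its own.

```python
def top_peaks(mz_values, intensity_values, n=3):
    """Return the n strongest peaks as (m/z, percent of base peak) pairs."""
    base = max(intensity_values)  # base peak intensity
    relative = [(mz, 100.0 * inten / base)
                for mz, inten in zip(mz_values, intensity_values)]
    return sorted(relative, key=lambda pair: pair[1], reverse=True)[:n]


# Last six peaks of the anthracene peak table (m/z, raw intensity)
sample_mz = [176, 177, 178, 179, 180, 181]
sample_intensity = [1409, 799, 9999, 1569, 149, 30]

for mz, percent in top_peaks(sample_mz, sample_intensity):
    print(f"m/z {mz}: {percent:.1f}%")
```

In this excerpt the base peak falls at m/z 178, which matches the `##MW=178` record in the JDX header. In the notebook, the function can be called with the `mz_values` and `intensity_values` lists built by the plotting cell.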
