PyMuPDF: A Handy Helper for PDF Document Processing
莫凡洋
Updated 2024-11-13
Recommended image: mfy:02
Recommended machine type: c2_m4_cpu
PyMuPdf(v1)

©️ Copyright 2024 @ Authors
Author: 莫凡洋
Date: 2024-11-09
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

🎯 This tutorial aims to get you quickly up to speed on processing PDF documents with PyMuPDF.

  • Extract text from a PDF

  • Extract images from a PDF

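Introduction to PyMuPDF

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF (and other) documents.

Official page: https://pypi.org/project/PyMuPDF/
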
[1]
!pip install PyMuPDF
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting PyMuPDF
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ca/21/ad66778ad2485f87ef1d5a36f17ec8d4aee8ce247c8e46c673eff776a877/PyMuPDF-1.24.11-cp38-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.6/19.6 MB 17.9 MB/s eta 0:00:00
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.24.11
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Text extraction, method 1:

[2]
# Import the PyMuPDF library
import pymupdf
doc1 = pymupdf.open("/bohr/PyMuPdf-9q7v/v1/ja7b02983_si_001.pdf")  # Open a PDF document; you can upload your own PDF and change the file path
doc1
Document('/personal/PDF/ja7b02983_si_001.pdf')
[3]
# Extract the content of a single page, e.g. page 7 (index 6, because page indices start at 0)
text = doc1[6].get_text()
text
'S7\xa0\n\xa0\n4.2. \nCharacterization\xa0\nFor\xa0 simplicity,\xa0 for\xa0 amides\xa0 present\xa0 as\xa0 rotameric\xa0 mixtures\xa0 in\xa0 NMR\xa0 spectra,\xa0 only\xa0 the\xa0 major\xa0 rotamer\xa0 is\xa0\ndescribed.\xa0\n\xa0\n1‐(Pyrrolidin‐1‐yl)undec‐10‐en‐1‐one\xa0(1a)\xa0\nGeneral\xa0 Procedure\xa0 A;\xa0 (Quant.).\xa0 All\xa0 analytical\xa0 data\xa0 were\xa0 in\xa0 good\xa0\naccordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.[1]\xa0\n\xa0\n1‐(Pyrrolidin‐1‐yl)hex‐5‐en‐1‐one\xa0(1b)\xa0\n\xa0\nGeneral\xa0Procedure\xa0B;\xa0(87%).\xa01H‐NMR\xa0(400\xa0MHz,\xa0CDCl3):\xa0δ\xa05.86–5.74\xa0(m,\xa01H),\xa05.05–4.94\xa0(m,\xa02H),\xa03.45\xa0(t,\xa0J\xa0\n=\xa06.9\xa0Hz,\xa02H),\xa03.39\xa0(t,\xa0J\xa0=\xa06.9\xa0Hz,\xa02H),\xa02.45\xa0(t,\xa0J\xa0=\xa07.6\xa0Hz,\xa02H),\xa02.15–2.08\xa0(m,\xa02H),\xa01.98–1.89\xa0(m,\xa02H),\xa01.88–\n1.80\xa0(m,\xa02H),\xa01.80–1.72\xa0(m,\xa02H);\xa013C‐NMR\xa0(100\xa0MHz,\xa0CDCl3):\xa0δ\xa0171.6,\xa0138.4,\xa0115.1,\xa046.7,\xa045.7,\xa034.1,\xa033.5,\xa0\n26.3,\xa0 24.6,\xa0 24.1;\xa0 IR\xa0 (neat) νmax:\xa0 2972,\xa0 2949,\xa0 2872,\xa0 1637,\xa0 1431,\xa0 1342,\xa0 911;\xa0 HRMS\xa0 (ESI+):\xa0 exact\xa0 
mass\xa0\ncalculated\xa0for\xa0[M+Na]+\xa0(C10H17NONa)\xa0requires\xa0m/z\xa0190.1202,\xa0found\xa0m/z\xa0190.1197.\xa0\n\xa0\n4‐Phenyl‐1‐(pyrrolidin‐1‐yl)butan‐1‐one\xa0(1c)\xa0\n\xa0General\xa0Procedure\xa0A;\xa0(Quant.).\xa0All\xa0analytical\xa0data\xa0were\xa0in\xa0good\xa0accordance\xa0with\xa0\ndata\xa0reported\xa0in\xa0the\xa0literature.[2]\xa0\n\xa0\n1‐(Pyrrolidin‐1‐yl)hex‐5‐yn‐1‐one\xa0(1d)\xa0\n\xa0\nGeneral\xa0Procedure\xa0B;\xa0(93%).\xa01H‐NMR\xa0(400\xa0MHz,\xa0CDCl3):\xa0δ\xa03.44\xa0(dt,\xa0J\xa0=\xa011.4,\xa06.8\xa0Hz,\xa04H),\xa02.40\xa0(t,\xa0J\xa0=\xa07.3\xa0\nHz,\xa02H),\xa02.29\xa0(td,\xa0J\xa0=\xa06.8,\xa02.6\xa0Hz,\xa02H),\xa01.99–1.81\xa0(m,\xa07H);\xa013C‐NMR\xa0(100\xa0MHz,\xa0CDCl3):\xa0δ\xa0170.9,\xa084.2,\xa068.9,\xa0\n46.7,\xa045.7,\xa033.2,\xa026.5,\xa024.5,\xa023.7,\xa018.1;\xa0IR\xa0(neat) νmax:\xa03224,\xa02970,\xa02951,\xa02873,\xa01630,\xa01437,\xa01343,\xa0633;\xa0\n'
[4]
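# You can of course write a loop to extract several pages and combine them

result = ""  # Start with an empty string to hold the extracted content

for i in range(6, 10):  # Extract pages 7 through 10; note that Python's range includes the start and excludes the end
    text = doc1[i].get_text()
    result += text + " "  # Concatenate each page's content, separated by a space

result.strip()  # strip() returns a copy without leading/trailing whitespace; the copy is discarded here, so result is unchanged (assign it back with result = result.strip() to keep the trimmed text)
result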
'S7\xa0\n\xa0\n4.2. \nCharacterization\xa0\nFor\xa0 simplicity,\xa0 for\xa0 amides\xa0 present\xa0 as\xa0 rotameric\xa0 mixtures\xa0 in\xa0 NMR\xa0 spectra,\xa0 only\xa0 the\xa0 major\xa0 rotamer\xa0 is\xa0\ndescribed.\xa0\n\xa0\n1‐(Pyrrolidin‐1‐yl)undec‐10‐en‐1‐one\xa0(1a)\xa0\nGeneral\xa0 Procedure\xa0 A;\xa0 (Quant.).\xa0 All\xa0 analytical\xa0 data\xa0 were\xa0 in\xa0 good\xa0\naccordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.[1]\xa0\n\xa0\n1‐(Pyrrolidin‐1‐yl)hex‐5‐en‐1‐one\xa0(1b)\xa0\n\xa0\nGeneral\xa0Procedure\xa0B;\xa0(87%).\xa01H‐NMR\xa0(400\xa0MHz,\xa0CDCl3):\xa0δ\xa05.86–5.74\xa0(m,\xa01H),\xa05.05–4.94\xa0(m,\xa02H),\xa03.45\xa0(t,\xa0J\xa0\n=\xa06.9\xa0Hz,\xa02H),\xa03.39\xa0(t,\xa0J\xa0=\xa06.9\xa0Hz,\xa02H),\xa02.45\xa0(t,\xa0J\xa0=\xa07.6\xa0Hz,\xa02H),\xa02.15–2.08\xa0(m,\xa02H),\xa01.98–1.89\xa0(m,\xa02H),\xa01.88–\n1.80\xa0(m,\xa02H),\xa01.80–1.72\xa0(m,\xa02H);\xa013C‐NMR\xa0(100\xa0MHz,\xa0CDCl3):\xa0δ\xa0171.6,\xa0138.4,\xa0115.1,\xa046.7,\xa045.7,\xa034.1,\xa033.5,\xa0\n26.3,\xa0 24.6,\xa0 24.1;\xa0 IR\xa0 (neat) νmax:\xa0 2972,\xa0 2949,\xa0 2872,\xa0 1637,\xa0 1431,\xa0 1342,\xa0 911;\xa0 HRMS\xa0 (ESI+):\xa0 exact\xa0 
mass\xa0\ncalculated\xa0for\xa0[M+Na]+\xa0(C10H17NONa)\xa0requires\xa0m/z\xa0190.1202,\xa0found\xa0m/z\xa0190.1197.\xa0\n\xa0\n4‐Phenyl‐1‐(pyrrolidin‐1‐yl)butan‐1‐one\xa0(1c)\xa0\n\xa0General\xa0Procedure\xa0A;\xa0(Quant.).\xa0All\xa0analytical\xa0data\xa0were\xa0in\xa0good\xa0accordance\xa0with\xa0\ndata\xa0reported\xa0in\xa0the\xa0literature.[2]\xa0\n\xa0\n1‐(Pyrrolidin‐1‐yl)hex‐5‐yn‐1‐one\xa0(1d)\xa0\n\xa0\nGeneral\xa0Procedure\xa0B;\xa0(93%).\xa01H‐NMR\xa0(400\xa0MHz,\xa0CDCl3):\xa0δ\xa03.44\xa0(dt,\xa0J\xa0=\xa011.4,\xa06.8\xa0Hz,\xa04H),\xa02.40\xa0(t,\xa0J\xa0=\xa07.3\xa0\nHz,\xa02H),\xa02.29\xa0(td,\xa0J\xa0=\xa06.8,\xa02.6\xa0Hz,\xa02H),\xa01.99–1.81\xa0(m,\xa07H);\xa013C‐NMR\xa0(100\xa0MHz,\xa0CDCl3):\xa0δ\xa0170.9,\xa084.2,\xa068.9,\xa0\n46.7,\xa045.7,\xa033.2,\xa026.5,\xa024.5,\xa023.7,\xa018.1;\xa0IR\xa0(neat) νmax:\xa03224,\xa02970,\xa02951,\xa02873,\xa01630,\xa01437,\xa01343,\xa0633;\xa0\n S8\xa0\n\xa0\nHRMS\xa0 (ESI+):\xa0 exact\xa0 mass\xa0 calculated\xa0 for\xa0 [M+Na]+\xa0 (C10H15NONa)\xa0 requires\xa0 m/z\xa0 188.1046,\xa0 found\xa0 m/z\xa0\n188.1041.\xa0\n\xa0\n6‐Chloro‐1‐(pyrrolidin‐1‐yl)hexan‐1‐one\xa0(1e)\xa0\n\xa0General\xa0Procedure\xa0A;\xa0(Quant.).\xa0All\xa0analytical\xa0data\xa0were\xa0in\xa0good\xa0accordance\xa0\nwith\xa0data\xa0reported\xa0in\xa0the\xa0literature.[3]\xa0\n\xa0\n1‐(Pyrrolidin‐1‐yl)nonan‐1‐one\xa0(1f)\xa0\n\xa0 General\xa0 Procedure\xa0 A;\xa0 (Quant.).\xa0 All\xa0 analytical\xa0 data\xa0 were\xa0 in\xa0 good\xa0\naccordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.[3]\xa0\n\xa0\nMethyl\xa09‐oxo‐9‐(pyrrolidin‐1‐yl)nonanoate\xa0(1g)\xa0\n\xa0 General\xa0 Procedure\xa0 B;\xa0 (95%).\xa0 All\xa0 analytical\xa0 data\xa0 were\xa0 in\xa0 good\xa0\naccordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.[3]\xa0\n\xa0\n1‐(Pyrrolidin‐1‐yl)undecane‐1,10‐dione\xa0(1h)\xa0\n(48%).\xa0 Prepared\xa0 according\xa0 to\xa0 the\xa0 procedure\xa0 reported\xa0 in\xa0 
the\xa0\nliterature.\xa0All\xa0analytical\xa0data\xa0were\xa0in\xa0good\xa0accordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.[2]\xa0\n\xa0\n7‐Oxo‐7‐(pyrrolidin‐1‐yl)heptanenitrile\xa0(1i)\xa0\n(Quant.).\xa0Prepared\xa0according\xa0to\xa0the\xa0procedure\xa0reported\xa0in\xa0the\xa0literature.Error!\xa0\nBookmark\xa0not\xa0defined.\xa0All\xa0analytical\xa0data\xa0were\xa0in\xa0good\xa0accordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.\xa0\xa0\n\xa0\n S9\xa0\n\xa0\n1‐(Piperidin‐1‐yl)nonan‐1‐one\xa0(1j)\xa0\n\xa0 General\xa0 Procedure\xa0 A;\xa0 (Quant.).\xa0 All\xa0 analytical\xa0 data\xa0 were\xa0 in\xa0 good\xa0\naccordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.[4]\xa0\n\xa0\nN,N‐Dimethylnonanamide\xa0(1k)\xa0\n\xa0General\xa0Procedure\xa0A;\xa0(Quant.).\xa0All\xa0analytical\xa0data\xa0were\xa0in\xa0good\xa0accordance\xa0\nwith\xa0data\xa0reported\xa0in\xa0the\xa0literature.[5]\xa0\n\xa0\n1‐Morpholinononan‐1‐one\xa0(1l)\xa0\n\xa0 General\xa0 Procedure\xa0 A;\xa0 (Quant.).\xa0 All\xa0 analytical\xa0 data\xa0 were\xa0 in\xa0 good\xa0\naccordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.[6]\xa0\n\xa0\n2‐Phenyl‐1‐(pyrrolidin‐1‐yl)ethan‐1‐one\xa0(1m)\xa0\n\xa0General\xa0Procedure\xa0A;\xa0(Quant.).\xa0All\xa0analytical\xa0data\xa0were\xa0in\xa0good\xa0accordance\xa0with\xa0data\xa0\nreported\xa0in\xa0the\xa0literature.[2]\xa0\nN,N‐Diethylnonanamide\xa0(1n)\xa0\n\xa0 General\xa0 Procedure\xa0 A;\xa0 (Quant.).\xa0 All\xa0 analytical\xa0 data\xa0 were\xa0 in\xa0 good\xa0\naccordance\xa0with\xa0data\xa0reported\xa0in\xa0the\xa0literature.[7]\xa0\n\xa0\n\xa0\n\xa0\n\xa0\n S10\xa0\n\xa0\nN,N‐Diethyl‐3‐phenylpropanamide\xa0(1o)\xa0\n\xa0 General\xa0 Procedure\xa0 A;\xa0 (Quant.).\xa0 All\xa0 analytical\xa0 data\xa0 were\xa0 in\xa0 good\xa0 accordance\xa0 
with\xa0\ndata\xa0reported\xa0in\xa0the\xa0literature.[8]\xa0\n\xa0\nN‐Allyl‐N‐benzylbutyramide\xa0(1p)\xa0\n\xa0\xa0\nGeneral\xa0Procedure\xa0A;\xa0(80%).\xa01H\xa0NMR\xa0(400\xa0MHz,\xa0CDCl3):\xa0δ\xa07.38–7.14\xa0(m,\xa05H),\xa05.83–5.67\xa0(m,\xa01H),\xa05.23–\n5.03\xa0(m,\xa02H),\xa04.59\xa0(s,\xa01H),\xa04.51\xa0(s,\xa01H),\xa04.01\xa0(d,\xa0J\xa0=\xa05.8\xa0Hz,\xa01H),\xa03.82\xa0(d,\xa0J\xa0=\xa04.9\xa0Hz,\xa01H),\xa02.34\xa0(t,\xa0J\xa0=\xa07.6\xa0Hz,\xa0\n2H),\xa07.14–6.53\xa0(m,\xa02H),\xa00.97\xa0(t,\xa0J\xa0=\xa07.4\xa0Hz,\xa02H),\xa00.93\xa0(t,\xa0J\xa0=\xa07.36\xa0Hz,\xa01H);\xa013C\xa0NMR\xa0(100\xa0MHz,\xa0CDCl3):\xa0δ\xa0\n173.6,\xa0138.1,\xa0133.4,\xa0133.0,\xa0129.1,\xa0128.7,\xa0129.4,\xa0127.7,\xa0127.5,\xa0126.5,\xa0117.5,\xa0116.9,\xa050.3,\xa049.3,\xa048.3,\xa048.1,\xa0\n35.4,\xa035.2,\xa019.0,\xa014.2;\xa0IR\xa0(neat)\xa0νmax:\xa03029,\xa02961,\xa02930,\xa02873,\xa01639,\xa01414,\xa01210,\xa0920,\xa0732,\xa0698;\xa0HRMS\xa0\n(ESI+):\xa0exact\xa0mass\xa0calculated\xa0for\xa0[M+Na]+\xa0(C14H19NONa)+\xa0requires\xa0m/z\xa0240.1365,\xa0found\xa0m/z\xa0240.1337.\xa0\n\xa0\nN‐Benzyl‐6‐chloro‐N‐methylhexanamide\xa0(1q)\xa0\n\xa0\xa0\nGeneral\xa0Procedure\xa0A;\xa0(Quant.).\xa01H\xa0NMR\xa0(400\xa0MHz,\xa0CDCl3):\xa0δ\xa07.40–7.21\xa0(m,\xa04H),\xa07.18–7.14\xa0(m,\xa01H),\xa04.59\xa0\n(s,\xa02H),\xa03.56\xa0(t,\xa0J\xa0=\xa06.7\xa0Hz,\xa02H),\xa02.92\xa0(s,\xa03H),\xa02.42–2.36\xa0(m,\xa02H),\xa01.88–1.65\xa0(m,\xa04H),\xa01.60–1.41\xa0(m,\xa02H);\xa013C\xa0\nNMR\xa0(100\xa0MHz,\xa0CDCl3):\xa0δ\xa0172.9,\xa0137.7,\xa0129.1,\xa0128.7,\xa0128.2,\xa0126.4,\xa051.0\xa0(C‐8),\xa045.1,\xa034.9,\xa033.4,\xa032.6,\xa0\n26.9,\xa024.5;\xa0IR\xa0(neat)\xa0νmax:\xa02937,\xa02866,\xa01641,\xa01452,\xa01404,\xa01356,\xa0732;\xa0HRMS\xa0(ESI+):\xa0exact\xa0mass\xa0calculated\xa0\nfor\xa0[M+Na+]+\xa0(C14H20\n35ClNONa)\xa0requires\xa0m/z\xa0276.1126,\xa0found\xa0m/z\xa0276.1124.\xa0\n\xa0\n\xa0\n\xa0\n\xa0\n '
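The loop above relies on PyMuPDF's 0-based page indices together with Python's half-open `range`. As a small sketch of that bookkeeping, the hypothetical helper `page_indices` below (it is not part of PyMuPDF) converts an inclusive range of 1-based page numbers into the indices you would pass to `doc1[i]`.

```python
def page_indices(first_page, last_page):
    """Return 0-based indices for an inclusive range of 1-based page numbers."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # range() includes the start and excludes the stop, so the stop is last_page
    return list(range(first_page - 1, last_page))


# Pages 7 through 10 map to indices 6, 7, 8, 9
print(page_indices(7, 10))
```

With such a helper, the loop could be written as `for i in page_indices(7, 10):`, which keeps the page numbers in the code the same as the ones printed in the PDF.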
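
Text extracted this way contains characters that carry no meaning, such as non-breaking spaces ('\xa0') and newlines ('\n'). They can be removed with simple string operations.
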
[7]
result = result.replace('\xa0', '')  # Remove the non-breaking space character '\xa0'
result = result.replace('\n', '')  # Remove the newline character '\n'
result
'S74.2. CharacterizationFor simplicity, for amides present as rotameric mixtures in NMR spectra, only the major rotamer isdescribed.1‐(Pyrrolidin‐1‐yl)undec‐10‐en‐1‐one(1a)General Procedure A; (Quant.). All analytical data were in goodaccordancewithdatareportedintheliterature.[1]1‐(Pyrrolidin‐1‐yl)hex‐5‐en‐1‐one(1b)GeneralProcedureB;(87%).1H‐NMR(400MHz,CDCl3):δ5.86–5.74(m,1H),5.05–4.94(m,2H),3.45(t,J=6.9Hz,2H),3.39(t,J=6.9Hz,2H),2.45(t,J=7.6Hz,2H),2.15–2.08(m,2H),1.98–1.89(m,2H),1.88–1.80(m,2H),1.80–1.72(m,2H);13C‐NMR(100MHz,CDCl3):δ171.6,138.4,115.1,46.7,45.7,34.1,33.5,26.3, 24.6, 24.1; IR (neat) νmax: 2972, 2949, 2872, 1637, 1431, 1342, 911; HRMS (ESI+): exact masscalculatedfor[M+Na]+(C10H17NONa)requiresm/z190.1202,foundm/z190.1197.4‐Phenyl‐1‐(pyrrolidin‐1‐yl)butan‐1‐one(1c)GeneralProcedureA;(Quant.).Allanalyticaldatawereingoodaccordancewithdatareportedintheliterature.[2]1‐(Pyrrolidin‐1‐yl)hex‐5‐yn‐1‐one(1d)GeneralProcedureB;(93%).1H‐NMR(400MHz,CDCl3):δ3.44(dt,J=11.4,6.8Hz,4H),2.40(t,J=7.3Hz,2H),2.29(td,J=6.8,2.6Hz,2H),1.99–1.81(m,7H);13C‐NMR(100MHz,CDCl3):δ170.9,84.2,68.9,46.7,45.7,33.2,26.5,24.5,23.7,18.1;IR(neat) νmax:3224,2970,2951,2873,1630,1437,1343,633; S8HRMS (ESI+): exact mass calculated for [M+Na]+ (C10H15NONa) requires m/z 188.1046, found m/z188.1041.6‐Chloro‐1‐(pyrrolidin‐1‐yl)hexan‐1‐one(1e)GeneralProcedureA;(Quant.).Allanalyticaldatawereingoodaccordancewithdatareportedintheliterature.[3]1‐(Pyrrolidin‐1‐yl)nonan‐1‐one(1f) General Procedure A; (Quant.). All analytical data were in goodaccordancewithdatareportedintheliterature.[3]Methyl9‐oxo‐9‐(pyrrolidin‐1‐yl)nonanoate(1g) General Procedure B; (95%). All analytical data were in goodaccordancewithdatareportedintheliterature.[3]1‐(Pyrrolidin‐1‐yl)undecane‐1,10‐dione(1h)(48%). 
Prepared according to the procedure reported in theliterature.Allanalyticaldatawereingoodaccordancewithdatareportedintheliterature.[2]7‐Oxo‐7‐(pyrrolidin‐1‐yl)heptanenitrile(1i)(Quant.).Preparedaccordingtotheprocedurereportedintheliterature.Error!Bookmarknotdefined.Allanalyticaldatawereingoodaccordancewithdatareportedintheliterature. S91‐(Piperidin‐1‐yl)nonan‐1‐one(1j) General Procedure A; (Quant.). All analytical data were in goodaccordancewithdatareportedintheliterature.[4]N,N‐Dimethylnonanamide(1k)GeneralProcedureA;(Quant.).Allanalyticaldatawereingoodaccordancewithdatareportedintheliterature.[5]1‐Morpholinononan‐1‐one(1l) General Procedure A; (Quant.). All analytical data were in goodaccordancewithdatareportedintheliterature.[6]2‐Phenyl‐1‐(pyrrolidin‐1‐yl)ethan‐1‐one(1m)GeneralProcedureA;(Quant.).Allanalyticaldatawereingoodaccordancewithdatareportedintheliterature.[2]N,N‐Diethylnonanamide(1n) General Procedure A; (Quant.). All analytical data were in goodaccordancewithdatareportedintheliterature.[7] S10N,N‐Diethyl‐3‐phenylpropanamide(1o) General Procedure A; (Quant.). 
All analytical data were in good accordance withdatareportedintheliterature.[8]N‐Allyl‐N‐benzylbutyramide(1p)GeneralProcedureA;(80%).1HNMR(400MHz,CDCl3):δ7.38–7.14(m,5H),5.83–5.67(m,1H),5.23–5.03(m,2H),4.59(s,1H),4.51(s,1H),4.01(d,J=5.8Hz,1H),3.82(d,J=4.9Hz,1H),2.34(t,J=7.6Hz,2H),7.14–6.53(m,2H),0.97(t,J=7.4Hz,2H),0.93(t,J=7.36Hz,1H);13CNMR(100MHz,CDCl3):δ173.6,138.1,133.4,133.0,129.1,128.7,129.4,127.7,127.5,126.5,117.5,116.9,50.3,49.3,48.3,48.1,35.4,35.2,19.0,14.2;IR(neat)νmax:3029,2961,2930,2873,1639,1414,1210,920,732,698;HRMS(ESI+):exactmasscalculatedfor[M+Na]+(C14H19NONa)+requiresm/z240.1365,foundm/z240.1337.N‐Benzyl‐6‐chloro‐N‐methylhexanamide(1q)GeneralProcedureA;(Quant.).1HNMR(400MHz,CDCl3):δ7.40–7.21(m,4H),7.18–7.14(m,1H),4.59(s,2H),3.56(t,J=6.7Hz,2H),2.92(s,3H),2.42–2.36(m,2H),1.88–1.65(m,4H),1.60–1.41(m,2H);13CNMR(100MHz,CDCl3):δ172.9,137.7,129.1,128.7,128.2,126.4,51.0(C‐8),45.1,34.9,33.4,32.6,26.9,24.5;IR(neat)νmax:2937,2866,1641,1452,1404,1356,732;HRMS(ESI+):exactmasscalculatedfor[M+Na+]+(C14H2035ClNONa)requiresm/z276.1126,foundm/z276.1124. '

The task now is to extract the compound names from this text, for example 1‐(Pyrrolidin‐1‐yl)undec‐10‐en‐1‐one. Let's try it with chemdataextractor, which we covered earlier.

[8]
from chemdataextractor.doc import Paragraph  # Import Paragraph
paragraph = Paragraph(result)
paragraph.sentences  # Split the paragraph into sentences
Output hidden
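
Look at the result above: the splitter generally treats the English period '.' as a sentence delimiter and starts a new sentence when it meets one. Note that decimal points inside numbers are not treated as delimiters. Let's pull out the fourth sentence and look at it on its own.
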
[9]
sentence4 = paragraph.sentences[3]
sentence4
[1]1‐(Pyrrolidin‐1‐yl)hex‐5‐en‐1‐one(1b)GeneralProcedureB;(87%).1H‐NMR(400MHz,CDCl3):δ5.86–5.74(m,1H),5.05–4.94(m,2H),3.45(t,J=6.9Hz,2H),3.39(t,J=6.9Hz,2H),2.45(t,J=7.6Hz,2H),2.15–2.08(m,2H),1.98–1.89(m,2H),1.88–1.80(m,2H),1.80–1.72(m,2H);13C‐NMR(100MHz,CDCl3):δ171.6,138.4,115.1,46.7,45.7,34.1,33.5,26.3, 24.6, 24.1; IR (neat) νmax: 2972, 2949, 2872, 1637, 1431, 1342, 911; HRMS (ESI+): exact masscalculatedfor[M+Na]+(C10H17NONa)requiresm/z190.1202,foundm/z190.1197.4‐Phenyl‐1‐(pyrrolidin‐1‐yl)butan‐1‐one(1c)GeneralProcedureA;(Quant.).Allanalyticaldatawereingoodaccordancewithdatareportedintheliterature.
[12]
sentence4.ner_tags  # Chemical named entity recognition (CNER) tags
Output hidden
[13]
# Merge the CNER (chemical named entity recognition) results with the POS (part-of-speech) tags
def get_tagged_pos_chem_tokens(sentence):
    _tagged_pos_chem_tokens = sentence.pos_tagged_tokens.copy()
    for i, chem_tag in enumerate(sentence.ner_tags):
        if chem_tag is not None:
            # Keep the token text, but replace its POS tag with the chemical tag
            _tagged_pos_chem_tokens[i] = (_tagged_pos_chem_tokens[i][0], chem_tag)
    return _tagged_pos_chem_tokens

tagged_pos_chem_tokens = get_tagged_pos_chem_tokens(sentence4)
tagged_pos_chem_tokens
Output hidden
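To turn merged tags like these into compound names, consecutive chemical tags can be grouped into one entity. The sketch below assumes the tagger marks the first token of a chemical mention as `B-CM` and each continuation token as `I-CM`. The helper `collect_chem_names` and the sample token list are hypothetical and only illustrate the grouping, so check the tags in your own output before relying on it.

```python
def collect_chem_names(tagged_tokens):
    """Group consecutive B-CM/I-CM tokens into chemical entity strings."""
    names, current = [], []
    for token, tag in tagged_tokens:
        if tag == "B-CM":  # start of a new chemical mention
            if current:
                names.append(" ".join(current))
            current = [token]
        elif tag == "I-CM" and current:  # continuation of the current mention
            current.append(token)
        else:  # any other tag closes the current mention
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hypothetical (token, tag) pairs in the shape returned by get_tagged_pos_chem_tokens
sample = [("Dissolve", "VB"), ("sodium", "B-CM"), ("chloride", "I-CM"),
          ("in", "IN"), ("CDCl3", "B-CM"), (".", ".")]
print(collect_chem_names(sample))
```

Joining tokens with a single space is a simplification; names that the tokenizer split at hyphens or brackets would need their original spacing restored.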

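Looking at the result above, the chemical named entities, that is, the compound names we are after, have indeed been located. However, many spaces are missing from the extracted text (the cleanup in cell [7] deleted every '\xa0' and '\n' instead of replacing them with a space), so several words run together and make extraction harder. Next, let's see whether another approach can keep the text as close to the source PDF as possible. You are also welcome to start directly from the output of cell [3] and see whether string operations alone can reach the same goal.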
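One possible starting point for that exercise is to replace whitespace instead of deleting it. The sketch below is an assumption about what a gentler cleanup could look like, not the method used in this notebook; `normalize_whitespace` is a hypothetical helper built on the standard `re` module.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace, including non-breaking spaces, into single spaces."""
    # For str patterns, \s matches Unicode whitespace, which covers '\xa0' and '\n'
    return re.sub(r"\s+", " ", text).strip()


sample = "General\xa0 Procedure\xa0 A;\xa0 (Quant.).\xa0 All\xa0 analytical\xa0 data\xa0\nwere\xa0in\xa0good\xa0accordance"
print(normalize_whitespace(sample))
```

This keeps word boundaries intact, at the cost of leaving a space wherever a value was wrapped across lines in the PDF, so such breaks may still need separate handling.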

Now let's try another PyMuPDF operation to extract all of the text in a PDF, and even its images.

Text extraction, method 2:

[14]
# Convert an entire PDF into a text file (e.g. a .txt file)
doc2 = pymupdf.open("/bohr/PyMuPdf-9q7v/v1/ja7b02983_si_001.pdf")  # Open a PDF document
out = open("/personal/s41586-024-08173-7.txt", "wb")  # Create a text file to hold the result
for page in doc2:  # Iterate over every page of the PDF document
    text = page.get_text().encode("utf8")  # Extract the text (encoded as UTF-8)
    out.write(text)  # Write the page's content to the text file
    out.write(bytes((12,)))  # Write a page separator (form feed, 0x0C)
out.close()
[16]
# Let's look at the result by reading back the text file written above
with open('/personal/s41586-024-08173-7.txt', 'r', encoding='utf8') as f:
    content = f.readlines()
content
Output hidden
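Because a form feed was written after every page, the text file can also be split back into pages rather than lines. The following is a minimal sketch under that assumption; `split_pages` is a hypothetical helper, and the sample string stands in for the real file contents.

```python
def split_pages(text):
    """Split text that has a form feed after each page into per-page strings."""
    pages = text.split("\f")  # "\f" is the form feed character, 0x0C
    # A separator follows the last page too, so drop the empty trailing piece
    if pages and pages[-1] == "":
        pages.pop()
    return pages


sample = "S7 first page text\fS8 second page text\f"
for number, page_text in enumerate(split_pages(sample), start=1):
    print(number, repr(page_text))
```

To apply it to the file written above, read the whole file with `f.read()` instead of `f.readlines()` and pass the resulting string to `split_pages`.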

With this text file in hand, later text processing becomes much easier. Next, let's see how to extract the images from a PDF document.

Extracting images from a PDF document

[20]
doc2 = pymupdf.open("/bohr/PyMuPdf-9q7v/v1/s41586-024-08173-7.pdf")  # Open a PDF document

for page_index in range(len(doc2)):  # Iterate over every page
    page = doc2[page_index]  # Get the page
    image_list = page.get_images()

    # Report how many images are on each page
    if image_list:
        print(f"Found {len(image_list)} images on page {page_index}")
    else:
        print("No images found on page", page_index)

    for image_index, img in enumerate(image_list, start=1):  # Enumerate the image list
        xref = img[0]  # Get the XREF of the image
        pix = pymupdf.Pixmap(doc2, xref)  # Create a Pixmap

        if pix.n - pix.alpha > 3:  # CMYK: convert to RGB first
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

        pix.save("/personal/page_%s-image_%s.png" % (page_index, image_index))  # Save the image as PNG
        pix = None  # Release the Pixmap
Found 2 images on page 0
No images found on page 1
No images found on page 2
No images found on page 3
No images found on page 4
Found 141 images on page 5
Found 1 images on page 6
No images found on page 7
No images found on page 8
No images found on page 9
Found 1 images on page 10
Found 1 images on page 11
Found 1 images on page 12
Found 1 images on page 13
Found 1 images on page 14
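Page 5 reports 141 images, and it may be that many of them are small graphical fragments rather than full figures (an assumption; the output above does not say). Each entry returned by `page.get_images()` is a tuple whose first items are the xref, the soft-mask xref, the width, and the height, so small images could be skipped before saving. The helper `filter_large_images`, its thresholds, and the sample entries below are hypothetical.

```python
def filter_large_images(image_list, min_width=100, min_height=100):
    """Keep image entries whose pixel dimensions meet both thresholds."""
    kept = []
    for img in image_list:
        width, height = img[2], img[3]  # items 2 and 3 hold width and height
        if width >= min_width and height >= min_height:
            kept.append(img)
    return kept


# Made-up entries in the (xref, smask, width, height, ...) layout
sample = [(12, 0, 16, 16), (13, 0, 1200, 800), (14, 0, 640, 40)]
print([img[0] for img in filter_large_images(sample)])
```

In the extraction loop, iterating over `filter_large_images(image_list)` instead of `image_list` would save only the larger images; suitable thresholds depend on the document.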
[25]
import matplotlib.pyplot as plt  # plt is used to display images
import matplotlib.image as mpimg  # mpimg is used to read images

img1 = mpimg.imread('/personal/page_10-image_1.png')  # Read the image
img2 = mpimg.imread('/personal/page_12-image_1.png')

# Show the result
plt.imshow(img1)
<matplotlib.image.AxesImage at 0x7f1785be1ca0>

Closing remarks

That wraps up this introduction to PyMuPDF. If you are interested, visit the official page linked above to discover more features.