Extracting Text from Table Images


Recommended image: Basic Image:ubuntu:22.04-py3.10-pytorch2.0
Recommended machine type: c2_m4_cpu
©️ Copyright 2023 @ Authors
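Author: 杨舒文
Date: 2023-10-18
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: Click the 开始连接 (Start Connection) button above, then select the ubuntu:22.04-py3.10-pytorch2.0 image and any machine configuration to get started.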
Table Analyst
This notebook extracts the text of each cell from an image of a table.
Environment Setup
Install the CPU build of PaddlePaddle, together with PaddleOCR.
On a GPU machine, you can try installing the GPU build of PaddlePaddle instead:
!python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple
[1]
!python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
!python3 -m pip install "paddleocr>=2.0.1"
!python3 -m pip install opencv-python-headless
Output (abridged):
Successfully installed Pillow-10.1.0 astor-0.8.1 h11-0.14.0 httpcore-0.18.0 httpx-0.25.0 opt-einsum-3.3.0 paddle-bfloat-0.1.7 paddlepaddle-2.5.1 protobuf-4.24.4
Successfully installed ... opencv-contrib-python-4.6.0.66 opencv-python-4.6.0.66 ... paddleocr-2.7.0.3 ...
Successfully installed opencv-python-headless-4.8.1.78
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2]
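import numpy as np
import logging
from typing import Tuple, List
from paddleocr import PaddleOCR

# One (x, y) corner, the four corners of a text box, and one OCR hit: (box, (text, confidence)).
POSITION_TYPE = Tuple[float, float]
RECTANGLE_TYPE = Tuple[POSITION_TYPE, POSITION_TYPE, POSITION_TYPE, POSITION_TYPE]
PADDLE_RESULT_TYPE = Tuple[RECTANGLE_TYPE, Tuple[str, float]]

# Chinese model (lang="ch") with the text-angle classifier enabled, running on CPU.
ocr = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=False)

# Remove the handlers of the 'ppocr' logger to quiet PaddleOCR's log output.
ocr_logger = logging.getLogger('ppocr')
ocr_logger.handlers.clear()


def paddleocr_inference(model: PaddleOCR, rgb_image: np.ndarray) -> List[PADDLE_RESULT_TYPE]:
    # model.ocr() returns one entry per input image; take the entry for our single image.
    # The entry may be None or empty when no text is detected, so callers should check it.
    result = model.ocr(rgb_image)[0]
    return result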
Output (abridged):
download https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar to /root/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer/ch_PP-OCRv4_det_infer.tar
download https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_rec_infer.tar to /root/.paddleocr/whl/rec/ch/ch_PP-OCRv4_rec_infer/ch_PP-OCRv4_rec_infer.tar
download https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar to /root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer/ch_ppocr_mobile_v2.0_cls_infer.tar
[2023/10/18 15:50:17] ppocr DEBUG: Namespace(..., use_gpu=False, ..., det_algorithm='DB', ..., rec_algorithm='SVTR_LCNet', ..., use_angle_cls=True, ..., lang='ch', det=True, rec=True, type='ocr', ocr_version='PP-OCRv4', structure_version='PP-StructureV2')
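To make the shape described by PADDLE_RESULT_TYPE concrete, here is a minimal sketch that runs without PaddleOCR. The sample values and the helper join_text (including its min_score parameter) are hypothetical and only illustrate how a list of (box, (text, confidence)) entries, as returned by paddleocr_inference, can be reduced to a string. The exact container types (lists versus tuples) may differ between PaddleOCR versions.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, Point, Point, Point]
OcrLine = Tuple[Box, Tuple[str, float]]

# Hypothetical sample shaped like the OCR result for one image
# (coordinates, texts and scores are invented for illustration).
sample_result: List[OcrLine] = [
    (((5.0, 4.0), (60.0, 4.0), (60.0, 20.0), (5.0, 20.0)), ("HPLC", 0.98)),
    (((65.0, 4.0), (120.0, 4.0), (120.0, 20.0), (65.0, 20.0)), ("Ret", 0.42)),
    (((125.0, 4.0), (200.0, 4.0), (200.0, 20.0), (125.0, 20.0)), ("Time", 0.95)),
]


def join_text(result: Optional[List[OcrLine]], min_score: float = 0.0) -> str:
    """Join the recognised strings, skipping hits below min_score."""
    if not result:  # None or empty: nothing was detected
        return ""
    return " ".join(text for _box, (text, score) in result if score >= min_score)


print(join_text(sample_result))                 # all three hits
print(join_text(sample_result, min_score=0.5))  # drops the low-confidence hit
```

Filtering on the confidence score is optional; the notebook itself keeps every hit that PaddleOCR returns.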
Building the Table Object
Wrap the image in a Table object, which makes the cell information easier to process and extract.
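The core idea behind the Table object is a grid: once the x positions of the vertical rulings and the y positions of the horizontal rulings are known, each cell is the rectangle between two consecutive lines, shrunk by half a line width on every side. The sketch below shows that idea in plain Python. The function name cell_boxes and its parameters are hypothetical; the notebook's own split_boxes_xyxy method computes the same quantities with NumPy tiling.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cell_boxes(
    v_lines_x: Sequence[float],
    h_lines_y: Sequence[float],
    line_width_x: float = 0.0,
    line_width_y: float = 0.0,
) -> List[List[Box]]:
    """Return boxes[i][j] for the cell in column i and row j."""
    boxes: List[List[Box]] = []
    for i in range(len(v_lines_x) - 1):
        column: List[Box] = []
        for j in range(len(h_lines_y) - 1):
            column.append((
                v_lines_x[i] + line_width_x / 2,      # left edge, inside the ruling
                h_lines_y[j] + line_width_y / 2,      # top edge
                v_lines_x[i + 1] - line_width_x / 2,  # right edge
                h_lines_y[j + 1] - line_width_y / 2,  # bottom edge
            ))
        boxes.append(column)
    return boxes


# Three vertical and two horizontal rulings give a 2 x 1 grid of cells.
boxes = cell_boxes([0, 100, 200], [0, 50], line_width_x=2, line_width_y=2)
print(boxes[0][0])  # (1.0, 1.0, 99.0, 49.0)
print(boxes[1][0])  # (101.0, 1.0, 199.0, 49.0)
```

Shrinking each box keeps the ruling itself out of the crop that is later passed to OCR.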
[3]
import cv2
import tqdm
from PIL import Image


def sort_and_merge_close(points: List[float], tolerance: float) -> List[float]:
    # Sort the coordinates, group each point with the previous one when they are
    # within `tolerance`, and return the mean of every group.
    points = sorted(points)
    group_points: List[List[float]] = [[points[0]]]
    for p in points[1:]:
        if p - group_points[-1][-1] > tolerance:
            group_points.append([p])
        else:
            group_points[-1].append(p)
    return [sum(pts) / len(pts) for pts in group_points]


class Table:
    # Ratios are relative to the image width or height.
    MIN_LINE_LENGTH_RATIO = 0.5
    MAX_LINE_WIDTH_RATIO = 0.01
    MAX_LINE_GAP_RATIO = 0.01
    # Morphology kernels span 1/PRECISION of the image width or height.
    PRECISION = 20

    def __init__(self, image: Image.Image):
        self.image = image
        self.rgb_image: np.ndarray = np.asarray(image)
        self.gray_image: np.ndarray = cv2.cvtColor(self.rgb_image, cv2.COLOR_RGB2GRAY)
        self.width, self.height = self.image.size
        # debug=True also plots the extracted row and column line masks.
        self.h_lines_y, self.v_lines_x = self.detect_line(debug=True)
        # Number of cells along x (columns) and along y (rows).
        self.len_x = len(self.v_lines_x) - 1
        self.len_y = len(self.h_lines_y) - 1
        self.xy_boxes_xyxy = self.split_boxes_xyxy()

    def detect_line(self, debug=False) -> Tuple[np.ndarray, np.ndarray]:
        def find_lines(im: np.ndarray, scale: float):
            # Probabilistic Hough transform; segments shorter than
            # MIN_LINE_LENGTH_RATIO * scale pixels are rejected.
            lines = cv2.HoughLinesP(
                im, 1, np.pi / 180, 100,
                minLineLength=self.MIN_LINE_LENGTH_RATIO * scale,
                maxLineGap=self.MAX_LINE_GAP_RATIO * scale
            )
            if lines is None:
                raise ValueError("No lines found.")
            print(len(lines), "lines found")
            return lines

        width, height = self.width, self.height
        # Invert so that dark rulings and text become white foreground.
        _, thresh_image = cv2.threshold(self.gray_image, 127, 255, cv2.THRESH_BINARY_INV)

        # horizontal lines: thicken vertically, erode with a wide kernel so that only
        # long horizontal runs survive, then dilate to restore them
        kernel1 = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 5))
        kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (int(width) // self.PRECISION, 1))
        kernel3 = cv2.getStructuringElement(cv2.MORPH_RECT, (int(width) // self.PRECISION, 7))
        dilated = cv2.dilate(thresh_image, kernel1, iterations=1)
        eroded = cv2.erode(dilated, kernel2, iterations=1)
        row_lines = cv2.dilate(eroded, kernel3, iterations=1)

        # vertical lines: the same steps with the kernels transposed
        kernel1 = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 1))
        kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(height) // self.PRECISION))
        kernel3 = cv2.getStructuringElement(cv2.MORPH_RECT, (7, int(height) // self.PRECISION))
        dilated = cv2.dilate(thresh_image, kernel1, iterations=1)
        eroded = cv2.erode(dilated, kernel2, iterations=1)
        col_lines = cv2.dilate(eroded, kernel3, iterations=1)

        if debug:
            import matplotlib.pyplot as plt
            plt.figure(figsize=(12, 10))
            plt.imshow(row_lines)
            plt.figure(figsize=(12, 10))
            plt.imshow(col_lines)

        h_lines = find_lines(row_lines, width)
        h_lines_y = []
        for line in h_lines:
            x1, y1, x2, y2 = line[0]
            # Keep only near-horizontal segments.
            if abs(y2 - y1) <= self.MAX_LINE_WIDTH_RATIO * width * 5:
                h_lines_y.append((y1 + y2) / 2)

        v_lines = find_lines(col_lines, height)
        v_lines_x = []
        for line in v_lines:
            x1, y1, x2, y2 = line[0]
            # Keep only near-vertical segments.
            if abs(x2 - x1) <= self.MAX_LINE_WIDTH_RATIO * height * 5:
                v_lines_x.append((x1 + x2) / 2)

        if len(h_lines_y) == 0 or len(v_lines_x) == 0:
            raise ValueError("No vertical/horizontal lines found.")

        # Several Hough segments usually belong to one ruling; merge them.
        h_lines_y = sort_and_merge_close(h_lines_y, self.MAX_LINE_WIDTH_RATIO * height)
        v_lines_x = sort_and_merge_close(v_lines_x, self.MAX_LINE_WIDTH_RATIO * width)
        h_lines_y, v_lines_x = np.array(h_lines_y, dtype=float), np.array(v_lines_x, dtype=float)
        return h_lines_y, v_lines_x

    def split_boxes_xyxy(self) -> np.ndarray:
        width, height = self.width, self.height
        h_line_width = height * self.MAX_LINE_WIDTH_RATIO
        v_line_width = width * self.MAX_LINE_WIDTH_RATIO
        # xy_boxes_xyxy[i, j] = (x1, y1, x2, y2) of the cell in column i and row j,
        # shrunk by half a line width on each side to exclude the rulings.
        xy_boxes_xyxy = np.zeros([self.len_x, self.len_y, 4], dtype=float)
        xy_boxes_xyxy[:, :, 0] = np.tile(self.v_lines_x[:-1], (self.len_y, 1)).T + v_line_width / 2
        xy_boxes_xyxy[:, :, 1] = np.tile(self.h_lines_y[:-1], (self.len_x, 1)) + h_line_width / 2
        xy_boxes_xyxy[:, :, 2] = np.tile(self.v_lines_x[1:], (self.len_y, 1)).T - v_line_width / 2
        xy_boxes_xyxy[:, :, 3] = np.tile(self.h_lines_y[1:], (self.len_x, 1)) - h_line_width / 2
        return xy_boxes_xyxy

    def read_text(self, ocr_model: PaddleOCR, use_tqdm=False) -> List[List[str]]:
        # xy_text[i][j] holds the text of the cell in column i and row j.
        xy_text = [["" for _ in range(self.len_y)] for _ in range(self.len_x)]
        cells = [(i, j) for i in range(self.len_x) for j in range(self.len_y)]
        if use_tqdm:
            cells = tqdm.tqdm(cells)
        for bi, bj in cells:
            x1, y1, x2, y2 = [int(f) for f in self.xy_boxes_xyxy[bi, bj, :]]
            # Run OCR on the crop of each cell separately.
            result = paddleocr_inference(ocr_model, self.rgb_image[y1: y2, x1: x2, :])
            if result:
                tuples_text_score = [r[1] for r in result if r]
                xy_text[bi][bj] = " ".join([text for text, _score in tuples_text_score])
            else:
                xy_text[bi][bj] = ""
        return xy_text
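A single ruling is often reported by the Hough transform as several nearly coincident segments, which is why detect_line passes its coordinates through sort_and_merge_close. The worked example below repeats that function so that it runs on its own; the input coordinates are invented for illustration.

```python
from typing import List


def sort_and_merge_close(points: List[float], tolerance: float) -> List[float]:
    """Group sorted points whose gap to the previous point is within tolerance."""
    points = sorted(points)
    group_points: List[List[float]] = [[points[0]]]
    for p in points[1:]:
        if p - group_points[-1][-1] > tolerance:
            group_points.append([p])    # gap too large: start a new group
        else:
            group_points[-1].append(p)  # close enough: same ruling
    return [sum(pts) / len(pts) for pts in group_points]


# Hypothetical y coordinates of detected horizontal segments, unsorted.
raw_y = [52.0, 10.0, 11.0, 100.0, 50.0]
merged = sort_and_merge_close(raw_y, tolerance=5.0)
print(merged)  # [10.5, 51.0, 100.0]

# Each point is compared with the previous point, not with the group mean,
# so a chain of small gaps collapses into one group.
print(sort_and_merge_close([0.0, 4.0, 8.0, 12.0], tolerance=5.0))  # [6.0]
```

The chaining behaviour in the second call is worth keeping in mind when rows or columns are narrower than the tolerance, since neighbouring rulings could then be merged into one.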
Running
[4]
# load local image
# image_path = "../data/tab/25.png"
# image = Image.open(image_path)
# load online image
import requests
from io import BytesIO
image_url = "https://imageupload.io/ib/oNrba8P4J71yX8q_1697595446.png"
image = Image.open(BytesIO(requests.get(image_url).content))
[5]
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 10))
plt.imshow(image)
table = Table(image)



[6]
texts = table.read_text(ocr, use_tqdm=True)
print(texts)
100%|██████████| 76/76 [00:17<00:00, 4.28it/s]
[['XT', '16', '17', '18', '', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '1.'], ['Compound', '', '', '', 'UV10-', '[pyrid-3-yl)pyrimidine', 'dazo pyrid-3-y1)pyrimidin6', '-Acetylanilino) 2alpyrid-3-y1jpyrihiaine', '-Ethylanilino)', '', '', 'pyrid-3-y1)pyrimidine', '2alpyrid-3-yl)pyrimidine', 'pyrimidine', 'Miesylan1lino)-4-(1m -5- -ylpyliiidiie', 'pyrid-3-yl0pyrimidine', 'Methvlani11no)-4-(im1', 'e', 'pyrid-3-ylpyrimidine'], ['HPLCRet Time (mins)', '', '7.26', '8.30', '7.70', '7.39', '7.98', '7.13', '8.11', '7.47', '8.15', '7.02', '8.65', '', '6.84', '7.89', '7.65', '', ''], ['M/z [MH]*', '288', '306', '368', '306', '318', '334', '330', '316', '306', '322', '318', '394', '443', '366', '334', '302', '367', '445']]
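Because read_text fills its result as xy_text[x][y], each inner list above is one column of the table read from top to bottom (4 columns of 19 cells here). A possible follow-up step, sketched below with the standard library only, is to transpose the columns into rows and serialise them as CSV. The helper names columns_to_rows and to_csv are hypothetical, and the sample is a small excerpt of the output above.

```python
import csv
import io
from typing import List


def columns_to_rows(xy_text: List[List[str]]) -> List[List[str]]:
    """Transpose text indexed as [x][y] into a list of table rows."""
    return [list(row) for row in zip(*xy_text)]


def to_csv(rows: List[List[str]]) -> str:
    """Serialise rows as CSV text; the csv module quotes fields when needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Excerpt of the first and last columns from the output above.
sample = [["XT", "16", "17"], ["M/z [MH]*", "288", "306"]]
rows = columns_to_rows(sample)
print(rows)  # [['XT', 'M/z [MH]*'], ['16', '288'], ['17', '306']]
print(to_csv(rows))
```

Note that zip stops at the shortest column, which is harmless here because read_text always produces columns of equal length.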