Abstract: Objective: Medical laboratory testing is essential in healthcare, providing crucial data for diagnosis and treatment. Nevertheless, patients' lab testing results are often transferred between healthcare organizations by fax and are not immediately available for timely clinical decision-making. It is therefore important to develop new technologies that accurately extract lab testing information from scanned laboratory reports. This study aims to develop an advanced deep learning-based Optical Character Recognition (OCR) method to identify tables containing lab testing results in scanned laboratory reports. Methods: Extracting tabular data from scanned lab reports involves two stages: table detection (i.e., identifying the area of a table object) and table recognition (i.e., identifying and extracting tabular structures and contents). We evaluated DETR R18 and YOLOv8s for table detection, and compared PaddleOCR with the encoder-dual-decoder (EDD) model for table recognition. We annotated 650 tables from 632 randomly selected laboratory test reports and used them to train and evaluate the models. For table detection, we used Average Precision (AP), Average Recall (AR), AP50, and AP75 as evaluation metrics. For table recognition, we used Tree-Edit-Distance-based Similarity (TEDS). Results: For table detection, fine-tuned DETR R18 performed best (AP50: 0.774; AP75: 0.644; AP: 0.601; AR: 0.766). For table recognition, fine-tuned EDD outperformed the other models with a TEDS score of 0.815. The proposed OCR pipeline (fine-tuned DETR R18 followed by fine-tuned EDD) achieved a TEDS score of 0.699 and a TEDS structure score of 0.764. Conclusions: Our study presents a dedicated OCR pipeline for scanned clinical documents that uses state-of-the-art deep learning models for region-of-interest detection and table recognition. The high TEDS scores demonstrate the effectiveness of our approach, which has significant implications for clinical data analysis and decision-making.
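
The two families of metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common COCO-style convention in which AP50 and AP75 count a predicted table box as correct when its intersection-over-union (IoU) with the annotated box reaches 0.50 or 0.75, and the usual TEDS definition of one minus the tree-edit distance divided by the size of the larger tree. The abstract does not state which evaluation tooling was used, so the function names, box format, and example values below are hypothetical and for illustration only.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float) -> bool:
    """A detection counts as correct if IoU meets the threshold
    (0.50 for AP50, 0.75 for AP75 under the assumed convention)."""
    return iou(pred, truth) >= threshold


def teds_from_distance(edit_distance: float, n_nodes_a: int, n_nodes_b: int) -> float:
    """Normalize a precomputed tree-edit distance into a similarity in [0, 1].

    Computing the distance itself between two HTML table trees needs a
    tree-edit-distance algorithm, which is outside this sketch.
    """
    denom = max(n_nodes_a, n_nodes_b)
    return 1.0 - edit_distance / denom if denom > 0 else 1.0


# Hypothetical example values, not taken from the study.
truth_box = (0.0, 0.0, 100.0, 100.0)
loose_pred = (0.0, 0.0, 100.0, 60.0)   # covers 60% of the annotated table
tight_pred = (0.0, 0.0, 100.0, 80.0)   # covers 80% of the annotated table
```

In this sketch the looser prediction would count toward AP50 but not AP75, while the tighter one counts toward both, which is why AP75 is never higher than AP50 for the same detector.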

Table Detection and Extraction using OpenCV and Novel Optimization Methods

An OpenCV-based Framework for Table Information Extraction

Table Structure Recognition using Top-Down and Bottom-Up Cues

A Conglomerate of Multiple OCR Table Detection and Extraction

TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images

Flexible Hybrid Table Recognition and Semantic Interpretation System

TableZa -- A classical Computer Vision approach to Tabular Extraction

HybridTabNet: Towards Better Table Detection in Scanned Document Images

TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content

Table of Contents Recognition in OCR Documents using Image-based Machine Learning

Table Structure Extraction with Bi-directional Gated Recurrent Unit Networks

PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction

TableLab: An Interactive Table Extraction System with Adaptive Deep Learning

Table Detection in the Wild: A Novel Diverse Table Detection Dataset and Method

TableDet: An end-to-end deep learning approach for table detection and table image classification in data sheet images

On methods and tools of table detection, extraction and annotation in PDF documents

Page Layout Analysis for Refining Table Extraction from PDF Documents

Improving tabular data extraction in scanned laboratory reports using deep learning models

Current Status and Performance Analysis of Table Recognition in Document Images with Deep Neural Networks

TableFormer: Table Structure Understanding with Transformers