Table Detection and Extraction using OpenCV and Novel Optimization Methods

Nidhi, Karandeep Saluja, Asmita Mahajan, Akash Jadhav, Nakul Aggarwal, Dharmendra Chaurasia, Debasmita Ghosh
DOI: https://doi.org/10.1109/compe53109.2021.9752204
2021-12-01
Abstract: Information retrieval from scanned documents and images is a challenging task because templates and document structures vary widely. The amount of data is increasing exponentially, leading to a rise in scanned and digital documents such as PDFs, resumes, invoices, mark-sheets, medical reports, Demat receipts, etc. One of the major challenges is extracting information from tabular structures, which can be uniform or non-uniform in nature, meaning an unequal number of rows and columns. Many approaches have been proposed for extracting tables from digitized PDF documents, but none of the existing approaches offers a unified way to extract all tables from any kind of scanned document. In this paper, the focus is on extracting information from tables with various layouts, maintaining their structure, and processing them from images into textual data. Our algorithm achieved an average accuracy of 95% when tested on more than 100 images, outperforming many state-of-the-art models.