A Web Information Extraction Framework with Adaptive and Failure Prediction Feature

Sudhir Kumar Patnaik, C. Narendra Babu
DOI: https://doi.org/10.1145/3495008
2022-06-30
Journal of Data and Information Quality
Abstract: The amount of information available on the internet today requires effective information extraction and processing to offer hyper-personalized user experiences. Traditional and machine learning techniques often fail to extract information when website layouts change dynamically, which poses a significant challenge to the technical community in keeping up with such changes. Existing machine learning-based information extraction frameworks focus only on core extraction logic that is susceptible to website changes, and thus miss core features such as proactive failure prediction and intelligent information extraction. The aim of this article is to build a robust and intelligent information extraction framework that not only proactively predicts website failure but also automatically extracts information using deep-learning techniques, namely You Only Look Once (YOLO) and Long Short-Term Memory (LSTM) networks. The LSTM-based proactive detection identifies the new location of the web page caused by layout changes and enables automatic extraction of information from the new web page. A real-world case with a retail website for intelligent information extraction and an offline experimentation environment are set up to demonstrate proactive failure prediction and automatic extraction, resulting in high failure-prediction performance and high precision and recall for object detection and information extraction.