Personalized Content Extraction and Text Classification Using Effective Web Scraping Techniques

Karthikeyan T., Karthik Sekaran, Ranjith D., Vinoth kumar, Balajee J M, Vinoth Kumar V.
DOI: https://doi.org/10.4018/ijwp.2019070103
2019-07-01
International Journal of Web Portals
Abstract: Web scraping is a technique for automatically extracting information from web documents. It retrieves content relevant to a query, aggregates it, and transforms it from an unstructured format into a structured representation. Text classification is a vital step in summarizing the data and categorizing the webpages adequately. In this article, data is first extracted from websites using effective web scraping methodologies and then transformed into a structured form. Based on keywords from the data, the documents are classified and labeled. A recursive feature elimination technique is applied to the data to select the best candidate feature subset. The final dataset is used to train standard machine learning algorithms. The proposed model performs well at classifying documents from the extracted data, with an improved accuracy rate.
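The first stages described in the abstract (extraction, conversion to a structured form, and keyword-based labeling) can be illustrated with a minimal sketch. The abstract does not name any libraries, page sources, or keyword lists, so everything here is an assumption made for illustration: the `TextExtractor` class, the `to_record` and `label` helpers, the `KEYWORDS` table, and the two inline HTML pages are all hypothetical. Only the Python standard library is used, and the recursive feature elimination and model training steps are not shown.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the page title and visible text, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip = 0          # depth inside <script>/<style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def to_record(html):
    """Turn unstructured HTML into a structured record (hypothetical schema)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    body = " ".join(parser.chunks)
    tokens = re.findall(r"[a-z]+", body.lower())
    return {"title": parser.title, "text": body, "tokens": tokens}


# Hypothetical keyword lists; the paper's actual categories are not given.
KEYWORDS = {
    "sports": {"match", "team", "score", "league"},
    "technology": {"software", "processor", "network", "device"},
}


def label(record):
    """Assign the category whose keywords occur most often in the record."""
    counts = Counter(record["tokens"])
    scores = {cat: sum(counts[w] for w in words) for cat, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unlabeled"


# Inline pages stand in for fetched documents so the sketch runs offline.
pages = [
    "<html><head><title>Cup final</title><style>p{color:red}</style></head>"
    "<body><p>The team won the match with a late score.</p></body></html>",
    "<html><head><title>New chip</title></head>"
    "<body><script>var network = 1;</script>"
    "<p>The processor powers a new device.</p></body></html>",
]
records = [to_record(p) for p in pages]
labels = [label(r) for r in records]
```

In a fuller pipeline, the labeled records would be vectorized and passed to a feature selection step such as recursive feature elimination before a classifier is trained; those stages depend on choices the abstract does not specify.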