Abstract: Web mining is an emerging research area owing to the rapid growth of websites. It is classified into Web Content Mining (WCM), Web Usage Mining and Web Structure Mining. WCM is the extraction of required information from the content of web pages available on the World Wide Web (WWW). WCM is further divided into two categories: the first mines the content of documents directly, and the second mines content through a search engine. The mining method focuses on information extraction and integration. Web content may be text, image, audio or video. Web pages typically contain a large amount of information that is not part of their main content, such as banner advertisements, navigation bars and copyright notices. Such noise usually leads to poor results in Web mining. This paper addresses the problem of noise-free information retrieval on web pages, that is, the automatic pre-processing of Web pages to detect and eliminate noise. It proposes an approach for eliminating noise from web pages in order to improve the accuracy and efficiency of web content mining. The main objective of removing noise from a Web page is to improve search performance, so it is essential to differentiate important information from noisy content that may misguide users’ interest. The approach removes the following kinds of noise in stages: (1) primary noise: navigation bars, panels and frames, page headers and footers, copyright and privacy notices, advertisements and other uninteresting data such as audio, video and multiple links; (2) duplicate contents; and (3) noise contents according to block importance. These are removed by three operations. First, the Block Splitting operation removes primary noise and partitions only the useful text content into blocks. Second, the simhash algorithm removes duplicate blocks to obtain the distinct blocks. Third, for each block, three parameters are calculated, namely Keyword Redundancy (KR), Linkword Percentage (LP) and Titleword Relevancy (TR), and from these the block importance (BI) value is calculated; this calculation is likewise referred to as the simhash algorithm. Based on a threshold value, the important blocks are selected using a sketching algorithm, and keywords are extracted from those important blocks.
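The duplicate-removal step can be illustrated with a minimal sketch of simhash fingerprinting. It is not the paper's implementation: the tokenization, the use of MD5 as the per-token hash, the 64-bit fingerprint width, the `max_distance` threshold and the function names `simhash`, `hamming_distance` and `distinct_blocks` are all assumptions made for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a simhash fingerprint built from the text's word tokens.

    Assumption: unweighted word tokens hashed with MD5; the paper does not
    specify its features or hash function.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def distinct_blocks(blocks, max_distance: int = 3):
    """Keep the first block of each group of near-duplicate blocks.

    Two blocks count as duplicates when their fingerprints differ in at most
    max_distance bits (a hypothetical threshold, not taken from the paper).
    """
    kept = []  # (fingerprint, block) pairs retained so far
    for block in blocks:
        fp = simhash(block)
        if all(hamming_distance(fp, seen) > max_distance for seen, _ in kept):
            kept.append((fp, block))
    return [block for _, block in kept]
```

Calling `distinct_blocks` on the text blocks produced by the splitting step would return the blocks in their original order with near-duplicates dropped. The pairwise comparison is quadratic in the number of blocks, which seems acceptable for the blocks of a single page but would need an index for larger collections.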

Effectual Web Content Mining using Noise Removal from Web Pages
