Abstract: With the rapid development of Internet technology, people have access to an ever-growing variety of web page resources, and the current progress of deep learning depends heavily on large volumes of Web data. Natural language processing (NLP) is likewise an important part of data processing, and web page data extraction is one of its applications. At present, web page text extraction mainly relies on a single heuristic function or strategy, and most such methods require a manually chosen threshold. As the number and variety of web resources grow rapidly, a single strategy still struggles to extract the text information of different kinds of pages. This paper proposes a web page text extraction algorithm based on multi-feature fusion. Based on the characteristics of text information in web resources, the method takes DOM nodes as the extraction unit, designs multiple statistical features for them, and derives high-order features from heuristic strategies. It then builds a small neural network that takes the features of a DOM node as input and predicts whether the node contains text information. In this way it makes full use of different statistics and extraction strategies and adapts to more types of pages. Experimental results show that the method extracts web page text well and avoids the need to set a threshold manually.
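
The pipeline outlined in the abstract (per-node statistical features fed to a small neural network) might look roughly like the sketch below. It is only an illustration under stated assumptions: the abstract does not list the actual features or the network layout, so the four features (log text length, link-text ratio, punctuation density, text density), the punctuation set, the names `build_tree`, `node_features` and `predict`, and the placeholder weights are all hypothetical. In practice the weights would be learned from labeled pages.

```python
import math
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags without children
PUNCTUATION = set(",.;:!?，。；：！？")  # assumed punctuation set

class Node:
    """Minimal DOM-like node: tag name, parent, children and own text."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        if parent is not None:
            parent.children.append(self)

class TreeBuilder(HTMLParser):
    """Builds a simplified element tree from an HTML string."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:
            self.current = node  # descend only into non-void elements

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray end tags
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data

def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def subtree_stats(node):
    """Return (chars, link_chars, punctuation, tags) over the node's subtree."""
    text = "".join(node.text.split())  # count non-whitespace characters only
    chars = len(text)
    punct = sum(ch in PUNCTUATION for ch in text)
    link = chars if node.tag == "a" else 0
    tags = 1
    for child in node.children:
        c, l, p, t = subtree_stats(child)
        chars, punct, tags = chars + c, punct + p, tags + t
        link += c if node.tag == "a" else l  # all text under <a> is link text
    return chars, link, punct, tags

def node_features(node):
    """Hypothetical feature vector for one DOM node."""
    chars, link, punct, tags = subtree_stats(node)
    link_ratio = link / chars if chars else 0.0
    punct_density = punct / chars if chars else 0.0
    text_density = chars / tags
    return [math.log1p(chars), link_ratio, punct_density, text_density]

def predict(x, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer network with a sigmoid output."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder (untrained) weights, for illustration only.
W1 = [[0.5, -2.0, 3.0, 0.1], [-0.3, 1.5, -1.0, 0.05]]
B1 = [0.0, 0.0]
W2 = [1.2, -0.8]
B2 = 0.0

page = ('<div><p>Hello, world. This is text.</p>'
        '<ul><li><a href="#">Home</a></li></ul></div>')
div = build_tree(page).children[0]
for child in div.children:
    print(child.tag, predict(node_features(child), W1, B1, W2, B2))
```

Because the network outputs a probability for each node, the decision about which nodes to keep comes from learned weights rather than from a hand-tuned cutoff on any single feature, which is the property the abstract emphasizes.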