Abstract: Although web pages are rich in resources, their main content is usually intertwined with advertisements, banners, navigation bars, footer copyright notices, and other template elements that are rarely of interest to users. In this paper, we study the problem of extracting the main content of web pages and removing irrelevant information. The common solution is to classify each web component as either boilerplate (noise) or main content. State-of-the-art approaches such as BoilerNet use neural sequence labeling to achieve an impressive score on the CleanEval EN dataset. However, BoilerNet uses only the top 50 HTML tags as input features, which does not fully exploit the tag information. In addition, representing text content with only the 1,000 most frequent words cannot effectively support a real-world environment in which web pages appear in multiple languages. We propose a multi-task learning framework based on two auxiliary tasks: depth prediction and position prediction. We explore HTML tag embedding for tag path representation learning. Further, we employ multilingual Bidirectional Encoder Representations from Transformers (BERT) for text content representation, so that the model can handle web pages regardless of language. The experiments show, first, that HTML tag embedding and the multi-task learning framework achieve much higher scores than BoilerNet on the CleanEval EN dataset. Second, the pre-trained text block representation based on multilingual BERT degrades performance on the EN test set; however, zero-shot experiments on three languages (Chinese, Japanese, and Thai) achieve performance consistent with five-fold cross-validation on the respective language, which indicates that a single model could provide cross-lingual support.
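To make the inputs to such a framework concrete, the following minimal sketch shows one way to turn a page into a sequence of text blocks, each carrying a tag path plus a depth and a position value that could serve as auxiliary targets. The abstract does not define these quantities, so everything here is an assumption for illustration: depth is taken to be the number of open ancestor tags, position is the block's relative index in document order, and the names `TextBlockCollector` and `extract_blocks` are hypothetical. It uses only Python's standard `html.parser` and is not the authors' preprocessing.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextBlockCollector(HTMLParser):
    """Collects tag path, depth, and text for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.blocks = []  # one dict per text block, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only nodes and anything inside script/style.
        if text and not ({"script", "style"} & set(self.stack)):
            self.blocks.append({
                "tag_path": list(self.stack),
                "depth": len(self.stack),  # assumed definition of depth
                "text": text,
            })


def extract_blocks(html):
    """Return text blocks with an assumed relative position in [0, 1]."""
    parser = TextBlockCollector()
    parser.feed(html)
    parser.close()
    n = len(parser.blocks)
    for i, block in enumerate(parser.blocks):
        block["position"] = i / (n - 1) if n > 1 else 0.0
    return parser.blocks


if __name__ == "__main__":
    page = ("<html><body><div><a href='#'>Home</a></div>"
            "<div><p>Main text.</p><p>More text.</p></div>"
            "<footer>Copyright</footer></body></html>")
    for block in extract_blocks(page):
        print("/".join(block["tag_path"]), block["depth"],
              round(block["position"], 2), block["text"])
```

In a sequence-labeling setup, each block's tag path would be mapped to learned tag embeddings and its text to a sentence-level encoding, with the boilerplate/content label as the main target; how the depth and position targets are actually encoded and weighted is left open here.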

Multi-Task Neural Sequence Labeling for Zero-Shot Cross-Language Boilerplate Removal
