Abstract: The World Wide Web is a vast and rapidly growing repository of information. Many kinds of real-world objects, such as products, people, and conferences, are embedded in both statically and dynamically generated Web pages. Extracting information about these objects is a key technique for Web mining systems. For example, object-level search engines such as Libra (http://libra.msra.cn) and Rexa (http://rexa.info), which help researchers find academic information such as papers, conferences, and researchers’ personal information, rely entirely on structured Web object information. However, extracting object information from diverse Web pages is a challenging problem. Traditional methods are mainly template-dependent and therefore do not scale to the huge number of Web pages. Furthermore, many methods are based on heuristic rules and are therefore not robust. Recent developments in statistical machine learning make it possible to build advanced statistical Web object extraction models. One key difference between Web object extraction and traditional information extraction from natural language text documents is that Web pages carry rich structural information, such as two-dimensional spatial layouts and hierarchical vision-tree representations. Properly designed statistical models can leverage this information effectively. Another challenge of Web object extraction is that much of the text on Web pages does not consist of regular natural language sentences: it has some structure but lacks natural language grammar. Existing natural language processing (NLP) techniques are therefore not directly applicable. Fortunately, statistical Web object extraction models can be readily combined with statistical NLP methods, which have been a central theme in natural language processing over recent decades. In this way, the structural information on Web pages can be used to help process text content, and traditional NLP methods can be used to extract more features. Finally, Web object extraction from diverse, large-scale Web pages poses a valuable and challenging problem for machine learning researchers. To solve it well, new learning methodologies and new models (Zhu et al., 2007b) have to be developed.

How “small” Reflects “large”?—representative Information Measurement and Extraction

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

A Combined Measure for Representative Information Retrieval in Enterprise Information Systems.

A heuristic approach for λ-representative information retrieval from large-scale data

A Combined Measure for Representativeness on Information Retrieval in Web Search

Extracting representative information to enhance flexible data queries.

Finding Representative Set from Massive Data.

Finding an λ-representative subset from massive data

An Incremental Approach to Efficiently Retrieving Representative Information for Mobile Search on Web

Extracting representative information on intra-organizational blogging platforms

Statistical Web Object Extraction

Solution to Large Scale Extraction of Social Relations of Persons Based on Web

Continuously Extracting High-Quality Representative Set from Massive Data Streams.

Extending Representative Information Extraction Based on Fuzzy Classification

Efficient Entity Relation Discovery on Web

WIEAS: Helping to Discover Web Information Sources and Extract Data from Them

Estimating Collection Size in Distributed Search

Assessing the quality of information extraction

Extracting a Diverse Information Subset by Considering Information Coverage and Redundancy Simultaneously

Statistical Entity Extraction From the Web.

Web-scale extraction of structured data