Guest Editors' Introduction: Special Section on Mining and Searching the Web
B. Liu, Soumen Chakrabarti
DOI: https://doi.org/10.1109/TKDE.2004.1264817
IF: 9.235
IEEE Transactions on Knowledge and Data Engineering
Abstract: With the phenomenal growth of the Web, there is an ever-increasing volume of information being published on numerous Web sites. This vast amount of accessible information has raised many new opportunities and challenges for knowledge discovery and data engineering researchers. For programs that seek to analyze Web content, the heterogeneity in authorship and the consequent lack of structure are formidable hurdles. Discovering and extracting novel and useful knowledge from Web sources call for innovative approaches that draw from a wide range of fields spanning data mining, machine learning, statistics, databases, information retrieval, artificial intelligence, and natural language processing. In Web search, although general-purpose search engines are very useful, finding specific or targeted information can still be a frustrating experience. Highly effective, domain-specific, and personalized search techniques are not yet mainstream. In e-commerce, a whole range of online techniques are also needed to support such applications. For example, in online shopping, there are no human shop assistants to help customers. Instead, automated techniques are needed to learn from the behaviors of users in order to provide effective recommendations and assistance. Mining, extracting, and integrating Web information are challenging problems as well because there is still no mature technique to integrate information from structured (stored database), ad hoc structured (shopping sites), and unstructured (product reviews) sources. Clearly, format standards for semistructured data will not solve all of these problems. This special issue of IEEE Transactions on Knowledge and Data Engineering brings together some of the latest research results in the field. It presents seven papers which deal with a wide range of problems. All of the accepted papers propose some novel and/or principled techniques to solve these problems.
Of the seven papers, three focus on domain-specific and personalized Web search, one proposes a principled technique for collaborative filtering, one studies Web page cleaning for identifying informative structures and content blocks in Web pages, one studies classification of Web pages based on positive and unlabeled training examples, and one studies the clustering of XML data for efficient storage and querying of such data.
The first paper by Michelangelo Diligenti, Marco Gori, and Marco Maggini studies Web page scoring for Web search and resource discovery. Current methods for this purpose are mainly based on the analysis of hyperlinks. The structure of the hyperlinks is the result of collaborative activities of the community of Web authors. Web authors usually like to link to resources they consider authoritative, and authority emerges from the dynamics of popularity of the resources on the Web. This paper proposes a general probabilistic framework for Web page scoring, based on random walks over links, that incorporates and extends many existing models. Their results show that the proposed framework is effective and is particularly suited for focused or vertical search.
The second paper by Satoshi Oyama, Takashi Kokubo, and Toru Ishida describes an interesting technique for domain-specific Web search. The basic idea is to find a set of domain-specific keywords (which the authors call keyword spices) that can be used as the context of the search queries in the domain. A nice algorithm based on text classification is given for identifying a reasonably complete set of such keyword spices. To perform text classification, it collects training pages from the Web through a search using an initial set of keywords of the domain. The main advantage of the proposed method is that it does not need to collect and index domain-specific pages as most domain-specific search engines do.
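The keyword-spice idea summarized above can be illustrated with a minimal sketch. It is not the authors' algorithm: the paper identifies spices through text classification, whereas this sketch swaps in a simple document-frequency heuristic. The function names `select_spices` and `augment_query`, the whitespace tokenization, and the `AND`/`OR` query syntax are all assumptions made for illustration.

```python
from collections import Counter


def select_spices(relevant_pages, irrelevant_pages, k=3):
    """Pick k candidate spices (hypothetical heuristic, not the paper's method).

    A word scores highly when it appears in a large fraction of the
    in-domain pages and a small fraction of the out-of-domain pages.
    """
    def doc_freq(pages):
        counts = Counter()
        for page in pages:
            # Count each word at most once per page (document frequency).
            counts.update(set(page.lower().split()))
        return counts

    rel = doc_freq(relevant_pages)
    irr = doc_freq(irrelevant_pages)
    n_rel = max(len(relevant_pages), 1)
    n_irr = max(len(irrelevant_pages), 1)

    scores = {w: rel[w] / n_rel - irr[w] / n_irr for w in rel}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


def augment_query(user_query, spices):
    """Attach the spices to the user's query as a disjunctive context.

    The Boolean syntax is an assumption; a real search engine may differ.
    """
    if not spices:
        return user_query
    return f"({user_query}) AND ({' OR '.join(spices)})"


# Toy training pages, invented purely for this example.
relevant = [
    "beef stew recipe ingredients tablespoon",
    "chicken recipe ingredients oven",
]
irrelevant = [
    "beef prices market report",
    "chicken farm news",
]
spices = select_spices(relevant, irrelevant, k=2)
augmented = augment_query("beef", spices)
```

Because the augmented query is sent to a general-purpose search engine, a design along these lines needs no domain-specific crawl or index, which matches the advantage the editors attribute to the approach.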
The work is also related to research in query expansion and modification, but deals with a slightly different problem and offers different approaches.
The third paper by Fang Liu, Clement Yu, and Weiyi Meng also studies Web search, more specifically, personalized Web search. Since general-purpose search engines do not consider a user’s interests, their search results may not be interesting to a specific user. Personalized search aims at carrying out search for each user incorporating his/her interests. In this paper, the authors propose to employ a user profile and a general profile to constrain the search. The user profile is learned from the user’s search history, which contains the categories the user is interested in and weighted terms in those categories. The general profile is built using the categories from the Open Directory Project. The key advance of the technique is that it maps each user query to some categories. At search time, the system first uses the profiles to infer the categories of the search terms in question. Then, the search terms are augmented with each category as the context to perform search. The search results are then merged to produce a single result ranking. A comprehensive experimental evaluation is described in the paper.
The fourth paper by Hung-Yu Kao, Shian-Hua Lin, Jan-Ming Ho, and Ming-Syan Chen focuses on the cleaning of
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004