Weakly supervised learning for an effective focused web crawler
P.R. Joe Dhanith, Khalid Saeed, G. Rohith, S.P. Raja
DOI: https://doi.org/10.1016/j.engappai.2024.107944
IF: 8
2024-01-31
Engineering Applications of Artificial Intelligence
Abstract: A focused crawler traverses the Web to collect only pages that are relevant to a particular topic, and is increasingly considered a way around the scalability issues of current general-purpose search engines. However, the diversity of data on the Web confronts these crawlers with three significant problems: (i) inconsistency, (ii) ubiquity, and (iii) ambiguity, which misguide the crawl. To handle these issues, this paper proposes a weakly supervised Gated Recurrent Unit (GRU) mechanism for an adaptive focused web crawler framework that matches semantically relevant topics and webpage content. The weakly supervised GRU model takes the vector forms of the topic and the fetched webpage as input, produces meaningful semantic vectors, and applies the Manhattan distance rule to compute the topical relevance of the webpage. The proposed mechanism guides the focused crawler toward downloading more relevant web pages by finding hyperlinks that are relevant to the topic and omitting those that are not. The method helps the focused crawler semantically find, arrange, and index web pages in a relatively narrow segment of the Web, addressing the inconsistency, ubiquity, and ambiguity problems of focused crawlers. The experimental results indicate that the proposed technique outperforms state-of-the-art approaches in terms of harvest rate, precision, recall, harmonic mean, and irrelevance ratio. In summary, the proposed strategy is effective for focused crawling.
automation & control systems,computer science, artificial intelligence,engineering, electrical & electronic, multidisciplinary
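
The relevance step described in the abstract (comparing a topic vector with a webpage vector using the Manhattan distance) can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so several parts here are assumptions: the `exp(-distance)` mapping is borrowed from common Siamese recurrent network practice, the `threshold` value is hypothetical, and the plain lists stand in for the semantic vectors a GRU encoder would produce. The `harvest_rate` helper uses the commonly cited definition (relevant pages divided by pages downloaded), which may differ from the paper's exact formulation.

```python
import math


def manhattan_similarity(u, v):
    """Map the Manhattan (L1) distance between two vectors into (0, 1].

    Assumption: exp(-L1) is one common way to turn the distance into a
    similarity score; the paper's exact rule is not stated in the abstract.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    l1 = sum(abs(a - b) for a, b in zip(u, v))
    return math.exp(-l1)


def is_relevant(topic_vec, page_vec, threshold=0.5):
    """Treat a page as on-topic when its similarity reaches the threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return manhattan_similarity(topic_vec, page_vec) >= threshold


def harvest_rate(relevant_downloaded, total_downloaded):
    """Fraction of downloaded pages judged relevant (common definition)."""
    if total_downloaded == 0:
        return 0.0
    return relevant_downloaded / total_downloaded
```

In a crawler loop, a score like this would decide whether the outgoing hyperlinks of a fetched page are added to the frontier or dropped, which is how the abstract describes the crawler omitting irrelevant links.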