Abstract: The Internet is growing rapidly. As of December 2014, the number of websites worldwide had exceeded 1 billion, and all kinds of information resources are integrated on the Internet. As a result, the search engine has become a necessary tool for users to retrieve useful information from vast amounts of web data. Generally speaking, a complete search engine includes a crawler system, an index-building system, a sorting system, and a retrieval system. Many open-source search engine implementations exist today, such as Lucene, Solr, Katta, Elasticsearch, and Solandra. The crawler system and the sorting system are indispensable for any search engine. To keep the search engine efficient, the former must keep vast amounts of crawled data up to date, and the latter must build the index for newly crawled web pages and calculate their PageRank values in real time. A crawler system and sorting system running on a single machine are unlikely to handle such large computation tasks, which is why distributed cluster technology comes to the fore. In this paper, we use the Hadoop MapReduce computing framework to implement a distributed crawler system, and we use GraphLite, a distributed synchronous graph-computing framework, to compute the PageRank values of newly crawled web pages in real time.
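To illustrate the synchronous computation model mentioned in the abstract, the following is a minimal single-machine sketch of PageRank in plain Python. It does not use GraphLite or Hadoop and is not the paper's implementation. The function name `pagerank`, the damping factor of 0.85, the fixed superstep count, and the even redistribution of rank from pages without out-links are all assumptions made for this example.

```python
def pagerank(graph, damping=0.85, supersteps=30):
    """Synchronous PageRank sketch (hypothetical helper, not the paper's code).

    graph maps each page to the list of pages it links to.
    In every superstep, all pages are updated from the ranks of the
    previous superstep, which mimics a synchronous graph-computing model.
    """
    # Collect every page that appears as a source or as a link target.
    vertices = set(graph)
    for targets in graph.values():
        vertices.update(targets)
    n = len(vertices)

    # Start from a uniform distribution.
    ranks = {v: 1.0 / n for v in vertices}

    for _ in range(supersteps):
        incoming = {v: 0.0 for v in vertices}
        dangling = 0.0  # rank held by pages without out-links

        # "Message" phase: each page sends an equal share along its out-links.
        for v in vertices:
            targets = graph.get(v, [])
            if targets:
                share = ranks[v] / len(targets)
                for t in targets:
                    incoming[t] += share
            else:
                dangling += ranks[v]

        # "Update" phase: new ranks depend only on the previous superstep.
        # Assumption: dangling rank is spread evenly over all pages.
        ranks = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return ranks


if __name__ == "__main__":
    # Tiny hypothetical link graph, for illustration only.
    example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(example).items()):
        print(page, round(rank, 4))
```

In a distributed setting, the message phase would be spread across workers and the update phase would run after a barrier between supersteps. The sketch keeps both phases in one process so the data flow stays visible.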

Design and Implementation of Crawler Program Based on Python

Design and Implementation of Crawler Based on Scrapy

Implementation of Web Data Mining Technology Based on Python

Employment Data Analysis based on Python Crawler Technology

Design and Research of Web Crawler Based on Distributed Architecture

Research on Data Collection and Analysis of Second Hand House in China Based on Python

Implementation of Recruitment Website Data Analysis System Based on Web Crawler

Application of Web Crawler Technology Based on Python in Big Data Environment

[Application of Python Web Crawler Technology in Infodemiology].

Design and implementation of second-hand housing data statistical analysis system

Web Crawler: Design And Implementation For Extracting Article-Like Contents

Crawler Detection in Location-Based Services Using Attributed Action Net

Image Information Collection System Based on Python Web Crawler Technology

Innovative Application of Python in Data Crawling —Chinese Version of Movie Recommendation Platform

The Application of Web Crawler in City Image Research

Summary of web crawler technology research

Architectural Design and Evaluation of an Efficient Web-Crawling System

Implementation of Distributed Crawler System Based on Spark for Massive Data Mining

Data Crawling and Research Based on Topic Web Crawler

The Implementation of Hadoop-based Crawler System and Graphlite-based PageRank-Calculation In Search Engine

Design of college student employment service platform based on cloud computing