InfoMall: A Large-Scale Storage System for Web Archiving

Lian’en Huang, Jinping Li, Xiaoming Li
DOI: https://doi.org/10.1007/978-3-642-39527-7_11
2013-01-01
Abstract: The World Wide Web is a fluid medium: Web pages and entire Web sites frequently change or disappear, often without leaving any trace. Given the great value of the Web, it is necessary to archive the current Web for the future, and doing so requires a large-scale storage system. In this paper we propose such a system, designed to store the massive collection of Web pages we have been collecting continuously since 2001. One significant feature of this collection is that it is space-time dimensioned: every Web page is associated with a URL and a time, and a single URL may correspond to many Web pages crawled at different times. Our system is designed so that sorted Web pages are clustered and stored together at some degree of space-time granularity. As a result, users can effectively retrieve individual Web pages by specifying a URL and a time, or batches of Web pages by specifying a URL range and a time range.
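The space-time organization described in the abstract can be illustrated with a minimal in-memory sketch. This is not InfoMall's implementation, which the abstract does not detail; the class name `SpaceTimeIndex`, its methods, and the use of integer crawl times are all assumptions made for illustration. Pages are keyed by `(url, crawl_time)` and kept in sorted order, so versions of the same URL sit next to each other and both point lookups and range scans are straightforward.

```python
import bisect
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int]  # (url, crawl_time); integer times are an assumption


class SpaceTimeIndex:
    """Hypothetical toy index: pages sorted by URL, then by crawl time."""

    def __init__(self) -> None:
        self._keys: List[Key] = []          # sorted list of (url, crawl_time)
        self._pages: Dict[Key, bytes] = {}  # page content for each key

    def put(self, url: str, crawl_time: int, content: bytes) -> None:
        """Store one crawled version of a page."""
        key = (url, crawl_time)
        if key not in self._pages:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._pages[key] = content

    def get(self, url: str, crawl_time: int) -> Optional[bytes]:
        """Point lookup: the page crawled from `url` at exactly `crawl_time`."""
        return self._pages.get((url, crawl_time))

    def range_query(
        self, url_lo: str, url_hi: str, t_lo: int, t_hi: int
    ) -> List[Tuple[str, int, bytes]]:
        """Batch lookup: all pages with url_lo <= url <= url_hi
        and t_lo <= crawl_time <= t_hi, in sorted key order."""
        # (url_lo,) sorts before every (url_lo, t), so this finds the
        # first key whose URL is >= url_lo.
        start = bisect.bisect_left(self._keys, (url_lo,))
        results = []
        for url, t in self._keys[start:]:
            if url > url_hi:
                break  # keys are sorted, so no later key can match
            if t_lo <= t <= t_hi:
                results.append((url, t, self._pages[(url, t)]))
        return results


# Usage with made-up example URLs and years as crawl times.
index = SpaceTimeIndex()
index.put("http://a.example/", 2001, b"a-2001")
index.put("http://a.example/", 2005, b"a-2005")
index.put("http://b.example/", 2003, b"b-2003")
index.put("http://c.example/", 2004, b"c-2004")

batch = index.range_query("http://a.example/", "http://b.example/", 2002, 2005)
```

A production archive would keep the sorted clusters on disk rather than in a Python list, but the sort order on `(url, crawl_time)` is the property that makes both kinds of retrieval cheap in this sketch.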