Design and Analysis of a Report Tracing System Based on Webinfomall

HUANG Lian-en, LI Xiao-ming
DOI: https://doi.org/10.3969/j.issn.1007-130x.2008.02.001
2008-01-01
Abstract: Webinfomall is a Chinese web archive developed at Peking University since 2001. To date, it has accumulated about three billion Chinese web pages since early 2002, and it is growing at a rate of one to two million pages a day. Providing an effective information mining system over Webinfomall is a basic challenge we would like to take on. In this article, we describe a pilot effort toward that challenge. In particular, we introduce a system framework (HisTrace), which aims at efficient extraction of reports about historical events. Because of the sheer amount of data in Webinfomall and the noisy nature of web pages, many engineering issues must be addressed. This report analyzes some of the major ones. Finally, we briefly describe the implementation status of HisTrace.