Research of Massive Internet Text Data Real-Time Loading and Index System

WeiHong Han, Yan Jia, ShuQiang Yang
DOI: https://doi.org/10.1109/NCM.2009.414
2009-01-01
Abstract: With the rapid development of the Internet and communication technology, massive amounts of text data have accumulated on the Internet, including text from web pages, emails, instant messengers, and so on. Growing data volumes, together with the need to load data and create text indexes in real time, pose enormous challenges to data-loading techniques. This paper presents a real-time data loading system, text-loader, which is used in ITSR (Internet text data real-time storage and retrieval system). Text-loader consists of an efficient algorithm for bulk data loading and an exchange partition mechanism, an incremental text index creation algorithm, optimized parallelism, and guidelines for system tuning. Performance studies show the positive effects of these techniques: the loading speed of each cluster increases from 220 million records per day to 1.2 billion records per day, and the top loading speed reaches 6 TB of data when 10 clusters run in parallel. This framework offers a promising approach for loading other large and complex text databases.