L-Tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises

Xu-Bin Deng, Yang-Yong Zhu
DOI: https://doi.org/10.1007/s11390-005-0763-0
IF: 1.871
2005-01-01
Journal of Computer Science and Technology
Abstract: In this paper, a new method, named L-tree match, is presented for extracting data from complex data sources. Firstly, based on the data extraction logic presented in this work, a new data extraction model is constructed in which model components are structurally correlated via a generalized template. Secondly, a database-populating mechanism is built, along with some object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Thirdly, top-down and bottom-up strategies are combined to design a new extraction algorithm that can extract data from data sources with optional, unordered, nested, and/or noisy components. Lastly, this method is applied to extract accurate data from biological documents amounting to 100 GB for the first online integrated biological data warehouse of China.