Missing Information Management for Massive Sparse Data

Yuxin Chen, Shun Li, Jiahui Yao
DOI: https://doi.org/10.1109/bds/hpsc/ids18.2018.00058
2018-01-01
Abstract: Handling missing information is essential to the efficiency and robustness of database systems. The sparsity of massive data in big data environments makes the problem more prominent. Existing methods either have limited semantic expressiveness or do not consider the influence of the big data environment. Missing information in large-scale sparse data tends to carry richer semantics, which leads to more complex computational logic and affects operations such as data queries. To address these problems, this paper proposes a novel missing information management method based on logic operation definition and relational algebra extension. Drawing on practical cases from big data environments, we classify missing information into two types, unknown values and nonexistent values, and define a four-valued logic to support logic operations. Based on the dynamic table model, we systematically extend relational algebra to describe data operations on massive sparse data. Our method is implemented in Muldas, a self-developed big data management system. Experimental results on a real large-scale sparse data set show that the proposed four-valued logic and relational algebra extension for missing information offer good semantic expressiveness and computational efficiency.
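To make the idea of a four-valued logic over two kinds of missing information concrete, the following is a minimal Python sketch. It is not the paper's definition: the abstract does not give the truth tables, so the rules used here are assumptions chosen for illustration (FALSE dominates conjunction, then UNKNOWN, then NONEXISTENT; negation leaves both missing values unchanged; disjunction follows from De Morgan's law). The names `L4`, `and4`, `or4`, `not4`, `greater_than`, and `select`, and the convention that an absent key means a nonexistent value while `None` means an unknown value, are likewise hypothetical.

```python
from enum import Enum


class L4(Enum):
    """Four truth values: two classical, two for missing information."""
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"          # a value exists but is not known
    NONEXISTENT = "nonexistent"  # the attribute does not apply to the row


def not4(a):
    # Assumed rule: negation swaps TRUE/FALSE and keeps missing values.
    if a is L4.TRUE:
        return L4.FALSE
    if a is L4.FALSE:
        return L4.TRUE
    return a


def and4(a, b):
    # Assumed precedence: FALSE, then UNKNOWN, then NONEXISTENT, else TRUE.
    if L4.FALSE in (a, b):
        return L4.FALSE
    if L4.UNKNOWN in (a, b):
        return L4.UNKNOWN
    if L4.NONEXISTENT in (a, b):
        return L4.NONEXISTENT
    return L4.TRUE


def or4(a, b):
    # Disjunction derived from conjunction via De Morgan's law.
    return not4(and4(not4(a), not4(b)))


def greater_than(row, column, threshold):
    """Compare a sparse row's column against a threshold."""
    if column not in row:      # absent key: nonexistent value (assumption)
        return L4.NONEXISTENT
    if row[column] is None:    # explicit None: unknown value (assumption)
        return L4.UNKNOWN
    return L4.TRUE if row[column] > threshold else L4.FALSE


def select(rows, predicate):
    """Selection that keeps only rows whose predicate is definitely TRUE."""
    return [row for row in rows if predicate(row) is L4.TRUE]


# Sparse rows: each dict stores only the columns that row actually has.
rows = [
    {"id": 1, "age": 42},
    {"id": 2, "age": None},
    {"id": 3},
    {"id": 4, "age": 20},
]
older_than_30 = select(rows, lambda row: greater_than(row, "age", 30))
```

Under these assumptions the selection keeps a row only when the predicate is definitely true, so rows with an unknown or nonexistent `age` are filtered out while remaining distinguishable from each other in intermediate results.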