Zebra: A Novel Method for Optimizing Text Classification Query in Overload Scenario

Tianhuan Yu, Zhenying He, Zhihui Yang, Fei Ye, Yuankai Fan, Yinan Jing, Kai Zhang, X. Sean Wang
DOI: https://doi.org/10.1007/s11280-022-01061-y
2022-01-01
World Wide Web
Abstract: Text classification is a crucial task in text mining, and it can be embedded in queries through user-defined functions (UDFs). In many web applications, such as Twitter mining or real-time Weibo processing, the volume of text to be processed is enormous and the system frequently becomes overloaded. In a streaming scenario, the query delays caused by overload degrade the user experience. This paper focuses on queries that apply text classification to streaming data. We propose a novel method called Zebra, which uses progressive pipelines to optimize queries under overload. The core module of Zebra is a probabilistic filter that discards a large share of the text data based on the semantic information of the query predicate. We train weak classifiers as filters, using data labeled by brute-force pipelines. We then use a parameter search method to choose a suitable filter with the best settings and apply it to the progressive pipelines. Experiments with several text workloads on real-world datasets show that Zebra consistently achieves higher accuracy while answering queries on time.
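The filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token log-odds scorer, the recall-based threshold choice, and every name in it (`expensive_udf`, `train_weak_filter`, `choose_threshold`, `filtered_query`) are assumptions made for illustration. It shows only the general pattern of labeling a sample with the costly classifier, fitting a cheap filter on those labels, picking a filter setting, and then running the costly classifier only on records that pass the filter.

```python
import math
from collections import Counter


def expensive_udf(text):
    """Hypothetical stand-in for the costly classifier used by the brute-force path."""
    return "refund" in text or "broken" in text


def train_weak_filter(texts, labels):
    """Fit per-token log-odds weights (add-one smoothing) from brute-force labels."""
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        (pos if label else neg).update(set(text.split()))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {
        tok: math.log((pos[tok] + 1) / (n_pos + 2))
        - math.log((neg[tok] + 1) / (n_neg + 2))
        for tok in set(pos) | set(neg)
    }


def score(weights, text):
    """Sum the weights of the distinct tokens; unseen tokens contribute zero."""
    return sum(weights.get(tok, 0.0) for tok in set(text.split()))


def choose_threshold(weights, texts, labels, min_recall):
    """Return the highest threshold that keeps at least min_recall of the positives.

    This one-parameter search is an assumed simplification of a parameter search.
    """
    pos_scores = sorted(
        (score(weights, t) for t, y in zip(texts, labels) if y), reverse=True
    )
    if not pos_scores:
        return float("-inf")  # nothing to calibrate on, so let everything through
    keep = max(1, math.ceil(min_recall * len(pos_scores)))
    return pos_scores[keep - 1]


def filtered_query(stream, weights, threshold):
    """Call the expensive UDF only on records whose filter score reaches the threshold."""
    hits, udf_calls = [], 0
    for text in stream:
        if score(weights, text) < threshold:
            continue  # dropped by the cheap filter
        udf_calls += 1
        if expensive_udf(text):
            hits.append(text)
    return hits, udf_calls


# Toy sample, labeled by the expensive path.
labelled = [
    "my order arrived broken",
    "please refund my order",
    "great service thanks",
    "love the new phone",
    "refund requested for broken screen",
    "nice weather today",
]
labels = [expensive_udf(t) for t in labelled]

weights = train_weak_filter(labelled, labels)
threshold = choose_threshold(weights, labelled, labels, min_recall=1.0)
hits, udf_calls = filtered_query(labelled, weights, threshold)
print(len(hits), "hits using", udf_calls, "UDF calls on", len(labelled), "records")
```

Lowering `min_recall` raises the threshold, so fewer records reach the expensive classifier at the cost of possibly missing some matches. On unseen text the filter can also drop true matches that the sample did not anticipate, which is the accuracy-versus-latency trade-off such a filter has to manage.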