Temporal Provenance Discovery in Micro-Blog Message Streams (Abstract Only)

Zijun Xue, Junjie Yao, Bin Cui
DOI: https://doi.org/10.1145/2213836.2213973
2012-01-01
Abstract: Recent years have witnessed the rapid growth of micro-blog messaging applications. Prominent examples include Twitter, Facebook status updates, and Sina Weibo in China. Messages in these applications are short (at most 140 characters per message) and easy to create. The subscription and re-sharing features also make them easy to propagate. Micro-blog applications provide abundant information that reveals world-scale user interests and the social pulse in unexpected ways. But this valuable corpus also comes with noise and fast-changing fragments that hinder effective understanding and management. In this work, we propose a micro-blog provenance model to capture temporal connections among micro-blog messages. Here, provenance refers to data origin identification and transformation logging, which have proven to be of great value in recent database and workflow systems. The provenance model is used to represent the development trail of a message and its changes explicitly. We select various types of connections in micro-blog applications to identify the provenance. To cope with the real-time deluge of micro-messages, we discuss a novel message grouping approach to encode and maintain the provenance information. A summary index structure is used to enable efficient provenance updating. We collect incoming messages and compare them with an in-memory index to associate them with related ones. Closely related messages form a virtual provenance representation at a coarse granularity. We periodically dump the in-memory values to disk. In the actual implementation, we also introduce several adaptive pruning strategies to make provenance discovery more efficient. We use temporal decay and granularity levels to filter out messages that are unlikely to be related. In the demonstration, we show the usefulness of provenance information for rich query retrieval and for dynamic message tracking, which together support effective message organization.
The real-time collection approach shows advantages over several baselines. Experiments conducted on a real dataset verify the effectiveness and efficiency of our provenance approach. Results show that the partial-indexing strategy and the other restriction strategies can maintain accuracy at 90% and a returning rate of 60% with reasonably low memory usage. This is the first work towards provenance-based indexing support for micro-blog platforms.
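
The grouping loop described in the abstract (compare each incoming message against an in-memory index, attach it to the most closely related group, down-weight old groups, and periodically move values out of memory) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Bundle` and `ProvenanceIndex`, the whitespace tokenizer, the Jaccard overlap score, the exponential half-life decay, and the thresholds are all assumptions made for illustration, and a plain list stands in for the on-disk store.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """A coarse-grained group of related messages (hypothetical structure)."""
    terms: set
    last_seen: float
    messages: list = field(default_factory=list)


class ProvenanceIndex:
    """Toy in-memory summary index; all parameters are assumed values."""

    def __init__(self, half_life=3600.0, min_score=0.2):
        self.half_life = half_life  # seconds until a bundle's weight halves (assumed)
        self.min_score = min_score  # minimum score to join a bundle (assumed)
        self.bundles = []           # in-memory index of active bundles
        self.archived = []          # stand-in for the on-disk store

    def _score(self, terms, bundle, now):
        # Jaccard term overlap, damped by exponential temporal decay.
        union = terms | bundle.terms
        overlap = len(terms & bundle.terms) / len(union) if union else 0.0
        age = max(0.0, now - bundle.last_seen)
        return overlap * 0.5 ** (age / self.half_life)

    def add(self, text, timestamp):
        """Attach a message to its best-matching bundle, or start a new one."""
        terms = set(text.lower().split())
        best, best_score = None, 0.0
        for bundle in self.bundles:
            score = self._score(terms, bundle, timestamp)
            if score > best_score:
                best, best_score = bundle, score
        if best is None or best_score < self.min_score:
            best = Bundle(terms=set(), last_seen=timestamp)
            self.bundles.append(best)
        best.terms |= terms
        best.last_seen = timestamp
        best.messages.append((timestamp, text))
        return best

    def flush(self, now, max_age):
        """Move bundles idle for longer than max_age out of memory."""
        stale = [b for b in self.bundles if now - b.last_seen > max_age]
        self.bundles = [b for b in self.bundles if now - b.last_seen <= max_age]
        self.archived.extend(stale)
        return len(stale)


if __name__ == "__main__":
    index = ProvenanceIndex()
    index.add("earthquake hits city center", 0.0)
    index.add("RT earthquake hits city center now", 60.0)
    index.add("new phone launch today", 120.0)
    for bundle in index.bundles:
        print([text for _, text in bundle.messages])
```

In this sketch the re-shared message joins the bundle of the original because their term overlap stays above the threshold after decay, while the unrelated message starts a new bundle. A real system would need a more selective index than a linear scan over all bundles, which is the role the abstract assigns to the summary index and the pruning strategies.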