GeaFlow: A Graph Extended and Accelerated Dataflow System.

Zhenxuan Pan, Tao Wu, Qingwen Zhao, Qiang Zhou, Zhiwei Peng, Jiefeng Li, Qi Zhang, Guanyu Feng, Xiaowei Zhu
DOI: https://doi.org/10.1145/3589771
2023-01-01
Abstract: GeaFlow is a distributed dataflow system optimized for streaming graph processing, and has been widely adopted at Ant Group, serving various scenarios ranging from risk control of financial activities to analytics on social networks and knowledge graphs. It is built on top of a base with full-fledged stateful stream processing capabilities, extended with a series of graph-aware optimizations to address the space explosion and programming complexity issues of conventional join-based approaches. We propose new state backends and streaming operators that facilitate processing on dynamic graph-structured datasets, reducing space consumed by states. We develop a hybrid domain-specific language that embeds Gremlin into SQL, supporting both table and graph abstractions over streaming data. In addition to streaming workloads, GeaFlow is also extensively used for some batch processing jobs. In the largest deployments to date, GeaFlow is able to process tens of millions of events per second and manage hundreds of terabytes of states.