Programmable Dataflows: Abstraction and Programming Model for Data Sharing

Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez
2024-08-08
Abstract: Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. The lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. In this paper, we first introduce a data sharing model to represent every data sharing problem with a sequence of dataflows. From the model, we distill an abstraction, the contract, which agents use to communicate the intent of a dataflow and evaluate its consequences, before the dataflow takes place. This helps agents move towards a common sharing goal without violating any regulatory and privacy constraints. Then, we design and implement the contract programming model (CPM), which allows agents to program data sharing applications catered to each problem's needs. Contracts permit data sharing, but their interactive nature may introduce inefficiencies. To mitigate those inefficiencies, we extend the CPM so that it can save intermediate outputs of dataflows, and skip computation if a dataflow tries to access data that it does not have access to. In our evaluation, we show that 1) the contract abstraction is general enough to represent a wide range of sharing problems, 2) we can write programs for complex data sharing problems and exhibit qualitative improvements over other alternate technologies, and 3) quantitatively, our optimizations make sharing programs written with the CPM efficient.
Databases
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses three core issues in data sharing:

1. **Lack of a universal data sharing abstraction**: Current data sharing solutions are often bespoke, requiring a custom design for each problem. This leads to high costs and hinders value generation.
2. **Information asymmetry**: Participants often lack the information they need to assess whether a dataflow meets their goals and constraints, so they default to not sharing data in order to avoid potential risks.
3. **Performance inefficiency**: Existing data sharing methods may be inefficient due to frequent manual interventions and complex compliance checks.

To address these issues, the paper makes the following contributions:

- **New data sharing model**: Each data sharing problem is represented as a sequence of dataflows, where each dataflow is a data exchange between participants.
- **Contract abstraction**: A contract lets participants state explicitly, before a dataflow occurs, who contributes data, how the data is processed, who receives the results, and under what conditions. This helps participants reach common sharing goals without violating regulatory and privacy constraints.
- **Contract Programming Model (CPM)**: The paper designs and implements a contract programming model that lets developers write data sharing applications tailored to specific problems, reducing the cost of solving each one.
- **Optimization mechanisms**: To improve performance, the CPM is extended to save intermediate outputs of dataflows and to skip computation when a dataflow tries to access data it does not have access to, thereby avoiding unnecessary work.
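The contract abstraction described above can be illustrated with a minimal Python sketch. This is not the paper's CPM API: the `Contract` class, its fields, and its methods are hypothetical names chosen for illustration, and the only "condition" modeled here is the assumption that every contributing agent must approve before the dataflow runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Contract:
    """Hypothetical contract describing one dataflow before it takes place."""

    # Who contributes data: agent name -> the values that agent contributes.
    sources: Dict[str, List[int]]
    # How the data is processed: a function over the combined inputs.
    function: Callable[[List[int]], int]
    # Who receives the result.
    recipients: Set[str]
    # Condition (assumed for this sketch): all source agents must approve.
    approvals: Set[str] = field(default_factory=set)

    def approve(self, agent: str) -> None:
        """Record approval from an agent that contributes data."""
        if agent not in self.sources:
            raise ValueError(f"{agent} does not contribute data to this contract")
        self.approvals.add(agent)

    def is_approved(self) -> bool:
        """True once every contributing agent has approved."""
        return self.approvals == set(self.sources)

    def run(self, requester: str) -> int:
        """Execute the dataflow only for a listed recipient of an approved contract."""
        if requester not in self.recipients:
            raise PermissionError(f"{requester} is not a recipient")
        if not self.is_approved():
            raise PermissionError("contract is not approved by all sources")
        combined = [x for values in self.sources.values() for x in values]
        return self.function(combined)
```

In this sketch, a contract proposed over data from two agents cannot be executed until both have called `approve`, which mirrors the idea that agents evaluate a dataflow's consequences before it takes place.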
Through these contributions, the paper aims to provide a programmable dataflow paradigm that allows participants to analyze sharing problems, identify which dataflows should or should not occur, and write programs that control the data sharing process so that it respects regulatory constraints and privacy preferences.
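The two optimizations mentioned above, saving intermediate outputs and skipping computation for inaccessible data, can be sketched as follows. This is an illustrative assumption about how such a mechanism might look, not the paper's implementation: `DataflowRunner`, its access table, and the cache key are hypothetical, and the cache assumes the underlying data does not change between runs.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


class DataflowRunner:
    """Hypothetical runner that caches outputs and skips inaccessible dataflows."""

    def __init__(self, access: Dict[str, Set[str]]) -> None:
        # agent name -> ids of the data elements that agent may read
        self.access = access
        # (function name, sorted input ids) -> saved intermediate output
        self.cache: Dict[Tuple[str, Tuple[str, ...]], int] = {}
        # number of times a function was actually executed
        self.executions = 0

    def run(
        self,
        agent: str,
        func_name: str,
        func: Callable[[Dict[str, List[int]]], int],
        inputs: Dict[str, List[int]],
    ) -> Optional[int]:
        # Skip: check access first, so no computation happens on denied data.
        denied = set(inputs) - self.access.get(agent, set())
        if denied:
            return None

        # Save: reuse the stored output for the same function and inputs.
        key = (func_name, tuple(sorted(inputs)))
        if key not in self.cache:
            self.executions += 1
            self.cache[key] = func(inputs)
        return self.cache[key]
```

Checking access before touching the cache keeps the two concerns separate: a denied request never triggers computation, and a permitted request that repeats earlier work returns the saved output.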