To Store or Not to Store: a graph theoretical approach for Dataset Versioning
Anxin Guo, Jingwei Li, Pattara Sukprasert, Samir Khuller, Amol Deshpande, Koyel Mukherjee
2024-02-19
Abstract: In this work, we study the cost-efficient data versioning problem, where the
goal is to optimize the storage and reconstruction (retrieval) costs of data
versions, given a graph of datasets as nodes and edges capturing edit/delta
information. One central variant we study is MinSum Retrieval (MSR) where the
goal is to minimize the total retrieval costs, while keeping the storage costs
bounded. This problem (along with its variants) was introduced by Bhattacherjee
et al. [VLDB'15]. While such problems are frequently encountered in
collaborative tools (e.g., version control systems and data analysis
pipelines), to the best of our knowledge, no existing research studies the
theoretical aspects of these problems.
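To make the two costs concrete, the following is a minimal sketch (not the authors' code) that evaluates one candidate storage plan on a toy version graph. It assumes, for illustration only, that each version is either stored in full (materialized) or reached through stored deltas, that retrieving a materialized version costs 0, and that a version's retrieval cost is the cheapest chain of stored deltas from any materialized version. The function name `evaluate_plan` and all numbers are hypothetical.

```python
import heapq


def evaluate_plan(materialized, stored_deltas):
    """Return (total_storage, retrieval_costs) for one storage plan.

    materialized:  dict version -> cost of storing that version in full.
    stored_deltas: dict (u, v) -> (storage_cost, retrieval_cost) for each
                   delta u -> v that the plan keeps.
    Assumption: a materialized version has retrieval cost 0.
    """
    total_storage = sum(materialized.values())
    total_storage += sum(s for s, _ in stored_deltas.values())

    # Adjacency list over the stored deltas, weighted by retrieval cost.
    adj = {}
    for (u, v), (_, r) in stored_deltas.items():
        adj.setdefault(u, []).append((v, r))

    # Multi-source Dijkstra starting from every materialized version.
    dist = {v: 0 for v in materialized}
    heap = [(0, v) for v in materialized]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, r in adj.get(u, []):
            nd = d + r
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return total_storage, dist


# Plan 1: store only A in full and keep the delta chain A -> B -> C.
storage1, retrieval1 = evaluate_plan(
    {"A": 100},
    {("A", "B"): (10, 4), ("B", "C"): (5, 3)},
)

# Plan 2: additionally store C in full and drop the delta B -> C.
storage2, retrieval2 = evaluate_plan(
    {"A": 100, "C": 90},
    {("A", "B"): (10, 4)},
)
```

Under these assumptions, the second plan uses more storage but has a lower total retrieval cost. MSR asks for the plan with the smallest total retrieval cost among those whose storage stays within a given budget.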
We establish that the currently best-known heuristic, LMG, can perform
arbitrarily badly in a simple worst case. Moreover, we show that it is hard to
obtain an $o(n)$-approximation for MSR on general graphs, even if we relax the
storage constraint by an $O(\log n)$ factor. Similar hardness results are shown
for other variants. Meanwhile, we propose polynomial-time approximation schemes for
tree-like graphs, motivated by the fact that the graphs arising in practice
from typical edit operations are often not arbitrary. As version graphs
typically have low treewidth, we further develop new algorithms for bounded
treewidth graphs.
Furthermore, we propose two new heuristics and evaluate them empirically.
First, we extend LMG by considering more potential "moves", yielding a new
heuristic, LMG-All. LMG-All consistently outperforms LMG while having comparable
run time on a wide variety of datasets, i.e., version graphs. Second, we
apply our tree algorithms on the minimum-storage arborescence of an instance,
yielding algorithms that are qualitatively better than all previous heuristics
for MSR, as well as for another variant, BoundedMin Retrieval (BMR).
Distributed, Parallel, and Cluster Computing; Computational Complexity; Databases; Data Structures and Algorithms