Why TPC is Not Enough: An Analysis of the Amazon Redshift Fleet
Alexander van Renen, Dominik Horn, Pascal Pfeil, Kapil Vaidya, Wenjian Dong, Murali Narayanaswamy, Zhengchun Liu, Gaurav Saxena, Andreas Kipf, Tim Kraska
DOI: https://doi.org/10.14778/3681954.3682031
IF: 2.5
2024-07-01
Proceedings of the VLDB Endowment
Abstract: Database research and development is heavily influenced by benchmarks, such as the industry-standard TPC-H and TPC-DS for analytical systems. However, these twenty-year-old benchmarks capture neither how databases are deployed nor what workloads modern cloud data warehouse systems face today. In this paper, we summarize well-known, confirm suspected, and unearth novel discrepancies between TPC-H/DS and actual workloads using empirical data. We base our analysis on telemetry from Amazon Redshift, one of the largest cloud data warehouse deployments. Among other findings, we show that write-heavy data pipelines are prominent, that workloads vary over time (in both load and type), that queries are repetitive, and that most properties of queries and workloads follow very long-tailed distributions. We conclude that data warehouse benchmarks, just like database systems, need to become more holistic and stop focusing solely on query engine performance. Finally, we publish a dataset containing query statistics of 200 randomly selected Redshift serverless and provisioned instances (each) over a three-month period, as a basis for building more realistic benchmarks.
computer science, information systems, theory & methods