Learning Autoregressive Model in LSM-Tree Based Store

Yunxiang Su, Wenxuan Ma, Shaoxu Song
DOI: https://doi.org/10.1145/3580305.3599405
2023-01-01
Abstract: Database-native machine learning operators are highly desirable for reducing I/O and computation costs. While most existing machine learning algorithms assume that time series data are fully available and already ordered by timestamp, this is not the case in practice. Commodity time series databases store the data in pages with possibly overlapping time ranges, known as LSM-Tree based storage. The data points in a page may be incomplete, owing to either missing values or out-of-order arrivals; the gaps may later be filled by imputed or delayed points inserted in subsequent pages. Likewise, data points in a page may be updated by points in another page, for example to repair dirty data or after re-transmission. A straightforward idea is thus to first merge and order the data points by timestamp, and then apply the existing learning algorithms. This approach is not only costly in I/O but also prevents precomputation for model learning. In this paper, we propose to learn autoregressive (AR) models offline, locally in each page on incomplete data, and to aggregate the stored models from different pages online, taking into account the aforesaid inserted and updated data points. Remarkably, the proposed method has been deployed and included as a function in an open source time series database, Apache IoTDB. Extensive experiments in the system demonstrate that our proposal, LSMAR, achieves up to an order-of-magnitude improvement in learning time, needing only tens of milliseconds to learn over 1 million data points.
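The general idea of precomputing per-page statistics and aggregating them at query time can be illustrated with a minimal Python sketch. It is not the LSMAR algorithm from the paper: it assumes a zero-mean series, pages that are contiguous and non-overlapping, and no missing, inserted, or updated points, which are exactly the cases the paper sets out to handle. The names `PageStats`, `page_stats`, `merge`, and `solve_ar` are hypothetical and chosen for illustration only; the AR coefficients are obtained from the standard Yule-Walker equations.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PageStats:
    """Hypothetical per-page summary (not the paper's stored model)."""
    sums: List[float]  # sums[k] = sum of x[t] * x[t-k] over pairs inside the page
    head: List[float]  # first p values of the page
    tail: List[float]  # last p values of the page


def page_stats(values: Sequence[float], p: int) -> PageStats:
    """Offline step: summarize one page (assumes len(values) >= p)."""
    n = len(values)
    sums = [sum(values[t] * values[t - k] for t in range(k, n))
            for k in range(p + 1)]
    return PageStats(sums, list(values[:p]), list(values[n - p:]))


def merge(a: PageStats, b: PageStats) -> PageStats:
    """Online step: combine page a with the page b that directly follows it."""
    p = len(a.sums) - 1
    sums = []
    for k in range(p + 1):
        # Pairs (t, t-k) that straddle the boundary: t in b, t-k in a.
        cross = sum(b.head[j] * a.tail[p - k + j] for j in range(k))
        sums.append(a.sums[k] + b.sums[k] + cross)
    return PageStats(sums, a.head, b.tail)


def solve_ar(sums: Sequence[float]) -> List[float]:
    """Solve the Yule-Walker system R * phi = r by Gaussian elimination."""
    p = len(sums) - 1
    # Augmented matrix [R | r], with R[i][j] = sums[|i-j|] and r[i] = sums[i+1].
    m = [[sums[abs(i - j)] for j in range(p)] + [sums[i + 1]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, p):
            f = m[row][col] / m[col][col]
            for c in range(col, p + 1):
                m[row][c] -= f * m[col][c]
    phi = [0.0] * p
    for i in reversed(range(p)):
        acc = m[i][p] - sum(m[i][j] * phi[j] for j in range(i + 1, p))
        phi[i] = acc / m[i][i]
    return phi


# Toy data, made up for illustration: one series split into three pages.
series = [0.5, 0.9, 0.3, -0.4, -0.8, -0.2, 0.6, 1.0, 0.4, -0.5, -0.9, -0.1]
pages = [series[0:4], series[4:8], series[8:12]]
p = 2

merged = page_stats(pages[0], p)
for page in pages[1:]:
    merged = merge(merged, page_stats(page, p))

phi = solve_ar(merged.sums)  # AR(2) coefficients from the aggregated statistics
```

In this simplified setting, the merged statistics equal those computed over the fully ordered series, so the raw points never need to be re-read at query time. Handling overlapping time ranges, inserted points, and updates requires the additional machinery described in the paper.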