Why Are Learned Indexes So Effective but Sometimes Ineffective?

Qiyu Liu, Siyuan Han, Yanlin Qi, Jingshu Peng, Jin Li, Longlong Lin, Lei Chen
2024-10-02
Abstract: Learned indexes have attracted significant research interest due to their ability to offer better space-time trade-offs compared to traditional B+-tree variants. Among various learned indexes, the PGM-Index based on error-bounded piecewise linear approximation is an elegant data structure that has demonstrated \emph{provably} superior performance over conventional B+-tree indexes. In this paper, we explore two interesting research questions regarding the PGM-Index: (a) \emph{Why are PGM-Indexes theoretically effective?} and (b) \emph{Why do PGM-Indexes underperform in practice?} For question~(a), we first prove that, for a set of $N$ sorted keys, the PGM-Index can, with high probability, achieve a lookup time of $O(\log\log N)$ while using $O(N)$ space. To the best of our knowledge, this is the \textbf{tightest bound} for learned indexes to date. For question~(b), we identify that querying PGM-Indexes is highly memory-bound, where the internal error-bounded search operations often become the bottleneck. To fill the performance gap, we propose PGM++, a \emph{simple yet effective} extension to the original PGM-Index that employs a mixture of different search strategies, with hyper-parameters automatically tuned through a calibrated cost model. Extensive experiments on real workloads demonstrate that PGM++ establishes a new Pareto frontier. At comparable space costs, PGM++ speeds up index lookup queries by up to $\mathbf{2.31\times}$ and $\mathbf{1.56\times}$ when compared to the original PGM-Index and state-of-the-art learned indexes.
Databases
What problem does this paper attempt to address?
This paper addresses two main questions:

1. **Why is the PGM-Index theoretically effective?** The paper proves that, for a set of \( N \) sorted keys, the PGM-Index achieves a lookup time of \( O(\log \log N) \) with high probability while using only \( O(N) \) space. According to the authors, this is the tightest theoretical bound for learned indexes to date.
2. **Why does the PGM-Index sometimes perform poorly in practice?** The paper observes that querying a PGM-Index is highly memory-bound, and that the internal error-bounded search operations (such as standard binary search) often become the performance bottleneck. To close this gap, the authors propose PGM++, a simple yet effective extension of the original PGM-Index that mixes different search strategies and tunes its hyper-parameters automatically through a calibrated cost model.

### Theoretical Effectiveness

To answer the first question, the paper establishes new theoretical results: the lookup time of the PGM-Index is \( O(\log \log N) \) and the required space is \( O(N/G) \), where \( G \) is a constant determined by the characteristics of the data distribution and the error bound \( \epsilon \). Because \( G \) is a constant, this is consistent with the \( O(N) \) space bound stated above. The result shows that the PGM-Index is theoretically superior to the traditional B+-tree.

### Inefficiency in Practice

For the second question, the paper analyzes extensive benchmark results across multiple hardware platforms and finds that index lookups on the PGM-Index are memory-bound. The internal error-bounded search operations (such as standard binary search) become the bottleneck. Experiments show that fewer than 1% of the internal segments account for more than 80% of the total index lookup time.

### Solution

To address these problems, the authors propose PGM++, an improved PGM-Index structure. It speeds up the internal search by combining linear search with a highly optimized branch-free binary search. In addition, the authors develop a cost model that automatically tunes the hyper-parameters to balance index lookup efficiency against index size.

### Experimental Verification

Extensive experiments show that, at comparable space costs, PGM++ speeds up index lookups by up to 2.31 times compared to the original PGM-Index and by up to 1.56 times compared to state-of-the-art learned indexes. For example, on a resource-constrained device (such as a MacBook Air 2024), PGM++ achieves an index lookup time of less than 400 nanoseconds on 800 million keys while requiring only 0.28 MB of memory.

In conclusion, this paper provides new theoretical bounds for the PGM-Index and proposes PGM++, an effective extension that significantly improves its practical performance.
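
To make the error-bounded search and the mixed search strategy more concrete, here is a minimal Python sketch. It is not the authors' implementation, and it simplifies in several ways that should be read as assumptions: the segments come from a simple greedy "shrinking cone" fit rather than the optimal piecewise linear approximation, the index has a single level instead of a recursive hierarchy, Python's `bisect_left` stands in for the branch-free binary search, and `linear_threshold` is a hypothetical hand-set parameter in place of the values chosen by the calibrated cost model. The function names `build_segments` and `lookup` are illustrative only. Keys are assumed to be strictly increasing.

```python
from bisect import bisect_left, bisect_right
from math import floor, inf


def build_segments(keys, eps):
    """Greedy shrinking-cone fit over strictly increasing keys.

    Returns (first_key, first_pos, slope) triples such that every key in a
    segment is predicted within +/- eps positions of its true position.
    """
    segments = []
    k0, i0, lo, hi = keys[0], 0, 0.0, inf  # [lo, hi] = feasible slopes
    for i in range(1, len(keys)):
        dx, dy = keys[i] - k0, i - i0
        new_lo = max(lo, (dy - eps) / dx)
        new_hi = min(hi, (dy + eps) / dx)
        if new_lo > new_hi:
            # No single slope covers this point too: close the segment.
            segments.append((k0, i0, (lo + hi) / 2))
            k0, i0, lo, hi = keys[i], i, 0.0, inf
        else:
            lo, hi = new_lo, new_hi
    segments.append((k0, i0, 0.0 if hi == inf else (lo + hi) / 2))
    return segments


def lookup(keys, segments, firsts, key, eps, linear_threshold=8):
    """Return the position of key in keys, or None if it is absent.

    firsts holds the first key of each segment. linear_threshold is a
    hypothetical knob: small windows are scanned linearly, larger ones
    are searched with binary search.
    """
    s = bisect_right(firsts, key) - 1  # segment responsible for key
    if s < 0:
        return None
    k0, i0, slope = segments[s]
    pred = floor(i0 + slope * (key - k0))
    # Error-bounded window, widened by one slot for float rounding.
    lo = max(0, pred - eps - 1)
    hi = min(len(keys), pred + eps + 2)
    if hi - lo <= linear_threshold:
        for pos in range(lo, hi):
            if keys[pos] >= key:
                return pos if keys[pos] == key else None
        return None
    pos = bisect_left(keys, key, lo, hi)
    return pos if pos < hi and keys[pos] == key else None


if __name__ == "__main__":
    keys = [i * i + 3 * i for i in range(1000)]
    eps = 4
    segments = build_segments(keys, eps)
    firsts = [seg[0] for seg in segments]
    print(len(segments), "segments for", len(keys), "keys")
    print(lookup(keys, segments, firsts, keys[500], eps))
```

In this sketch the search window has at most \( 2\epsilon + 3 \) slots, so the choice between the two strategies depends only on \( \epsilon \) and the threshold. The summary above indicates that PGM++ makes this kind of choice through a cost model rather than a fixed constant.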