Abstract: In recent work, we have shown that NVIDIA's raytracing cores on RTX video cards can be exploited to realize hardware-accelerated lookups for GPU-resident database indexes. At a high level, the concept materializes all keys as triangles in a 3D scene and indexes them. Lookups are performed by firing rays into the scene and using the index structure to detect hits in a hardware-accelerated fashion. While this approach, called RTIndeX (RX for short), is indeed promising, it currently suffers from three limitations: (1) significant memory overhead per key, (2) slow range lookups, and (3) poor updateability. In this work, we show that all three problems can be tackled by a single design change: generalizing RX into a coarse-granular index, cgRX. Instead of indexing individual keys, cgRX indexes buckets of keys, which are post-filtered after retrieval. This drastically reduces the memory overhead, leads to a smaller and more efficient index structure, and enables fast range lookups as well as updates. We will see that representing the buckets in the 3D space such that the lookup of a key is performed both correctly and efficiently requires the careful orchestration of firing rays in a specific sequence. Our experimental evaluation shows that cgRX offers the most bang for the buck(et) by providing a throughput-to-memory-footprint ratio that is 1.5-3x higher than that of comparable baselines supporting range lookups. At the same time, cgRX improves the range-lookup performance over RX by up to 2x and offers practical updateability that is up to 5.5x faster than rebuilding from scratch.
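The core idea of indexing buckets instead of individual keys can be sketched in a few lines of Python. This is a CPU-only toy under stated assumptions, not the authors' design: keys are sorted and split into fixed-size buckets, the index keeps one representative (the largest key) per bucket, and a binary search over the representatives stands in for the ray-traced BVH lookup that cgRX performs on RTX hardware. The class and method names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


class CoarseBucketIndex:
    """Toy coarse-granular index: one index entry per bucket of sorted keys.

    Hypothetical stand-in: finding the candidate bucket uses a binary search
    (bisect) instead of the hardware ray-triangle intersection tests of cgRX.
    """

    def __init__(self, keys: List[int], bucket_size: int) -> None:
        if bucket_size < 1:
            raise ValueError("bucket_size must be >= 1")
        self.sorted_keys = sorted(keys)
        self.bucket_size = bucket_size
        n = len(self.sorted_keys)
        # One representative per bucket: the largest key stored in that bucket.
        self.reps = [
            self.sorted_keys[min(i + bucket_size, n) - 1]
            for i in range(0, n, bucket_size)
        ]

    def _bucket_of(self, key: int) -> Optional[int]:
        # First bucket whose largest key is >= key; no earlier bucket can hold key.
        b = bisect_left(self.reps, key)
        return b if b < len(self.reps) else None

    def lookup(self, key: int) -> bool:
        b = self._bucket_of(key)
        if b is None:
            return False
        start = b * self.bucket_size
        bucket = self.sorted_keys[start:start + self.bucket_size]
        # Post-filter: scan the retrieved bucket for an exact match.
        return key in bucket

    def range_lookup(self, lo: int, hi: int) -> List[int]:
        # Locate the bucket of the lower bound once, then scan forward through
        # consecutive buckets, similar to a B+-tree leaf scan.
        b = self._bucket_of(lo)
        if b is None:
            return []
        out = []
        for k in self.sorted_keys[b * self.bucket_size:]:
            if k > hi:
                break
            if k >= lo:
                out.append(k)
        return out


idx = CoarseBucketIndex([42, 7, 19, 3, 88, 61, 25], bucket_size=3)
print(idx.reps)                  # [19, 61, 88]
print(idx.lookup(42))            # True
print(idx.range_lookup(20, 70))  # [25, 42, 61]
```

In this toy, a larger `bucket_size` shrinks the index (fewer representatives) at the cost of more post-filtering work per lookup, which is the space/time trade-off a coarse-granular index tunes.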

What problem does this paper attempt to address?

The paper addresses three main limitations of RTIndeX (RX), an existing hardware-accelerated index structure on GPUs: significant memory overhead, slow range lookups, and poor updateability. Specifically:

1. **Significant memory overhead**:
   - Each 64-bit integer key is represented by a triangle described by nine 32-bit floating-point numbers (36 bytes for an 8-byte key), resulting in a memory overhead of 78%.
   - Because GPU memory is limited, this high overhead restricts the scenarios in which RX can be applied.
2. **Slow range lookups**:
   - A range lookup in RX requires a large number of BVH traversals and ray-triangle intersection tests, which makes it much slower than in a traditional B+-tree.
   - A B+-tree answers a range query with a single tree traversal to find the lower-bound key, followed by a sequential scan of the leaf nodes.
3. **Poor updateability**:
   - After updates, the lookup performance of RX degrades severely, with slowdowns of up to 78x.
   - This is because updating the BVH only extends the existing bounding volumes, which increases the number of intersection tests required during lookups.

To solve these problems, the authors generalize the fine-grained index RX into a coarse-grained index, cgRX. By grouping individual keys into buckets and post-filtering the retrieved buckets, cgRX significantly reduces memory overhead, speeds up range lookups, and supports more efficient updates.

### Main improvement points

- **Memory efficiency**: cgRX groups multiple keys into a bucket represented by a single triangle, which reduces memory overhead. For a bucket size of 3, for example, the overhead drops from 78% to 48%.
- **Range-lookup performance**: cgRX builds a smaller and more efficient BVH that requires fewer intersection tests, making range lookups up to 2x faster than in the original RX.
- **Update performance**: cgRX supports efficient batch updates that are up to 5.5x faster than rebuilding the index from scratch.

### Summary

The main contribution of this paper is cgRX, a coarse-grained index that addresses the large memory overhead, slow range lookups, and poor updateability of existing hardware-accelerated index structures on GPUs. Experimental results show that cgRX achieves a 1.5-3x higher throughput-to-memory-footprint ratio than comparable baselines supporting range lookups, and that it performs well in range-lookup performance, under varying hit rates and query skew, and in update performance.
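One plausible reading of the 78% memory-overhead figure, assumed here rather than taken from the paper, is the share of each triangle's bytes that does not carry key information: nine 32-bit floats occupy 36 bytes, of which only 8 bytes correspond to the 64-bit key, and (36 - 8) / 36 ≈ 0.78. The snippet below checks that arithmetic and shows how using one triangle per bucket amortizes the triangle cost over the bucket's keys. It ignores how cgRX stores the bucket contents, so it does not attempt to reproduce the 48% figure for a bucket size of 3. The helper names are hypothetical.

```python
FLOAT32_BYTES = 4
KEY_BYTES = 8            # one 64-bit integer key
FLOATS_PER_TRIANGLE = 9  # three vertices, each with (x, y, z)

triangle_bytes = FLOATS_PER_TRIANGLE * FLOAT32_BYTES  # 36 bytes


def triangle_overhead_fraction(tri_bytes: int, key_bytes: int) -> float:
    """Assumption: overhead = share of triangle bytes that do not encode the key."""
    return (tri_bytes - key_bytes) / tri_bytes


def triangle_bytes_per_key(bucket_size: int) -> float:
    """Triangle bytes amortized over the keys of one bucket (one triangle per bucket)."""
    return triangle_bytes / bucket_size


print(triangle_bytes)                                                   # 36
print(round(triangle_overhead_fraction(triangle_bytes, KEY_BYTES), 3))  # 0.778
for b in (1, 3, 16):
    print(b, triangle_bytes_per_key(b))
```

For bucket sizes 1, 3, and 16, the amortized triangle cost falls from 36 to 12 to 2.25 bytes per key, which illustrates why coarser buckets shrink the index.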

More Bang For Your Buck(et): Fast and Space-efficient Hardware-accelerated Coarse-granular Indexing on GPUs
