Abstract: Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation on an Intel Xeon 2.4 GHz processor. We ascribe the less-than-spectacular speedup to a mismatch between the workload characteristics of semantic indexing and the unique architectural features of GPUs. Compared to the regular numerical computations that have been ported to GPUs with great success, our semantic indexing algorithm (the recently proposed Supervised Semantic Indexing algorithm, SSI) has interesting characteristics: the amount of parallelism in each training instance is data-dependent, and each iteration involves the product of a dense matrix with a sparse vector, resulting in random memory access patterns. As a result, we observed that the baseline GPU implementation significantly under-utilizes the hardware resources (processing elements and memory bandwidth) of the GPU platform. However, the SSI algorithm also demonstrates unique characteristics, which we collectively refer to as the "forgiving nature" of the algorithm. These characteristics allow for novel optimizations that do not strive to preserve numerical equivalence of each training iteration with the sequential implementation. In particular, we consider best-effort computing techniques, such as dependency relaxation and computation dropping, to suitably alter the workload characteristics of SSI to leverage the unique architectural features of the GPU. We also show that the realization of dependency relaxation and computation dropping on a GPU is quite different from how one would implement these concepts on a multicore CPU, largely due to the distinct architectural features supported by a GPU.
Our new techniques dramatically increase the amount of parallel workload, leading to much higher performance on the GPU. By optimizing data transfers between CPU and GPU, and by reducing GPU kernel invocation overheads, we achieve further performance gains. We evaluated our new GPU-accelerated implementation of semantic document search on a database of over 1.8 million documents from Wikipedia. By applying our novel performance-enhancing strategies, our GPU implementation on a 128-core Tesla C870 achieved a 5.5X speedup compared to a baseline parallel implementation on the same GPU. Compared to a baseline parallel TBB implementation on a dual-socket quad-core Intel Xeon multicore CPU (8 cores), the enhanced GPU implementation is 11X faster. Compared to a parallel implementation on the same multicore CPU that also uses dependency relaxation and computation dropping techniques, our enhanced GPU implementation is 5X faster.
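
To illustrate the workload property the abstract describes, here is a minimal sketch of a dense-matrix by sparse-vector product in plain Python. It is not the SSI implementation. The function name `dense_times_sparse`, the dictionary encoding of the sparse vector, and the toy values are all assumptions made for illustration. The sketch only shows why the work per instance depends on the data (the number of nonzeros) and why the matrix columns touched are scattered.

```python
from typing import Dict, List


def dense_times_sparse(W: List[List[float]], x: Dict[int, float]) -> List[float]:
    """Compute y = W @ x, where x is sparse and stored as {column index: value}.

    Hypothetical helper for illustration; not taken from the paper.
    """
    y = [0.0] * len(W)
    # Only the columns of W selected by the nonzeros of x are read, so the
    # amount of work varies with the input, and the columns visited are
    # scattered rather than contiguous.
    for col, value in x.items():
        for row in range(len(W)):
            y[row] += W[row][col] * value
    return y


# Toy dense matrix (2 x 4) and a sparse vector with two nonzeros out of four.
W = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.0, -1.0, 2.0],
]
x = {1: 2.0, 3: -1.0}

y = dense_times_sparse(W, x)
print(y)
```

A vector with more nonzeros would touch more columns and do proportionally more work, which is the kind of irregularity that makes a straightforward GPU mapping hard to keep busy.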

SCIPIS: Scalable and Concurrent Persistent Indexing and Search in High-End Computing Systems

Just-in-time Query Retrieval over Partially Indexed Data on Structured P2P Overlays

An Efficient and Compact Indexing Scheme for Large-Scale Data Store.

ParIS+: Data Series Indexing on Multi-Core Architectures

Scalable Top-K Spatial Keyword Search

Indexing multi-dimensional data in a cloud system.

SCISPACE: A Scientific Collaboration Workspace for File Systems in Geo-Distributed HPC Data Centers

S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems

Fast data series indexing for in-memory data

Best-effort semantic document search on GPUs

Coordinate-based Efficient Indexing Mechanism for Intelligent IoT Systems in Heterogeneous Edge Computing

A Scalable Learned Index Scheme in Storage Systems

PABIRS: A Data Access Middleware for Distributed File Systems

Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform

SIMD-Optimized Search Over Sorted Data

Enabling Research through the SCIP Optimization Suite 8.0

Exploring Scientific Application Performance Using Large Scale Object Storage

SCIP: A scalable, reproducible and open-source pipeline for morphological profiling of image cytometry and microscopy data

Benchmarking SciDB Data Import on HPC Systems

SCIP: A scalable, reproducible, and open‐source pipeline for morphological profiling image cytometry and microscopy data

SALI: A Scalable Adaptive Learned Index Framework based on Probability Models