SCIPIS: Scalable and Concurrent Persistent Indexing and Search in High-End Computing Systems

Alexandru Iulian Orhean, Anna Giannakou, Lavanya Ramakrishnan, Kyle Chard, Boris Glavic, Ioan Raicu
DOI: https://doi.org/10.1016/j.jpdc.2024.104878
IF: 4.542
2024-03-27
Journal of Parallel and Distributed Computing
Abstract: While it is now routine to search for data on a personal computer or discover data online, there is no equivalent method for discovering data on the large parallel and distributed file systems commonly deployed on HPC systems. In contrast to web search, which has to deal with a larger number of relatively small files, HPC applications also need efficient indexing of large files. We propose SCIPIS, an indexing and search framework that can exploit the properties of modern high-end computing systems with many-core architectures, multiple NUMA nodes, and multiple NVMe storage devices. SCIPIS supports building and searching persistent TF-IDF indexes, and can deliver orders of magnitude better performance than state-of-the-art approaches. We achieve scalability and performance of indexing by decomposing the indexing process into separate components that can be optimized independently, by building disk-friendly data structures in memory that can be persisted in long sequential writes, and by avoiding communication between indexing threads that collaboratively build an index over a collection of large files. We evaluated SCIPIS with three types of datasets (logs, scientific data, and metadata) on systems with configurations of up to 192 cores, 768 GiB of RAM, 8 NUMA nodes, and 16 NVMe drives, and achieved up to 29x better indexing performance than Apache Lucene while maintaining similar search latency.
computer science, theory & methods
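
The abstract names three design points: independent indexing workers, in-memory structures that can be persisted in long sequential writes, and TF-IDF search over the persisted index. The sketch below is one minimal way to illustrate that shape in Python; it is not the SCIPIS implementation. All function names, the whitespace tokenizer, the tab-separated file layout, and the log(N/df) weighting are assumptions made for illustration. CPython threads also do not run this CPU-bound work in parallel, so the thread pool only shows the structure, not the performance.

```python
import math
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_partial_index(docs):
    """Build an inverted index over one partition of (doc_id, text) pairs.

    Returns {term: [(doc_id, term_frequency), ...]}. The worker touches no
    shared state, so workers never need to communicate (hypothetical helper).
    """
    postings = defaultdict(list)
    for doc_id, text in docs:
        # Assumed tokenizer: lowercase, split on whitespace.
        for term, tf in Counter(text.lower().split()).items():
            postings[term].append((doc_id, tf))
    return dict(postings)


def persist(index, path):
    """Serialize the whole partial index in memory, then write it in one call."""
    lines = []
    for term in sorted(index):
        entries = " ".join(f"{doc_id}:{tf}" for doc_id, tf in index[term])
        lines.append(f"{term}\t{entries}\n")
    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(lines))  # single sequential write


def load(path):
    """Read back a partial index written by persist()."""
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, entries = line.rstrip("\n").split("\t")
            pairs = (entry.rsplit(":", 1) for entry in entries.split(" "))
            index[term] = [(doc_id, int(tf)) for doc_id, tf in pairs]
    return index


def search(term, partial_indexes, total_docs):
    """Rank documents for a single term by tf * log(N / df) across all partitions."""
    postings = [p for idx in partial_indexes for p in idx.get(term, [])]
    if not postings:
        return []
    idf = math.log(total_docs / len(postings))
    scored = [(doc_id, tf * idf) for doc_id, tf in postings]
    return sorted(scored, key=lambda pair: -pair[1])


def index_partitions(partitions):
    """Index each partition in its own worker; results are never merged."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(build_partial_index, partitions))
```

A caller would split the document collection into partitions, pass them to `index_partitions`, write each result with `persist`, and answer queries by handing the loaded partial indexes to `search`. Keeping the partial indexes separate moves the merge cost to query time, which is one possible trade-off and not necessarily the one the paper makes.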