Abstract: Graph analytics are at the heart of a broad range of applications such as drug discovery, page ranking, and recommendation systems. When the graph size exceeds the memory size, out-of-core graph processing is needed. For the widely used external memory graph processing systems, accessing storage becomes the bottleneck. We make the observation that nearly all graph algorithms have a dynamically varying number of active vertices that must be processed in each iteration. However, existing graph processing frameworks, such as GraphChi, load the entire graph in each iteration even if only a small fraction of the graph is active. This limitation is due to the structure of the data storage used by these systems. In this work, we propose to use a compressed sparse row (CSR) based graph storage that is more amenable to selectively loading only a few active vertices in each iteration. However, CSR-based systems suffer from random update propagation to many target vertices. To solve this challenge, we propose a multi-log update mechanism that logs updates separately, rather than directly updating the active edges in a graph. Our proposed multi-log system maintains a separate log for each vertex interval. This separation enables us to efficiently process each vertex interval by loading only the corresponding log. Further, when SSD pages contain little active vertex data, we reduce the read amplification caused by page-granular SSD accesses by logging the active vertex data in the current iteration and efficiently reading the log in the next iteration. Over the current state-of-the-art out-of-core graph processing framework, our PartitionedVC improves performance by up to $17.84\times$, $1.19\times$, $1.65\times$, $1.38\times$, $3.15\times$, and $6.00\times$ for the widely used BFS, PageRank, community detection, graph coloring, maximal independent set, and random-walk applications, respectively.
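
The two ideas in the abstract, reading only the active vertices from a CSR layout and routing updates into one log per vertex interval, can be illustrated with a small in-memory sketch. This is not the PartitionedVC implementation: the function names (`build_csr`, `interval_of`, `bfs_multilog`), the fixed `interval_size`, and the use of Python lists in place of SSD-resident files are all assumptions made for illustration, and BFS is used only because it is one of the applications the abstract names.

```python
from collections import defaultdict


def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from a list of (src, dst) pairs."""
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    for v in range(num_vertices):
        row_ptr[v + 1] += row_ptr[v]
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1].copy()  # next free slot for each source vertex
    for src, dst in edges:
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx


def interval_of(vertex, interval_size):
    """Hypothetical partitioning: fixed-size, contiguous vertex intervals."""
    return vertex // interval_size


def bfs_multilog(num_vertices, edges, source, interval_size):
    """BFS in which updates are appended to per-interval logs.

    Each iteration touches only the intervals that have a non-empty log and
    reads only the CSR rows of vertices that became active.
    """
    row_ptr, col_idx = build_csr(num_vertices, edges)
    dist = [None] * num_vertices

    # One log per vertex interval: interval id -> list of (target, value).
    logs = defaultdict(list)
    logs[interval_of(source, interval_size)].append((source, 0))

    while logs:
        next_logs = defaultdict(list)
        for interval in sorted(logs):
            # Apply the logged updates for this interval only.
            active = []
            for target, value in logs[interval]:
                if dist[target] is None:
                    dist[target] = value
                    active.append(target)
            # Read just the CSR rows of the active vertices and log the
            # outgoing updates instead of writing to the targets directly.
            for v in active:
                for dst in col_idx[row_ptr[v]:row_ptr[v + 1]]:
                    next_logs[interval_of(dst, interval_size)].append(
                        (dst, dist[v] + 1)
                    )
        logs = next_logs
    return dist


if __name__ == "__main__":
    example_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    print(bfs_multilog(6, example_edges, source=0, interval_size=2))
```

In an out-of-core setting, each per-interval log would presumably be a sequentially written region on the SSD, so that processing an interval requires loading only its log; here the dictionary of lists merely stands in for that layout.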

Characterizing the Dilemma of Performance and Index Size in Billion-Scale Vector Search and Breaking It with Second-Tier Memory

ESPN: Memory-Efficient Multi-Vector Information Retrieval

Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment

Search-in-Memory: Reliable, Versatile, and Efficient Data Matching in SSD's NAND Flash Memory Chip for Data Indexing Acceleration

An Efficient and Compact Indexing Scheme for Large-Scale Data Store

SPFresh: Incremental In-Place Update for Billion-Scale Vector Search

Reducing the Storage Overhead of Main-Memory OLTP Databases with Hybrid Indexes

Compact Indexing and Judicious Searching for Billion-Scale Microblog Retrieval

Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval

A Scalable Learned Index Scheme in Storage Systems

Log-Compact R-Tree: An Efficient Spatial Index for SSD

DEX: Scalable Range Indexing on Disaggregated Memory [Extended Version]

Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform

Indexing very high-dimensional sparse and quasi-sparse vectors for similarity searches

PartitionedVC: Partitioned External Memory Graph Analytics Framework for SSDs

MBFGraph: An SSD-Based Analytics System for Evolving Graphs

Optimizing LSM-based indexes for disaggregated memory

Indexing high-dimensional data for efficient in-memory similarity search

Using Bitmap Index to Accelerate Accessing Large Scale Scientific Data on Demand

Improving In-Memory File System Reading Performance by Fine-Grained User-Space Cache Mechanisms