Abstract: Cloud logs can be categorized into on-line, off-line, and near-line logs based on their access frequency. Among them, near-line logs are mainly used for debugging, so low query latency matters for a good user experience. In addition, a storage system for near-line logs should keep the overall cost low, including the storage cost of keeping compressed logs and the computation cost of compressing logs and executing queries. These requirements make it challenging to achieve cloud log storage that is both fast and cheap. This article proposes LogGrep, the first log compression and query tool that exploits both static and runtime patterns to structure and organize log data in fine-grained units. The key idea of LogGrep is "vertical partitioning": it stores each log entry in multiple partitions by first parsing logs into variable vectors according to static patterns and then automatically extracting the runtime pattern(s) within each variable vector. Based on these runtime patterns, LogGrep further decomposes the variable vectors into fine-grained units called "Capsules" and stamps each Capsule with a summary of its values. During query processing, LogGrep uses the extracted runtime patterns and the Capsule stamps to avoid decompressing and scanning Capsules that cannot match the keywords. We further show that interactive debugging can exploit the advantages of the vertical-partitioning-based method while mitigating its weaknesses. To this end, LogGrep integrates incremental locating and partial reconstruction to mitigate the read amplification incurred by the vertical-partitioning-based method. We evaluate LogGrep on 37 cloud logs from the production environment of Alibaba Cloud and from public datasets. The results show that LogGrep can reduce the query latency and the overall cost by an order of magnitude compared with state-of-the-art approaches. These results confirm that it is worthwhile to apply a more sophisticated vertical-partitioning-based method to accelerate queries on compressed cloud logs.
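The vertical-partitioning idea described in the abstract can be illustrated with a toy sketch. This is not LogGrep's implementation: the names `Capsule`, `make_capsule`, `may_contain`, `build`, and `query` are hypothetical, the static pattern is supplied by hand as a regular expression, the stamp (character set plus maximum value length) is an assumed summary, and runtime-pattern extraction, incremental locating, and partial reconstruction are not modeled. The sketch only shows how a per-Capsule stamp lets a keyword query skip decompression.

```python
import re
import zlib
from dataclasses import dataclass


@dataclass
class Capsule:
    """One variable vector, compressed, plus a small summary ("stamp")."""
    payload: bytes        # zlib-compressed, newline-joined values
    charset: frozenset    # assumed stamp: characters seen in any value
    max_len: int          # assumed stamp: length of the longest value


def make_capsule(values):
    """Compress one column of values and record its stamp."""
    joined = "\n".join(values)
    charset = frozenset(joined) - {"\n"}
    return Capsule(zlib.compress(joined.encode()), charset, max(map(len, values)))


def may_contain(capsule, keyword):
    """Stamp check: False guarantees that no value contains the keyword."""
    return len(keyword) <= capsule.max_len and set(keyword) <= capsule.charset


def build(lines, static_pattern):
    """Parse every line with one static pattern; each captured variable
    becomes one vector, stored as its own Capsule (vertical partitioning)."""
    regex = re.compile(static_pattern)
    rows = [regex.fullmatch(line).groups() for line in lines]
    return [make_capsule(list(column)) for column in zip(*rows)]


def query(capsules, keyword):
    """Return (matching line indices, number of Capsules skipped).
    Only Capsules whose stamp admits the keyword are decompressed."""
    hits, skipped = set(), 0
    for capsule in capsules:
        if not may_contain(capsule, keyword):
            skipped += 1
            continue
        values = zlib.decompress(capsule.payload).decode().split("\n")
        hits.update(i for i, value in enumerate(values) if keyword in value)
    return sorted(hits), skipped


lines = [
    "open file /tmp/a.log took 12 ms",
    "open file /tmp/b.log took 340 ms",
    "open file /var/c.log took 7 ms",
]
capsules = build(lines, r"open file (\S+) took (\d+) ms")
print(query(capsules, "var"))  # ([2], 1): the numeric Capsule is skipped
print(query(capsules, "34"))   # ([1], 1): the path Capsule is skipped
```

In this sketch a keyword is matched only inside a single variable value; keywords that span the constant text of the static pattern or several variables would need additional handling.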

Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines

Towards Optimizing Storage Costs on the Cloud

Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

A Clustered Dwarf Structure to Speed Up Queries on Data Cubes

Scalable Data Partitioning Techniques for Distributed Data Processing in Cloud Environments: A Review

Optimizing Data Migration Using Online Clustering

A Moveable Beast: Partitioning Data and Compute for Computational Storage

Partition-based Data Cube Storage and Parallel Queries for Cloud Computing

Moving big data to the cloud

Cost-Based Optimization Of Logical Partitions For A Query Workload In A Hadoop Data Warehouse

Cloud-of-Clouds Storage Made Efficient: A Pipeline-Based Approach

Rethinking the Cloudonomics of Efficient I/O for Data-Intensive Analytics Applications

Moving Big Data to The Cloud: An Online Cost-Minimizing Approach

Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning Techniques

Applying Delta Compression to Packed Datasets for Efficient Data Reduction

Exploiting Data-pattern-aware Vertical Partitioning to Achieve Fast and Low-cost Cloud Log Storage

A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

Droplet: A Distributed Solution of Data Deduplication

An Adaptive Data Partitioning Scheme For Accelerating Exploratory Spark Sql Queries

Secure Data Processing in a Hybrid Cloud

A survey of data partitioning and sampling methods to support big data analysis