Abstract: Due to their size and complexity, massive data sets pose many computational challenges for statistical analysis, such as overcoming memory limitations and improving the computational efficiency of traditional statistical methods. In this dissertation, I propose the statistical aggregation strategy to address these challenges. Statistical aggregation partitions the entire data set into smaller subsets, compresses each subset into low-dimensional summary statistics, and aggregates the summary statistics to approximate the desired computation based on the entire data set. Results from statistical aggregation are required to be asymptotically equivalent to those computed from the entire data set. Because statistical aggregation processes the data part by part, it overcomes the memory limitation. Moreover, statistical aggregation can improve the computational efficiency of statistical algorithms whose computational complexity is of order O(N^m) (m > 1) or higher, where N is the size of the data. Statistical aggregation is particularly useful for online analytical processing (OLAP) in data cubes and stream data, where fast response to queries is the top priority. The "partition-compression-aggregation" strategy has been considered previously for OLAP computing in data cubes, but existing research in this area tends to overlook the statistical properties of the analysis and aims to obtain identical results from aggregation, which has limited the application of this strategy to very simple analyses. Statistical aggregation can instead support OLAP in more sophisticated statistical analyses. In this dissertation, I apply statistical aggregation to two large families of statistical methods, estimating equation (EE) estimation and U-statistics, develop proper compression-aggregation schemes, and show that statistical aggregation greatly reduces their computational burden while maintaining their statistical efficiency. I further apply statistical aggregation to U-statistic based estimating equations and propose new estimating equations that need much less computational time but give asymptotically equivalent estimators.
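The partition-compression-aggregation pattern described in the abstract can be illustrated with a minimal Python sketch. It uses simple linear regression, the easy case in which aggregation reproduces the full-data result exactly; it is not one of the EE or U-statistic schemes developed in the dissertation. The names `Summary`, `compress`, `aggregate`, `fit`, and `partition`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Low-dimensional summary of one subset (hypothetical structure)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def compress(block):
    """Compress one subset of (x, y) pairs into five summary statistics."""
    s = Summary()
    for x, y in block:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def aggregate(summaries):
    """Aggregate per-subset summaries by adding them component-wise."""
    total = Summary()
    for s in summaries:
        total.n += s.n
        total.sx += s.sx
        total.sy += s.sy
        total.sxx += s.sxx
        total.sxy += s.sxy
    return total


def fit(s):
    """Least-squares (intercept, slope) computed from a summary alone."""
    denom = s.n * s.sxx - s.sx ** 2
    slope = (s.n * s.sxy - s.sx * s.sy) / denom
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


def partition(data, size):
    """Yield consecutive subsets of at most `size` observations."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


# Toy data on the line y = 2x + 1 (an assumption for illustration only).
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]

# Only one subset needs to be held in memory at a time.
chunked = fit(aggregate(compress(b) for b in partition(data, 3)))
full = fit(compress(data))
```

Here `chunked` and `full` agree because the least-squares fit depends on the data only through additive sums. For analyses without such additive sufficient statistics, the abstract's requirement is weaker: the aggregated result only needs to be asymptotically equivalent to the full-data result.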
