Abstract: The excessive computational cost of learning from large-scale and streaming data can be alleviated by using stochastic algorithms, such as stochastic gradient descent and its variants. Recent advances have improved the convergence speed, adaptivity, and structural awareness of stochastic algorithms. However, the distributional aspects of these new algorithms are poorly understood, especially for structured parameters. To develop statistical inference in this case, we propose a class of generalized regularized dual averaging (gRDA) algorithms with constant step size, which improves on RDA (Xiao, 2010; Flammarion and Bach, 2017). Weak convergence of gRDA trajectories is studied, and as a consequence, for the first time in the literature, the asymptotic distributions for online ℓ1-penalized problems become available. These general results apply to both convex and non-convex differentiable loss functions and, in particular, recover the existing regret bound for convex losses (Nemirovski et al., 2009). As important applications, statistical inferential theory for online sparse linear regression and online sparse principal component analysis is developed and is supported by extensive numerical analysis. Interestingly, when gRDA is properly tuned, support recovery and a central limiting distribution (with mean zero) hold simultaneously in the online setting, in contrast with the biased central limiting distribution of the batch Lasso (Knight and Fu, 2000). Technical devices, including weak convergence of stochastic mirror descent, are developed as by-products of independent interest. Preliminary empirical analysis of modern image data shows that learning very sparse deep neural networks by gRDA does not necessarily sacrifice testing accuracy.
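For readers unfamiliar with dual averaging, the following is a minimal sketch of an ℓ1-regularized dual averaging update with a constant step size, applied to online sparse linear regression with a squared loss. It is not the gRDA algorithm of the paper, whose exact update and tuning are not given in the abstract. The function names, the quadratic proximal term, and the parameters `step` and `lam` are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(v, tau):
    # Elementwise soft-thresholding: shrink each entry of v toward zero by tau.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_dual_averaging(X, y, step=0.01, lam=0.05):
    """One pass over (X, y), one sample per iteration (hypothetical helper).

    Assumed update, with G the running sum of stochastic gradients:
        w_{t+1} = argmin_w { <G, w> + (t+1)*lam*||w||_1 + ||w||^2 / (2*step) }
    whose closed form is soft_threshold(-step * G, step * lam * (t+1)).
    """
    n, d = X.shape
    w = np.zeros(d)
    grad_sum = np.zeros(d)
    for t in range(n):
        residual = X[t] @ w - y[t]      # prediction error on sample t
        grad_sum += residual * X[t]     # gradient of 0.5 * residual**2 w.r.t. w
        w = soft_threshold(-step * grad_sum, step * lam * (t + 1))
    return w

# Synthetic demo: 5 features, only the first two are active (illustrative data).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((5000, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ w_true + 0.1 * rng.standard_normal(5000)
w_hat = l1_dual_averaging(X_demo, y_demo)
print(np.round(w_hat, 3))
```

Because the threshold grows with the iteration count while the accumulated gradient noise on inactive coordinates grows more slowly, the iterate can contain exact zeros, unlike plain stochastic gradient descent. The fixed penalty `lam` used here is a simplification and should not be expected to reproduce the mean-zero limiting behavior that the abstract attributes to properly tuned gRDA.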

Learning by Extrapolation from Marginal to Full-Multivariate Probability Distributions: Decreasingly Naive Bayesian Classification

Self-Adaptive Attribute Value Weighting for Averaged One-Dependence Estimators

Alleviating the Attribute Conditional Independence and I.I.D. Assumptions of Averaged One-Dependence Estimator by Double Weighting

Averaged Tree-Augmented One-Dependence Estimators

Attribute Value Weighted Average of One-Dependence Estimators

A Generic Ensemble Approach to Estimate Multidimensional Likelihood in Bayesian Classifier Learning

Semi-naive Exploitation of One-Dependence Estimators

Sample-Based Attribute Selective AnDE for Large Data

Alleviating the independence assumptions of averaged one-dependence estimators by model weighting

Bagging K-Dependence Bayesian Network Classifiers

Selective AnDE for Large Data Learning: a Low-Bias Memory Constrained Approach

Inference for High-Dimensional Linear Expectile Regression with De-Biasing Method

General and Local: Averaged K-Dependence Bayesian Classifiers

Attribute Value Weighted Averaged One-Dependence Estimators with Kullback–Leibler Divergence

Efficient heuristics for learning scalable Bayesian network classifier from labeled and unlabeled data

Model Weighting for One-Dependence Estimators by Measuring the Independence Assumptions

Extracting Credible Dependencies for Averaged One-Dependence Estimator Analysis

A Bayes Risk Minimization Machine for Example-Dependent Cost Classification

Nonparametric Bayes Classification via Learning of Affine Subspaces

Discriminatory Target Learning: Mining Significant Dependence Relationships from Labeled and Unlabeled Data

A generalization of regularized dual averaging and its dynamics