Aster Models with Random Effects via Penalized Likelihood
C. Geyer,C. Ridley,R. Latta,Julie R. Etterson,R. G. Shaw
2012-10-09
Abstract:This technical report works out details of approximate maximum likelihood estimation for aster models with random effects. Fixed and random effects are estimated by penalized log likelihood. Variance components are estimated by integrating out the random effects in the Laplace approximation of the complete data likelihood following Breslow and Clayton (1993), which can be done analytically, and maximizing the resulting approximate missing data likelihood. A further approximation treats the second derivative matrix of the cumulant function of the exponential family where it appears in the approximate missing data log likelihood as a constant (not a function of parameters). Then first and second derivatives of the approximate missing data log likelihood can be done analytically. Minus the second derivative matrix of the approximate missing data log likelihood is treated as approximate Fisher information and used to estimate standard errors. 1 Theory Aster models (Geyer, Wagenius and Shaw, 2007; Shaw, Geyer, Wagenius, Hangelbroek, and Etterson, 2008) have attracted much recent attention. Several researchers have raised the issue of incorporating random effects in aster models, and we do so here. 1.1 Complete Data Log Likelihood Although we are particularly interested in aster models (Geyer et al., 2007), our theory works for any exponential family model. The log likelihood can be written l(φ) = yφ− c(φ), where y is the canonical statistic vector, φ is the canonical parameter vector, and the cumulant function c satisfies μ(φ) = Eφ(y) = c ′(φ) (1) W (φ) = varφ(y) = c ′′(φ) (2) where c′(φ) denotes the vector of first partial derivatives and c′′(φ) denotes the matrix of second partial derivatives. We assume a canonical affine submodel with random effects determined by φ = a+Mα+ Zb, (3) where a is a known vector, M and Z are known matrices, b is a normal random vector with mean vector zero and variance matrix D. The vector a is called the offset vector and the matrices M and Z are called the model matrices for fixed and random effects, respectively, in the terminology of the R function glm. (The vector a is called the origin in the terminology of the R function aster. Design matrix is alternative terminology for model matrix.) The matrix D is assumed to be diagonal, so the random effects are independent random variables. The diagonal components of D are called variance components in the classical terminology of random effects models (Searle et al., 1992). Typically the components of b are divided into blocks having the same variance (Searle et al., 1992, Section 6.1), so there are only a few variance components but many random effects, but nothing in this document uses this fact. The unknown parameter vectors are α and ν, where ν is the vector of variance components. Thus D is a function of ν, although this is not indicated by the notation. The “complete data log likelihood” (i. e., what the log likelihood would be if the random effect vector b were observed) is lc(α, b, ν) = l(a+Mα+ Zb)− 12b TD−1b− 12 log det(D) (4) in case none of the variance components are zero. We deal with the case of zero variance components in Sections 1.9, 1.10, and 1.11 below. 1.2 Missing Data Likelihood Ideally, inference about the parameters should be based on the missing data likelihood, which is the complete data likelihood with random effects b integrated out Lm(α, ν) = ∫ ec db (5) Maximum likelihood estimates (MLE) of α and ν are the values that maximize (5). However MLE are hard to find. The integral in (5) cannot be done analytically, nor can it be done by numerical integration except in very simple cases. There does exist a large literature on doing such integrals by ordinary or Markov chain Monte Carlo (Penttinen, 1984; Thompson and Guo, 1991; Geyer and Thompson, 1992; Geyer, 1994; Shaw, Promislow, Tatar, Hughes, and Geyer, 1999; Shaw, Geyer and Shaw, 2002; Booth and Hobert, 1999; Sung and Geyer, 2007; Hunter, Handcock, Butts, Goodreau and Morris, 2008; Okabayashi and Geyer, 2011; Hummel, Hunter and Handcock, to appear), but these methods take a great deal of computing time and are difficult for ordinary users to apply. We wish to avoid that route if at all possible. 1.3 A Digression on Minimization The theory of constrained optimization (Section 1.10 below) has a bias in favor of minimization rather than maximization. The explication below will be simpler if we switch now from maximization to minimization (minimizing minus the log likelihood) rather than switch later. 1.4 Laplace Approximation Breslow and Clayton (1993) proposed to replace the integrand in (5) by its Laplace approximation, which is proportional to a normal probability density function so the random effects can be integrated out analytically. Let b∗ denote the result of maximizing (4) considered as a function of b for fixed α and ν. Then − logLm(α, ν) is approximated by q(α, ν) = 12 log det[κ ′′(b∗)] + κ(b∗)