Revision 1211
Tue Jan 31 00:25:18 2006 UTC by bates
File size: 18274 byte(s)
More on implementation
\documentclass[12pt]{article}
\usepackage{Sweave}
\usepackage{myVignette}
\usepackage[authoryear,round]{natbib}
\bibliographystyle{plainnat}
\DefineVerbatimEnvironment{Sinput}{Verbatim}
{formatcom={\vspace{-2.5ex}},fontshape=sl,
  fontfamily=courier,fontseries=b, fontsize=\scriptsize}
\DefineVerbatimEnvironment{Soutput}{Verbatim}
{formatcom={\vspace{-2.5ex}},fontfamily=courier,fontseries=b,%
  fontsize=\scriptsize}
%%\VignetteIndexEntry{Implementation Details}
%%\VignetteDepends{Matrix}
%%\VignetteDepends{lattice}
%%\VignetteDepends{lme4}
\begin{document}
\SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,strip.white=TRUE}
\SweaveOpts{prefix=TRUE,prefix.string=figs/Example,include=FALSE}
\setkeys{Gin}{width=\textwidth}
\title{Linear mixed model implementation in lme4}
\author{Douglas Bates\\Department of Statistics\\University of Wisconsin -- Madison\\\email{Bates@wisc.edu}}
\maketitle
\begin{abstract}
  Expressions for the evaluation of the profiled log-likelihood or
  profiled log-restricted-likelihood of a linear mixed model, the
  gradients and Hessians of these criteria, and update steps for an
  ECME algorithm to optimize these criteria are given in Bates and
  DebRoy (2004).  In this paper we generalize those formulae and
  describe the representation of mixed-effects models using sparse
  matrix methods available in the \code{Matrix} package.
\end{abstract}
<<preliminaries,echo=FALSE,print=FALSE>>=
library(lattice)
library(Matrix)
library(lme4)
#data("Early", package = "mlmRev")
options(width=80, show.signif.stars = FALSE,
        lattice.theme = function() canonical.theme("pdf", color = FALSE))
@
\section{Introduction}
\label{sec:Intro}

General formulae for the evaluation of the profiled log-likelihood and
profiled log-restricted-likelihood in a linear mixed model are given in
\citet{bate:debr:2004}.  Here we describe a more general formulation of
the model using sparse matrix decompositions and the implementation of
these methods in the \code{lmer} function for \RR{}.
In \S\ref{sec:Form} we describe the form and representation of the
model.  The calculation of the criteria to be optimized by the
parameter estimates and related quantities is discussed in
\S\ref{sec:Cholesky}.  Details of the calculation of the ECME step and
the evaluation of the gradients of the criteria are given in
\S\ref{sec:ECME} and those of the Hessian in \S\ref{sec:Hessian}.  In
\S\ref{sec:Unconstrained} we give the details of an unconstrained
parameterization for the model and the transformation of our results
to this parameterization.

\section{Form and representation of the model}
\label{sec:Form}

We consider linear mixed models of the form
\begin{equation}
  \label{eq:lmeGeneral}
  \by=\bX\bbeta+\bZ\bb+\beps\quad
  \beps\sim\mathcal{N}(\bzer,\sigma^2\bI),
  \bb\sim\mathcal{N}(\bzer,\bSigma),
  \beps\perp\bb
\end{equation}
where $\by$ is the $n$-dimensional response vector, $\bX$ is an
$n\times p$ model matrix for the $p$-dimensional fixed-effects vector
$\bbeta$, $\bZ$ is the $n\times q$ model matrix for the
$q$-dimensional random-effects vector $\bb$, which has a Gaussian
distribution with mean $\bzer$ and variance-covariance matrix
$\bSigma$, and $\beps$ is the random noise, assumed to have a
spherical Gaussian distribution.  The symbol $\perp$ indicates
independence of random variables.  We assume that $\bX$ has full
column rank and that $\bSigma$ is positive definite.

\subsection{Structure of the variance-covariance matrix}
\label{sec:sigmaStructure}

Components of the random-effects vector $\bb$ and portions of its
variance-covariance matrix $\bSigma$ are associated with $k$ grouping
factors $\bbf_i, i=1,\dots,k$, each of length $n$, and with the
$n_i, i = 1,\dots,k$ levels of each of the grouping factors.  In
general there are $q_i$ components of $\bb$ associated with each of
the $n_i$ levels of the grouping factor $\bbf_i, i = 1,\dots,k$.
Thus
\begin{equation}
  \label{eq:qdef}
  q = \sum_{i=1}^k n_i q_i .
\end{equation}
We assume that the components of $\bb$ and the rows and columns of
$\bSigma$ are ordered according to the $k$ grouping factors and,
within the block for the $i$th grouping factor, according to the $n_i$
levels of that factor.

Random effects associated with different grouping factors are
independent.  This implies that $\bSigma$ is block-diagonal with $k$
diagonal blocks of orders $n_i q_i, i=1,\dots,k$.  Random effects
associated with different levels of the same grouping factor are also
independent.  This implies that the $i$th (outer) diagonal block of
$\bSigma$ is itself block diagonal in $n_i$ blocks of order $q_i$.  We
say that the structure of $\bSigma$ is block/block diagonal.  Finally,
the variance-covariance matrix within each of the $q_i$-dimensional
subvectors of $\bb$ associated with one of the $n_i$ levels of
grouping factor $\bbf_i, i=1,\dots,k$ is a constant (but unknown)
positive-definite symmetric $q_i\times q_i$ matrix
$\bSigma_i,i=1,\dots,k$.  This implies that each of the $n_i$ inner
diagonal blocks of order $q_i$ is a copy of $\bSigma_i$.  We say that
$\bSigma$ has a \emph{repeated block/block diagonal} structure.

\subsection{The relative precision matrix}
\label{sec:relativePrecision}

Many of the computational formulae are more conveniently expressed in
terms of $\bSigma^{-1}$, called the \emph{precision} matrix of the
random effects, than in terms of $\bSigma$.  In fact, the formulae are
most conveniently expressed in terms of the \emph{relative precision
  matrix} $\sigma^2\bSigma^{-1}$, which we write as $\bOmega$.  That
is,
\begin{equation}
  \label{eq:relPrec}
  \bOmega = \sigma^2\bSigma^{-1} .
\end{equation}
This is called the ``relative'' precision because it is the precision
of $\bb$ ($\bSigma^{-1}$) relative to the precision of $\beps$
($\sigma^{-2}\bI$).  It is easy to establish that $\bOmega$ has a
repeated block/block diagonal structure like that of $\bSigma$.
That is, $\bOmega$ consists of $k$ outer diagonal blocks of sizes
$n_i q_i, i = 1,\dots,k$, and the $i$th outer diagonal block is itself
block diagonal with $n_i$ inner blocks of size $q_i\times q_i$.
Furthermore, each of the inner diagonal blocks in the $i$th outer
block is a copy of the $q_i\times q_i$ positive-definite, symmetric
matrix $\bOmega_i$.

Because $\bOmega$ has a repeated block/block structure we can define
the entire matrix by specifying the symmetric matrices
$\bOmega_i, i = 1,\dots,k$ and, because of the symmetry, $\bOmega_i$
has at most $q_i(q_i+1)/2$ distinct elements.  We write $\btheta$ for
a parameter vector of length at most $\sum_{i=1}^{k}q_i(q_i+1)/2$ that
determines $\bOmega$.  For example, we could use the non-redundant
elements of the $\bOmega_i$ as the components of $\btheta$.  In fact
we use a different, but equivalent, parameterization for reasons
discussed later.

We only need to store the matrices $\bOmega_i,i=1,\dots,k$ and the
numbers of levels of the grouping factors to be able to create
$\bOmega$.  The matrices $\bOmega_i$ are stored in the \code{Omega}
slot of an object of class \code{"lmer"}.  The values of $k$ and
$n_i, i=1,\dots,k$ can be determined from the list of the grouping
factors themselves, stored as the \code{flist} slot, or from the
dimensions $q_i,i=1,\dots,k$, stored in the \code{nc} slot, and the
group pointers, stored in the \code{Gp} slot.  Successive differences
of the group pointers are the total numbers of components of $\bb$
associated with each grouping factor.  That is, these differences are
$n_i q_i,i=1,\dots,k$.  The first element of the \code{Gp} slot is
always 0.
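The repeated block/block diagonal structure can be illustrated directly with the \code{Matrix} package.  The sketch below is an illustration of the structure only, not of the \code{lmer} internals; the block \code{Omega1} and the dimensions are invented, echoing the shape of model \code{Sm1} ($k=1$, $n_1=18$, $q_1=2$):

```r
library(Matrix)

## Hypothetical q1 x q1 positive-definite block for a single grouping
## factor with n1 = 18 levels and q1 = 2 random effects per level.
n1 <- 18
Omega1 <- matrix(c(2,   0.5,
                   0.5, 1  ), nrow = 2)

## Omega is block diagonal with n1 identical copies of Omega1
## (equivalently, kronecker(Diagonal(n1), Omega1)):
Omega <- bdiag(rep(list(Omega1), n1))

## With k > 1 grouping factors the outer blocks would be concatenated:
## Omega <- bdiag(c(rep(list(Omega1), n1), rep(list(Omega2), n2), ...))
dim(Omega)   # 36 x 36, i.e. q = n1 * q1
```

The sparse result stores only the $n_1 q_1^2$ structurally nonzero elements, which is what makes the block/block structure cheap to hold even when $q$ is large.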
\subsection{Examples}
\label{sec:OmegaExamp}

Consider the fitted models
<<fmi>>=
Sm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
data(Chem97, package = "mlmRev")
Cm1 <- lmer(score ~ gcsescore + (1|school) + (1|lea), Chem97,
            control = list(niterEM = 0, gradient = FALSE))
data(star, package = "mlmRev")
Mm1 <- lmer(math ~ gr+sx*eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
            control = list(niterEM = 0, gradient = FALSE))
@
Model \code{Sm1} has a single grouping factor with $18$ levels.  The
\code{Omega} slot is a list of length one containing a $2\times 2$
symmetric matrix.
<<Sm1grp>>=
str(Sm1@flist)
show(Sm1@Omega)
show(Sm1@nc)
show(Sm1@Gp)
diff(Sm1@Gp)/Sm1@nc
@
Model \code{Cm1} has two grouping factors: the \code{school} factor
with 2410 levels and the \code{lea} factor (local education authority
-- similar to a school district in the U.S.A.) with 131 levels.  It
happens that the \code{school} factor is nested within the \code{lea}
factor, a property that we discuss below.  The \code{Omega} slot is a
list of length two containing two $1\times 1$ symmetric matrices.
<<Cm1grp>>=
str(Cm1@flist)
show(Cm1@Omega)
show(Cm1@nc)
show(Cm1@Gp)
diff(Cm1@Gp)/Cm1@nc
@
Model \code{Mm1} has three grouping factors: \code{id} (the student)
with 10732 levels, \code{tch} (the teacher) with 1374 levels and
\code{sch} (the school) with 80 levels.  The \code{Omega} slot is a
list of length three containing two $2\times 2$ symmetric matrices and
one $1\times 1$ matrix.
<<Mm1grp>>=
str(Mm1@flist)
show(Mm1@Omega)
show(Mm1@nc)
show(Mm1@Gp)
diff(Mm1@Gp)/Mm1@nc
@
The last element of the \code{Gp} slot is the dimension of $\bb$.
Notice that for model \code{Mm1} the dimension of $\bb$ is 22,998.
This is also the order of the symmetric matrix $\bOmega$, although the
contents of the matrix are determined by $\btheta$, which has length
$3+1+3=7$ in this case.  Table~\ref{tbl:dims} summarizes some of the
dimensions in these examples.
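The dimension bookkeeping can be checked directly from (\ref{eq:qdef}).  In this sketch the vectors \code{ni} and \code{qi} are transcribed by hand from the description of model \code{Mm1} above, not extracted from the fitted object:

```r
## Levels (n_i) and random effects per level (q_i) for model Mm1:
## id (student), tch (teacher) and sch (school) grouping factors.
ni <- c(id = 10732, tch = 1374, sch = 80)
qi <- c(id = 2,     tch = 1,    sch = 2)

q      <- sum(ni * qi)            # dimension of b, and order of Omega
ntheta <- sum(qi * (qi + 1) / 2)  # maximal length of theta

## Group pointers as in the Gp slot: 0 followed by cumsum(ni * qi),
## so successive differences recover ni * qi.
Gp <- c(0, cumsum(ni * qi))

c(q = q, ntheta = ntheta)   # 22998 and 7
```

Dividing the successive differences of \code{Gp} by \code{qi} recovers \code{ni}, which is exactly the \code{diff(Mm1@Gp)/Mm1@nc} computation shown in the chunk above.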
\begin{table}[tb]
  \centering
  \begin{tabular}{r r r r r r r r r r r}
    \hline\\
    \multicolumn{1}{c}{Name}&\multicolumn{1}{c}{$n$}&
    \multicolumn{1}{c}{$k$}&
    \multicolumn{1}{c}{$n_1$}&\multicolumn{1}{c}{$q_1$}&
    \multicolumn{1}{c}{$n_2$}&\multicolumn{1}{c}{$q_2$}&
    \multicolumn{1}{c}{$n_3$}&\multicolumn{1}{c}{$q_3$}&
    \multicolumn{1}{c}{$q$}&\multicolumn{1}{c}{$\#(\btheta)$}\\
    \hline
    \code{Sm1}&  180&1&   18&2&    & &  & &    36&3\\
    \code{Cm1}&31022&2& 2410&1& 131&1&  & &  2541&2\\
    \code{Mm1}&24578&3&10732&2&1374&1&80&2&22998&7\\
    \hline
  \end{tabular}
  \caption{Summary of various dimensions of model fits used as examples.}
  \label{tbl:dims}
\end{table}

\subsection{Permutation of the random-effects vector}
\label{sec:permutation}

For most mixed-effects model fits, the model matrix $\bZ$ for the
random-effects vector $\bb$ is large and sparse.  That is, most of the
entries in $\bZ$ are zero (by design, not by accident).  Numerical
analysts have developed special techniques for representing and
manipulating sparse matrices.  Of particular importance to us are
techniques for obtaining the left Cholesky factor $\bL(\btheta)$ of a
large, sparse, positive-definite, symmetric matrix.  In particular, we
want to obtain the Cholesky factorization of
$\bZ\trans\bZ+\bOmega(\btheta)$ for different values of $\btheta$.

Sparse matrix operations are typically performed in two phases: a
\emph{symbolic phase}, in which the number of non-zero elements in the
result and their positions are determined, followed by a \emph{numeric
  phase}, in which the actual numeric values are calculated.  Advanced
sparse Cholesky factorization routines, such as the CHOLMOD library
(Davis, 2005) which we use, allow for calculation of a fill-reducing
permutation of the rows and columns during the symbolic phase.
In fact the CHOLMOD code allows for evaluation of both a fill-reducing
permutation and a post-ordering that groups the columns of $\bL$ into
blocks with identical row structure, thus allowing dense matrix
techniques to be used on the blocks or ``super-nodes''.  Such a
decomposition is called a \emph{supernodal} Cholesky factorization.

Because the number and positions of the nonzeros in
$\bOmega(\btheta)$ do not change with $\btheta$ and the nonzeros in
$\bOmega(\btheta)$ are a subset of the nonzeros in $\bZ\trans\bZ$, we
only need to perform the symbolic phase once, and we can do so using
$\bZ\trans\bZ$.  That is, using $\bZ$ only we can determine the
permutation matrix $\bP$ for all decompositions of the form
\begin{equation}
  \label{eq:Cholesky}
  \bP\left[\bZ\trans\bZ+\bOmega(\btheta)\right]\bP\trans =
  \bL(\btheta)\bL(\btheta)\trans .
\end{equation}
We revise (\ref{eq:lmeGeneral}) by incorporating the permutation to
obtain
\begin{equation}
  \label{eq:lmePerm}
  \by=\bX\bbeta+\bZ\bP\trans\bP\bb+\beps\quad
  \beps\sim\mathcal{N}(\bzer,\sigma^2\bI),
  \bb\sim\mathcal{N}(\bzer,\bSigma),
  \beps\perp\bb
\end{equation}

\subsection{Extent of the sparsity}
\label{sec:sparsity}

Table~\ref{tbl:sparse} shows the extent of the sparsity of the
matrices $\bZ$, $\bZ\trans\bZ$ and $\bL$ in our examples.  The matrix
$\bL$ is the supernodal representation of the left Cholesky factor of
$\bP\left(\bZ\trans\bZ+\bOmega\right)\bP\trans$.  Because the
fill-reducing permutation $\bP$ has been applied, the number of
nonzeros in $\bL$ will generally be less than the number of nonzeros
in the left Cholesky factor of $\bZ\trans\bZ+\bOmega$.  However, when
any supernodes of $\bL$ contain more than one column there will be
elements above the diagonal of $\bL$ stored, and these elements are
necessarily zero.  They are stored in the supernodal factorization so
that the diagonal block for a supernode can be treated as a dense
rectangular matrix.
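The one-symbolic-phase, many-numeric-phases pattern behind (\ref{eq:Cholesky}) can be sketched with the \code{Matrix} package, which interfaces CHOLMOD.  This is an illustration, not the \code{lmer} code: the simulated $\bZ$ uses two invented crossed grouping factors with intercept random effects, and $\bOmega(\btheta)$ is taken, for simplicity, to be a multiple of the identity so that \code{update()} can supply it via its \code{mult} argument:

```r
library(Matrix)

set.seed(1)
## Simulated Z: 60 observations, two crossed grouping factors with one
## (intercept) random effect each, giving q = 6 + 4 = 10 columns.
f <- factor(sample(6, 60, replace = TRUE))
g <- factor(sample(4, 60, replace = TRUE))
Z <- cbind2(t(fac2sparse(f)), t(fac2sparse(g)))
ZtZ <- crossprod(Z)                  # sparse symmetric Z'Z

## Symbolic and numeric phases together, with a fill-reducing
## permutation; Imult = 1 makes the factored matrix Z'Z + 1 * I.
L <- Cholesky(ZtZ, perm = TRUE, super = TRUE, Imult = 1)

## For a new theta -- here Omega = 2.5 * I -- only the numeric phase is
## redone; the symbolic analysis stored in L is reused.
L <- update(L, ZtZ, mult = 2.5)

## L now factors Z'Z + 2.5 * I and can be used to solve linear systems:
b <- rep(1, ncol(ZtZ))
x <- solve(L, b, system = "A")       # solves (Z'Z + 2.5 I) x = b
```

In \code{lmer} the matrix added at each iteration is the full repeated block/block diagonal $\bOmega(\btheta)$ rather than a scalar multiple of $\bI$, but the division of labor is the same: one symbolic analysis of $\bZ\trans\bZ$, then a cheap numeric refactorization for each new $\btheta$.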
Generally, storing these few elements from above the diagonal is a
minor expense compared to the speed savings of using dense matrix
operations and being able to take advantage of accelerated level-3
BLAS (Basic Linear Algebra Subroutines) code.  Furthermore, they are
never used in computation because the diagonal block is treated as a
densely stored lower triangular matrix.  We mention this so the
interested reader can understand why, in \code{Sm1} for example,
$\nz(\bZ\trans\bZ)$ is 54 ($=18\times 3$) yet $\nz(\bL)$ is 72
($=18\times 4$).  In this example there is no fill-in, so the number
of nonzeros in $\bL$ is actually 54, but the matrix is stored as 18
lower-triangular blocks that are each $2\times 2$.

\begin{table}[tb]
  \centering
  \begin{tabular}{r r r r r r r r r r}
    \hline\\
    \multicolumn{1}{c}{Name}&\multicolumn{1}{c}{$n$}&
    \multicolumn{1}{c}{$k$}&
    \multicolumn{1}{c}{$q$}&
    \multicolumn{1}{c}{$\nz(\bZ)$}&\multicolumn{1}{c}{$\spr(\bZ)$}&
    \multicolumn{1}{c}{$\nz(\bZ\trans\bZ)$}&\multicolumn{1}{c}{$\spr(\bZ\trans\bZ)$}&
    \multicolumn{1}{c}{$\nz(\bL)$}&\multicolumn{1}{c}{$\spr(\bL)$}\\
    \hline
    \code{Sm1}& 180&1&36&360&0.0556&54&0.0811&72&0.1081\\
    \hline
  \end{tabular}
  \caption{Summary of the sparsity of the model matrix $\bZ$, its
    crossproduct matrix $\bZ\trans\bZ$ and the left Cholesky factor
    $\bL$ in the examples.  The notation $\nz{}$ indicates the number
    of nonzeros in the matrix and $\spr{}$ indicates the sparsity
    index (the fraction of the elements in the matrix that are
    non-zero).  Because $\bZ\trans\bZ$ is symmetric, only the nonzeros
    in the upper triangle are counted and the sparsity index is
    relative to the total number of elements in the upper triangle.
    The number of nonzeros and the sparsity index of $\bL$ are
    calculated for the supernodal decomposition, which can include
    some elements that are above the diagonal and hence must be zero.}
  \label{tbl:sparse}
\end{table}

\section{Likelihood and restricted likelihood}
\label{sec:likelihood}

In general the \emph{maximum likelihood estimates} of the parameters
in a statistical model are those values of the parameters that
maximize the likelihood function.  The likelihood has the same
numerical value as the probability density of $\by$ given the
parameters, but it is regarded as a function of the parameters given
$\by$, not as a function of $\by$ given the parameters.  For model
(\ref{eq:lmePerm}) the parameters are $\bbeta$, $\sigma^2$ and
$\btheta$ (as described in \S\ref{sec:relativePrecision}, $\btheta$
and $\sigma^2$ jointly determine $\bSigma$), so we evaluate the
likelihood $L(\bbeta, \sigma^2, \btheta|\by)$ as
\begin{equation}
  \label{eq:Likelihood1}
  L(\bbeta, \sigma^2, \btheta|\by) =
  f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)
\end{equation}
where $f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)$
is the marginal probability density of $\by$ given the parameters.

Because we will need to write several different marginal and
conditional probability densities in this section, and expressions
like $f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)$
are difficult to read, we adopt a convention sometimes used in the
Bayesian inference literature: a conditional expression in square
brackets indicates the probability density of the quantity on the left
of the $|$ given the quantities on the right of the $|$.
That is,
\begin{equation}
  \label{eq:Bayesnotation}
  \left[\by|\bbeta,\sigma^2,\btheta\right]=
  f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta) .
\end{equation}
Model (\ref{eq:lmePerm}) specifies the conditional distributions
\begin{equation}
  \label{eq:ycond}
  \left[\by|\bbeta,\sigma^2,\bb\right]=
  \frac{\exp\left\{-\|\by-\bX\bbeta-\bZ\bP\trans\bP\bb\|^2/\left(2\sigma^2\right)\right\}}
  {\left(2\pi\sigma^2\right)^{n/2}}
\end{equation}
and
\begin{equation}
  \label{eq:bmarg}
  \begin{split}
    \left[\bb|\btheta,\sigma^2\right]&=
    \frac{\exp\left\{-\bb\trans\bSigma^{-1}\bb/2\right\}}
    {|\bSigma|^{1/2}\left(2\pi\right)^{q/2}}\\
    &=\frac{|\bOmega|^{1/2}\exp\left\{-\bb\trans\bOmega\bb/\left(2\sigma^2\right)\right\}}
    {\left(2\pi\sigma^2\right)^{q/2}}
  \end{split}
\end{equation}
from which we can derive the marginal distribution
\begin{equation}
  \label{eq:ymarg}
  \begin{split}
    \left[\by|\bbeta,\sigma^2,\btheta\right]&=
    \int_{\bb}\left[\by|\bbeta,\sigma^2,\bb\right]
    \left[\bb|\btheta,\sigma^2\right]\,d\bb\\
    &=\frac{|\bOmega|^{1/2}}{\left(2\pi\sigma^2\right)^{n/2}}
    \int_{\bb}
    \frac{\exp\left\{-\left[\|\by-\bX\bbeta-\bZ\bP\trans\bP\bb\|^2+\bb\trans\bOmega\bb\right]
        /\left(2\sigma^2\right)\right\}}
    {\left(2\pi\sigma^2\right)^{q/2}}\,d\bb .
  \end{split}
\end{equation}
Completing the square in $\bb$ and evaluating the resulting Gaussian
integral, then using $|\bZ\trans\bZ+\bOmega(\btheta)| =
|\bL(\btheta)|^2$ from (\ref{eq:Cholesky}), gives
\begin{equation}
  \label{eq:ymargEval}
  \left[\by|\bbeta,\sigma^2,\btheta\right]=
  \frac{|\bOmega|^{1/2}\exp\left\{-d(\by,\bbeta,\btheta)/\left(2\sigma^2\right)\right\}}
  {\left(2\pi\sigma^2\right)^{n/2}\left|\bL(\btheta)\right|}
\end{equation}
where
\begin{equation}
  \label{eq:ddef}
  d(\by,\bbeta,\btheta)=\min_{\bb}
  \left(\|\by-\bX\bbeta-\bZ\bP\trans\bP\bb\|^2+\bb\trans\bOmega\bb\right)
\end{equation}
is the minimum penalized residual sum of squares.
\bibliography{lme4}
\end{document}