SCM Repository

[matrix] Diff of /branches/trunk-lme4/inst/doc/Implementation.Rnw
ViewVC logotype

Diff of /branches/trunk-lme4/inst/doc/Implementation.Rnw

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 1210, Thu Jan 26 08:21:46 2006 UTC revision 1211, Tue Jan 31 00:25:18 2006 UTC
# Line 11  Line 11 
11    fontsize=\scriptsize}    fontsize=\scriptsize}
12  %%\VignetteIndexEntry{Implementation Details}  %%\VignetteIndexEntry{Implementation Details}
13  %%\VignetteDepends{Matrix}  %%\VignetteDepends{Matrix}
14    %%\VignetteDepends{lattice}
15  %%\VignetteDepends{lme4}  %%\VignetteDepends{lme4}
16  \begin{document}  \begin{document}
17  \SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,strip.white=TRUE}  \SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,strip.white=TRUE}
# Line 33  Line 34 
34  <<preliminaries,echo=FALSE,print=FALSE>>=  <<preliminaries,echo=FALSE,print=FALSE>>=
35  library(lattice)  library(lattice)
36  library(Matrix)  library(Matrix)
37  library(lme4)  #library(lme4)
38  data("Early", package = "mlmRev")  #data("Early", package = "mlmRev")
39  options(width=80, show.signif.stars = FALSE,  options(width=80, show.signif.stars = FALSE,
40          lattice.theme = function() canonical.theme("pdf", color = FALSE))          lattice.theme = function() canonical.theme("pdf", color = FALSE))
41  @  @
# Line 43  Line 44 
45  General formulae for the evaluation of the profiled log-likelihood and  General formulae for the evaluation of the profiled log-likelihood and
46  profiled log-restricted-likelihood in a linear mixed model are given  profiled log-restricted-likelihood in a linear mixed model are given
47  in \citet{bate:debr:2004} and the use of a sparse matrix  in \citet{bate:debr:2004}.  Here we describe a more general
48  representation for such models is described in \citet{bate:2004}.  The  formulation of the model using sparse matrix decompositions and the
49  purpose of this paper is to describe the details of the implementation  implementation of these methods in the \code{lmer} function for \RR{}.
 of this representation and those computational methods in the \code{lme4}  
 package for \RR{}.  
 Because we concentrate on the computational methods and the  
 representation, the order and style of presentation will be based on  
 the sequence of calculations, not on the sequence in which the results  
 would be derived.  We will emphasize ``what'' and not ``why''.  For  
 the ``why'', refer to the papers cited above.  
51  In \S\ref{sec:Form} we describe the form and representation of the  In \S\ref{sec:Form} we describe the form and representation of the
52  model.  The calculation of the criteria to be optimized by the  model.  The calculation of the criteria to be optimized by the
# Line 73  Line 66 
66    \label{eq:lmeGeneral}    \label{eq:lmeGeneral}
67    \by=\bX\bbeta+\bZ\bb+\beps\quad    \by=\bX\bbeta+\bZ\bb+\beps\quad
68    \beps\sim\mathcal{N}(\bzer,\sigma^2\bI),    \beps\sim\mathcal{N}(\bzer,\sigma^2\bI),
69    \bb\sim\mathcal{N}(\bzer,\sigma^2\bSigma^{-1}),    \bb\sim\mathcal{N}(\bzer,\bSigma),
70    \beps\perp\bb    \beps\perp\bb
71  \end{equation}  \end{equation}
72  where $\by$ is the $n$-dimensional response vector, $\bX$ is an  where $\by$ is the $n$-dimensional response vector, $\bX$ is an
73  $n\times p$ model matrix for the $p$ dimensional fixed-effects vector  $n\times p$ model matrix for the $p$ dimensional fixed-effects vector
74  $\bbeta$, $\bZ$ is the $n\times q$ model matrix for the  $\bbeta$, $\bZ$ is the $n\times q$ model matrix for the $q$
75  $q$ dimensional random-effects vector $\bb$ that has a Gaussian  dimensional random-effects vector $\bb$, which has a Gaussian
76  distribution with mean $\bzer$ and relative precision matrix $\bOmega$  distribution with mean $\bzer$ and variance-covariance matrix
77  (i.e., $\bOmega$ is the precision of $\bb$ relative to the precision  $\bSigma$, and $\beps$ is the random noise assumed to have a spherical
78  of $\beps$), and $\beps$ is the random noise assumed to have a  Gaussian distribution.  The symbol $\perp$ indicates independence of
79  spherical Gaussian distribution.  The symbol $\perp$ indicates  random variables.
 independence of random variables.  
81  We will assume that $\bX$ has full column rank and that $\bSigma$ is  We will assume that $\bX$ has full column rank and that $\bSigma$ is
82  positive definite.  positive definite.
84    \subsection{Structure of the variance-covariance matrix}
 \subsection{Structure of $\bSigma$}  
85  \label{sec:sigmaStructure}  \label{sec:sigmaStructure}
87  Components of the random effects vector $\bb$ and portions of its  Components of the random effects vector $\bb$ and portions of its
# Line 142  Line 133 
134  This called the ``relative'' precision because it is precision of  This called the ``relative'' precision because it is precision of
135  $\bb$ ($\bSigma^{-1}$) relative to the precision of $\beps$  $\bb$ ($\bSigma^{-1}$) relative to the precision of $\beps$
136  ($\bI/\left(\sigma^2\right)$).  ($\sigma^{-2}\bI$).
138  It is easy to establish that $\bOmega$ will have a repeated  It is easy to establish that $\bOmega$ will have a repeated
139  block/block diagonal structure like that of $\bSigma$.  That is,  block/block diagonal structure like that of $\bSigma$.  That is,
# Line 152  Line 143 
143  the inner diagonal blocks in the $i$th outer block is a copy of the  the inner diagonal blocks in the $i$th outer block is a copy of the
144  $q_i\times q_i$ positive-definite, symmetric matrix $\bOmega_i$.  $q_i\times q_i$ positive-definite, symmetric matrix $\bOmega_i$.
146  Because $\bOmega$ has a repeated block/block structure we define  Because $\bOmega$ has a repeated block/block structure we can define
147  the entire matrix if we specify the symmetric matrices $\bOmega_i, i =  the entire matrix by specifying the symmetric matrices $\bOmega_i, i =
148  1,\dots,k$ and, because of the symmetry, $\bOmega_i$ has at most  1,\dots,k$ and, because of the symmetry, $\bOmega_i$ has at most
149  $q_i(q_i+1)/2$ distinct elements.  We will write $\btheta$ for a  $q_i(q_i+1)/2$ distinct elements.  We will write $\btheta$ for a
150  parameter vector of length at most $\sum_{i=1}^{k}q_i(q_i+1)/2$ that  parameter vector of length at most $\sum_{i=1}^{k}q_i(q_i+1)/2$ that
# Line 172  Line 163 
163  pointers, stored in the \code{Gp} slot.  Successive differences of the  pointers, stored in the \code{Gp} slot.  Successive differences of the
164  group pointers are the total number of components of $\bb$ associated  group pointers are the total number of components of $\bb$ associated
165  with the $i$th grouping factor.  That is, these differences are $n_i  with the $i$th grouping factor.  That is, these differences are $n_i
166  q_i,i=1,\dots,k$.  q_i,i=1,\dots,k$.  The first element of the \code{Gp} slot is always 0.
169  \subsection{Examples}  \subsection{Examples}
# Line 188  Line 179 
179  Mm1 <- lmer(math ~ gr+sx*eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,  Mm1 <- lmer(math ~ gr+sx*eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
180              control = list(niterEM = 0, gradient = FALSE))              control = list(niterEM = 0, gradient = FALSE))
181  @  @
 Model \code{Sm1} has a single grouping factor with $18$ levels  
 \section{Likelihood and restricted likelihood}  
 given grouping factor have the same  variance-covariance matrix, say  
 For convenience we assume that components of $\bb$ are ordered  
 the components associated with grouping factor $\bbf_i$ are first,  
 then those associated with grouping $\bbf_2$ are next, and so on.  
 The random effects vector $\bb$ and the columns of $\bZ$ are divided  
 into $k$ outer blocks associated with grouping factors $\bff_i,  
 i=1,\dots,k$ in the data.  Because so many of the expressions that we  
 will use involve the grouping factors and quantities associated with  
 them, we will reserve the letter $i$ to index the grouping factors and  
 related quantities. We will not explicitly state the range of $i$,  
 which will always be $i=1,\dots,k$.  
 The outer blocks in $\bb$ and the columns of $\bZ$ are further  
 subdivided into $m_i$ inner blocks of size $q_i$ where $m_i$ is the  
 number of levels of $\bff_i$ (i.e.{} the number of distinct values in  
 $\bff_i$). Each grouping factor has associated with it an $n\times  
 q_i$ model matrix $\bZ_i$.  The full model matrix $\bZ$ is derived  
 from the grouping factors $\bff_i$ and these submodel matrices  
 Random effects in different outer blocks are independent.  Within each  
 of these blocks, the inner blocks of random effects are independent  
 and identically distributed with mean $\bzer$ and $q_i\times q_i$  
 variance-covariance matrix $\sigma^2\bOmega_i\inv$.  Thus $\bOmega$ is  
 block diagonal in $k$ blocks of size $m_i q_i\times m_i q_i$ and each  
 of these blocks is itself block diagonal in $m_i$ blocks, each of  
 which is the positive definite symmetric $q_i\times q_i$ matrix  
 $\bOmega_i$.  We call this a \emph{repeated block/block diagonal}  
 \subsection{Model specification and representation}  
 Like most model fitting functions in the S language, the \code{lme}  
 function in the \code{lme4} package uses a formula/data specification.  
 The formula specifies how to evaluate the response $\by$ and the  
 fixed-effects model matrix $\bX$ in the data frame provided as the  
 \code{data} argument.  An additional argument named \code{random}  
 specifies the names of the grouping factors and the formulae for  
 evaluation of the model matrices $\bZ_i$.  
 Standard optional arguments, such as \code{na.action} and  
 \code{subset}, are passed through to the \code{model.frame} function  
 that returns the model frame in which to evaluate all the formulae of  
 the model.  The \code{drop.unused.levels} optional argument is set to  
 \code{TRUE} so that unused levels in any factors are dropped during  
 creation of the model frame.  Thus there is no ambiguitiy regarding  
 the number of levels $m_i$ of each of the $\bff_i$.  Every level in  
 each of the $\bff_i$ must occur at least once in the model frame.  
 \subsubsection{Pairwise crosstabulation}  
 We begin by ordering the grouping factors so that $m_1\ge m_2\ge \dots  
 \ge m_k$ and, if $k>1$, forming their \emph{pairwise crosstabulation}.  
 This is the crossproduct matrix, $\bT=\tilde{\bZ}\trans\tilde{\bZ}$,  
 for the \emph{variance components model} determined by these grouping  
 factors.  (In a variance components model $q_1=q_2=\dots=q_k=1$ and  
 each of the $\tilde{\bZ}_i$ is the $n\times 1$ matrix of $1$s.  The  
 $i$th outer block of columns in $\tilde{\bZ}$ is the set of indicator  
 columns for the levels of $\bff_i$.  The name ``variance components''  
 reflects the fact that, in this model, each of the $\bOmega_i$, which  
 are scalars, is the relative precision of the component of the  
 variation in the response attributed to factor $\bff_i$.)  
 In fact it is not necessary to form and store $\bT$.  All that we need  
 are the locations of the nonzeros in the off-diagonal blocks of the  
   \end{bmatrix} .  
 (Because the $i$th outer block of columns in $\tilde{\bZ}$ is a set of  
 indicators, the diagonal blocks will themselves be diagonal and their  
 patterns of nonzeros are trivial.)  The blocks in the strict lower  
 triangle, $\bT_{(i,j)}, i>j$ are stored as a list of compressed,  
 sparse, column-oriented matrices (see Appendix~\ref{app:Sparse} for  
 details).  Only the column pointers and the row indices of these  
 sparse matrices are used..  
 These off-diagonal blocks are easily calculated because the integer  
 representations of the factors $\bff_i$ and $\bff_j$ form the row and  
 column indices in a (possibly redundant) triplet representation of  
 $\bT_{(i,j)}$.  All that we need to do is convert the triplet  
 representation to the compressed, sparse, column-oriented  
 representation.  This common conversion is easily accomplished with  
 standard software.  
 We check whether each of the matrices $\bT_{(j,j-1)},j=2,\dots,k$ has  
 exactly one nonzero in each column.  If so, the grouping factors form  
 a \emph{nested sequence} in that each level of factor $\bff_i$ occurs  
 with exactly one level of factor $\bff_j, i<j\le k$.  In the  
 compressed, sparse, column-oriented format it is easy to determine the  
 number of nonzeros in each column because these are the successive  
 differences in the column pointers.  
 If the grouping factors form a nested sequence (and a single grouping  
 factor is trivially a nested sequence) there is no need for further  
 symbolic analysis.  If not, which is to say that we have multiple,  
 non-nested grouping factors, we continue the symbolic analysis for the  
 unit lower triangular factor $\tL$ in the LDL form of the Cholesky  
 decomposition of $\bT$~\citep{Davis:2004}.  
 This symbolic analysis is performed on rows of blocks in the lower  
 triangle of $\tL$, where the blocks of $\tL$ correspond to those  
 of $\bT$  
   \end{bmatrix} .  
 We emphasize that we are only determining the positions of the  
 nonzeros in the blocks of $\tL$, not performing an actual  
 decomposition.  (The decomposition would fail for $k>1$ because $\bT$  
 is only positive semidefinite, not positive definite.  It is composed  
 of multiple blocks of indicators and the row sums within each block  
 are always unity.)  
 Because $\bT_{(1,1)}$ is diagonal, the block in the $(1,1)$ position  
 of $\tL$ will be both diagonal and unit, which is to say that it is  
 the identity matrix of size $m_1$.  A consequence of the $(1,1)$ block  
 being the identity is that the structure of the first column of blocks  
 of $\tL$ is the same as the corresponding block of $\bT$.  That is,  
 the nonzeros in $\tL_{(j,1)}$ are in the same positions as those in  
 $\bT_{(j,1)}$ for $1<j\le k$.  
 The off-diagonal nonzero positions in $\tL_{(2,2)}$ are determined  
 from the union of the off-diagonal nonzeros positions in  
 $\bT_{(2,2)}$, of which there are none, and those in  
 $\tL_{(2,1)}\tL_{(2,1)}\trans=\bT_{(2,1)}\bT_{(2,1)}\trans$.  The  
 number of nonzeros in $\tL_{(2,2)}$ can be changed by permuting the  
 levels of $\bff_2$, which causes a permutation of the rows of the  
 blocks $\bT_{(2,i)}$ and the columns of the blocks $\bT_{(i,2)}$.  We  
 determine a fill-reducing permutation of the levels of $\bff_2$ using  
 routines from the Metis~\citep{Metis} graph-partitioning package  
 applied to the incidence graph of $\bT_{(2,1)}\bT_{(2,1)}\trans$.  
 (This is the graph of $m_2$ nodes in which nodes $s$ and $t$ are  
 connected by an edge if and only if  
 $\left\{\bT_{(2,1)}\bT_{(2,1)}\trans\right\}_{s,t}$ is nonzero.)  We  
 apply this permutation to the columns of all blocks $\bT_{(i,2)}$ and  
 the rows of all blocks $\bT_{(2,i)}$.  
 The symbolic analysis function from the LDL package applied to  
 $\bT_{(2,1)}\bT_{(2,1)}\trans$ (after permuting the rows of  
 $\bT_{(2,1)}$) provides the positions of the nonzeros in  
 $\tL_{(2,2)}$, from which we determine the positions of the nonzeros  
 in $\tL_{(2,2)}\inv$.  The positions of the nonzeros in $\tL_{(3,2)}$  
 are those of $\bT_{(3,2)}\tL_{(2,2)}\inv$.  The next step in the  
 iteration is to form the union of the nonzero positions of  
 $\tL_{(3,2)}\tL_{(3,2)}\trans$ and the nonzero positions in  
 $\bT_{(3,1)}\bT_{(3,1)}\trans$ from which we determine a fill-reducing  
 permutation for the levels of $\bff_3$.  The process is continued  
 until a fill-reducing permutation for the levels of $\bff_k$ and the  
 structure of $\tL_{(k,k)}$ and $\tL_{(k,k)}\inv$ are determined.  
 As part of the symbolic analysis of each diagonal block, the  
 elimination tree~\citep{Davis:2004} of the block is determined and  
 stored as an integer array of length $m_i$.  It is straightforward to  
 check the number of terminal nodes in this tree.  If the elmination  
 tree for $\tL_{(i,i)}$ is found to have only one terminal node then  
 $\tL_{(i,i)}\inv$ will be dense. Furthermore, all subsequent diagonal  
 blocks will be dense so there is no purpose in checking for a  
 fill-reducing permutation of the levels of $\bff_j,i<j\le k$, and the  
 symbolic analysis can be terminated.  
 \subsubsection{Allocating storage}  
 In the numeric representation of the model we will write the augmented  
 model matrix of size $n\times(p+1)$ obtained by appending $\by$ to  
 $\bX$ as $\tilde{\bX}=\left[\bX,\by\right]$.  We can allocate the  
 storage for the sparse matrix representation of the model using the  
 results of the symbolic analysis and the numbers of columns in the  
 model matrices, $q_i,i=1,\dots,k$ and $(p+1)$.  
 We will store $\bOmega$, $\bZ\trans\bZ$, $\bZ\trans\tilde{\bX}$,  
 $\tilde{\bX}\trans\tilde{\bX}$ and components $\RZZ$, $\tRZX$ and $\tRXX$ of  
 the Cholesky decomposition  
     \bZ\trans\bZ+\bOmega & \bZ\trans\tilde{\bX} \\  
     \tilde{\bX}\trans\bZ & \tilde{\bX}\trans\tilde{\bX}  
     \RZZ & \tRZX\\  
     \bzer& \tRXX  
   \end{bmatrix} .  
 in one of four possible formats: as a dense matrix, as a block/block  
 sparse matrix, as a block/block diagonal matrix, or as a repeated  
 block/block diagonal matrix.  
 For example, we have already indicated that $\bOmega$ is repeated  
 block/block diagonal so we store it in the \code{Omega} slot as a list  
 of $k$ symmetric matrices of sizes $q_i\times q_i$ (only the upper  
 triangles of the symmetric matrices are used).  
 The matrices $\bZ\trans\tilde{\bX}$ and  
 $\tilde{\bX}\trans\tilde{\bX}$, and the decomposition components  
 $\tRZX$, and $\tRXX$ matrices are stored as dense matrices in slots  
 named \code{ZtX}, \code{XtX}, \code{RZX} and \code{RXX}, respectively.  
 The matrix $\bZ\trans\bZ$ is stored in a symmetric, sparse  
 column-oriented format like that of $\bT$ in  
 (\ref{eq:PairwiseCrosstab}) except that $\bZ\trans\bZ$ has both inner  
 and outer blocks.  The $k\times k$ grid of outer blocks  
 $(\bZ\trans\bZ)_{(i,j)}$ is determined by the grouping factors.  These  
 outer blocks correspond to the blocks $\bT_{(i,j)}$ in $\bT$.  The  
 diagonal outer block $(\bZ\trans\bZ)_{(i,i)}$ is itself block diagonal  
 in $m_i$ blocks of size $q_i\times q_i$.  We store the upper triangles  
 of these inner blocks in an array of size $q_i\times q_i\times m_i$.  
 Block $(\bZ\trans\bZ)_{(i,j)}$ for $j<i\le k$ is subdivided into inner  
 blocks of size $q_i\times q_j$ corresponding to the levels of grouping  
 factors $i$ and $j$.  Because an inner block of  
 $(\bZ\trans\bZ)_{(i,j)}$ is nonzero if and only if the corresponding  
 element of $\bT_{(i,j)}$ is nonzero, we can use the column pointers  
 and row indices from $\bT_{(i,j)}$ for $(\bZ\trans\bZ)_{(i,j)}$ with  
 the convention that they index the inner blocks, not the individual  
 elements, of $(\bZ\trans\bZ)_{(i,j)}$.  The inner blocks are stored in  
 a dense array of dimension $q_i\times q_j\times n_{(i,j)}$ where  
 $n_{(i,j)}$ is the number of nonzeros in $\bT_{(i,j)}$.  This is the  
 block/block sparse matrix format.  
 Instead of calculating $\RZZ$ we calculate the LDL form  
 where $\bL$ is block/block unit lower triangular (i.e.{} block/block lower  
 triangular with all the inner diagonal blocks in the $i$th outer  
 diagonal block being the $q_i\times q_i$ identity matrix) and $\bD$ is  
 block/block diagonal.  
 Just as the column pointers and row indices of the blocks of $\bT$ can  
 be used for the outer blocks of $\bZ\trans\bZ$, we can use the column  
 pointers and row indices of $\tL$, which we have determined in the  
 symbolic analysis, for the outer blocks of  
     \bI & \bzer & \dots & \bzer\\  
     \bL_{(2,1)} & \bL_{(2,2)} & \dots & \bzer\\  
     \vdots    & \vdots & & \bzer\\  
     \bL_{(k,1)} & \bL_{(k,2)} & \dots & \bL_{(k,k)}  
   \end{bmatrix} .  
 That is, the same structure as $\tL_{(i,j)}$  
 applies to the blocks $\bL_{(i,i)}, 1<i\le k$, $\bL_{(i,i)}, 1<i\le k$,  
 and $\bL_{(i,j)}, 1\le j<i\le k$ except that each nonzero in  
 $\tL_{(i.j)}$ corresponds to a block of size $q_i\times q_j$ in  
 From the symbolic analysis we also have the column pointers and row  
 indices for $\tL^{-1}_{(i,i)}$, corresponding to the diagonal outer  
 blocks of $\bL^{-1}$.  We allocate storage for these diagonal outer  
 blocks only.  Note that $\bL^{-1}$ is block/block unit lower  
 triangular just like $\bL$.  Because the inner diagonal blocks of  
 these matrices are always the identity, these inner blocks are not  
 explicitly stored.  Neither $\bL_{(1,1)}=\bI$ nor  
 $\left(\bL^{-1}\right)_{(1,1)}=\bI$ require any storage to be  
 We have used fill-reducing permutations of the levels of the grouping  
 factors $\bff_j,j=2,\dots,k$.  These can decrease, sometimes  
 dramatically, the amount of storage required for the $\tL_{(i,i)}$ but  
 generally they do not result in compact storage of  
 $\tL_{(i,i)}^{-1}$.  In the worst case the matrix $\tL_{(2,2)}^{-1}$  
 will be dense or nearly dense, resulting in a storage requirement of  
 approximately $q_2^2 m_2(m_2+1)/2$ elements for the array holding the  
 numeric values for the $(2,2)$ outer block of $\bL^{-1}$.  
 The matrix $\bD$ is block/block diagonal. It is stored as a list of  
 $k$ arrays of sizes $q_i\times q_i\times m_i$.  
 After allocating the storage we evaluate $\by$, $\bX$ and the $\bZ_i$  
 then update the contents of the \code{XtX}, \code{ZtX}, and \code{ZtZ}  
 slots.  We allocate the storage and update the contents of the storage  
 in separate steps so we can update the numeric values without  
 reallocating storage during iterative algorithms for fitting  
 generalized linear mixed models or nonlinear mixed models.  
 \section{Evaluation of the objective}  
 If prior estimates of the $\bOmega_i$ are available, we set the  
 \code{Omega} slot accordingly.  Otherwise we form initial estimates  
 from the matrices $\bZ_i$, as described in  
 \citet[ch.~3]{pinh:bate:2000}.  We then begin iterative optimization  
 of the estimation criterion with respect to the $\bOmega_i$ or with  
 respect to the unconstrained parameter $\btheta$ that determines the  
 $\bOmega_i$, as described in \S\ref{sec:Unconstrained}.  
 In this section we will describe the steps in evaluating the objective  
 function (i.e.{} the function to be optimized w.r.t.{} the $\bOmega_i$  
 or w.r.t.{} $\btheta$) given the current values of the $\bOmega_i$.  
 Recall that after setting values of the $\bOmega_i$ we form the  
 Cholesky decomposition (\ref{eq:tildeCrossProdGen})  
     \bZ\trans\bZ+\bOmega & \bZ\trans\tilde{\bX} \\  
     \tilde{\bX}\trans\bZ & \tilde{\bX}\trans\tilde{\bX}  
     \RZZ & \tRZX\\  
     \bzer& \tRXX  
   \end{bmatrix} .  
 and $\tilde{\bX}$ incorporates both $\bX$ and $\by$.  In some cases  
 it will be convenient to consider $\bX$ and $\by$ separately and we will  
 write the components of the Cholesky decomposition as if they were  
     \bZ\trans\bZ+\bOmega & \bZ\trans\bX  & \bZ\trans\by \\  
     \bX\trans\bZ         & \bX\trans\bX  & \bX\trans\by \\  
     \by\trans\bZ         & \by\trans\bX  & \by\trans\by  
     \RZZ & \RZX & \rZy \\  
     \bzer    & \RXX & \rXy \\  
     \bzer    & \bzer    & \ryy  
 even though they are stored in the form of (\ref{eq:tildeCrossProdGen}).  
 Recall also that $\RZZ$ is calculated and stored in the LDL form.  
 Instead of storing the symmetric positive definite inner blocks of the  
 block/block diagonal $\bD$, we calculate and store their upper  
 Cholesky factors.  We write the block/block diagonal matrix of upper  
 Cholesky factors as $\bD^{1/2}$ and its transpose as  
 $\bD^{\mathsf{T}/2}$.  (Note that transposing a block/block diagonal  
 matrix only requires transposing the inner blocks.)  The quantity  
 $\log\left|\bD\right|$ is evaluated as the sum of the logarithms of  
 the squares of the diagonal elements of the inner diagonal blocks of  
 The next step in the factorization is to solve for $\tRZX$ in  
 using blocked sparse matrix techniques (see  
 Appendix~\ref{app:blocked}), storing the result in the \code{RZX}  
 slot.  Finally, dense matrix operations are used to downdate the  
 densely stored $\tilde{\bX}\trans\tilde{\bX}$ and obtain the upper  
 Cholesky factor $\tRXX$ that satisfies  
   \tilde{\bX}\trans\tilde{\bX}-\tRZX\trans\tRZX ,  
 and provides $\log\left(\left|\RXX\right|^2\right)$ and $\ryy$.  
 The \code{status} slot is a pair of logical values called  
 \code{factored} and \code{inverted}.  It is set to \code{(TRUE,  
   FALSE)} indicating that the factorization is current and that it has  
 not been inverted.  
 In a separate calculation the logarithm of the determinant  
   \log |\bOmega|=\sum_{i=1}^k m_i \log |\bOmega_i|  
 is evaluated from the Cholesky factors of the $\bOmega_i$.  The  
 function to be minimized w.r.t.{} $\btheta$, called the \emph{profiled  
   deviance}, is  
   + n\left[1+\log\left(\frac{2\pi\ryy^2}{n}\right)\right]  
 for ML estimation or  
   - \log\left|\bOmega\right|  
   + (n-p)\left[1+\log\left(\frac{2\pi\ryy^2}{n-p}\right)\right]  
 for REML.  
 \subsection{Inverting the factorization}  
183  Evaluatation of the objective function does not require inversion of  Model \code{Sm1} has a single grouping factor with $18$ levels.  The
184  the factorization nor does evaluation of many other quantities of  \code{Omega} slot is a list of length one containing a $2\times 2$
185  interest, such as the conditional variance estimates  symmetric matrix.
186  $\widehat{\sigma^2}=\ryy^2/n$ and $\widehat{\sigma}_R^2=\ryy^2/(n-p)$,  <<Sm1grp>>=
187  the conditional fixed effects estimates, $\widehat{\bbeta}$, which satisfy  str(Sm1@flist)
188  $\RXX\widehat{\bbeta}=\rXy$,  show(Sm1@Omega)
189  and the conditional modes of  show(Sm1@nc)
190  the random effects, $\widehat{\bb}$, which satisfy  show(Sm1@Gp)
191  \begin{equation}  diff(Sm1@Gp)/Sm1@nc
 However, if we wish to evaluate the ECME step or the gradient and/or  
 the Hessian of the objective, it is convenient to invert some parts of  
 the decomposition.  We invert the dense upper triangular matrix  
 $\tRXX$ in place, producing $\RXX^{-1}$ in the first $p$ rows and  
 columns, $1/\ryy$ in the $(p+1,p+1)$ position, and  
 $-\widehat{\bbeta}/\ryy$ in the first $p$ rows of the $p+1$st column.  
 We also invert each of the triangular inner blocks of $\bD^{1/2}$ in  
 place producing $\bD^{-1/2}$.  The inverses of the outer blocks on the  
 diagonal of $\bL$ are also determined and stored separately.  (In  
 general the diagonal outer blocks cannot be inverted in place.  The  
 number of nonzero inner blocks in the inverse of an outer block on the  
 diagonal can be different from the number of nonzero inner blocks in  
 the outer diagonal block itself.)  Notice that this last operation  
 is trivial in the case of a single grouping factor or multiple, nested  
 grouping factors because all the outer diagonal blocks in $\bL$ are  
 identity matrices.  
 The matrix $\tRZX{\tRXX}^{-1}$ is evaluated as a dense matrix product  
 then each column of  
 $-\bL\invtrans\bD^{-1/2}\left(\tRZX{\tRXX}^{-1}\right)$ is evaluated  
 using blocked sparse matrix operations (see Appendix~\ref{app:blocked}  
 for details) and stored in the \code{RZX} slot.  The first $p$ columns  
 of the result, which are $-\bL\invtrans\bD^{-1/2}\RZX\RXX^{-1}$, are  
 used to create products involving the matrix $\vb$, which is the  
 unscaled, conditional, REML variance-covariance of the random effects  
     \vb&= \begin{bmatrix}\bI&\bzer\end{bmatrix}  
       \begin{bmatrix}\RZZ\trans & \bzer \\ \RZX\trans & \RXX\trans\end{bmatrix}  
       \begin{bmatrix}\RZZ & \RZX \\ \bzer & \RXX\end{bmatrix}  
     \begin{bmatrix} \bI\\\bzer \end{bmatrix}\\  
     &= \begin{bmatrix}\bI&\bzer\end{bmatrix}  
     \begin{bmatrix} \bI\\\bzer \end{bmatrix}\\  
 The $p+1$st column is $-\widehat{\bb}/\ryy$.  
 The \code{status} flag is changed to \code{(TRUE, TRUE)} indicating  
 that the factorization is current and that it is the inverted  
 components that are stored in the \code{RZX}, \code{RXX}, and  
 \code{Dfac} slots and that the \code{LIx} slot if current.  
 To illustrate this representation and these operations we consider  
 some examples of common types of mixed effects models.  
 \subsection{Single grouping factor}  
 The \code{Early} data set in the \code{lme4} package is from a  
 longitudinal study on an early childhood intervention program.  These  
 data are described in \citet[ch.~3]{Sing:Will:2003}.  The response is  
 a cognitive development score, \code{cog}, measured at ages 1, 1.5,  
 and 2 years (\code{age}) on 103 infants (identified by \code{id}); 58  
 of whom were in the treatment group and 45 in the control group  
 (identified by \code{trt}).  We convert the \code{age} measurement to  
 ``time on study'', \code{tos} as the treatment began at age 0.5 years.  
 Early$tos <- Early$age - 0.5  
 print(xyplot(cog ~ age | id, Early, type = 'b', aspect = 'xy',  
              layout = c(23,5), between = list(y = c(0,0,0.5)),  
              skip = rep(c(0,1),c(58,11)),  
              xlab = "Age (yr)",  
              ylab = "Cognitive development score",  
              scales = list(x = list(tick.number = 3, alternating = TRUE,  
                            labels = c("1","","2"), at = c(1,1.5,2))),  
              par.strip.text = list(cex = 0.7)))  
192  @  @
193  A lattice plot of the longitudinal scores is given in Figure~\ref{fig:EarlyData}.  
194  \begin{figure}[tbp]  Model \code{Cm1} has two grouping factors: the \code{school} factor
195    \centering  with 2410 levels and the \code{lea} factor (local education authority
196    \includegraphics[width=\textwidth]{figs/Example-EarlyData}  - similar to a school district in the U.S.A.) with 131 levels.  It
197    \caption[Cognitive development scores]{Cognitive development scores  happens that the \code{school} factor is nested within the \code{lea}
198      for 103 infants; 45 in the control group (identification numbers  factor, a property that we discuss below. The \code{Omega} slot is a
199      above 900) and 58 in the treatment group. The data from  list of length two containing two $1\times 1$ symmetric matrices.
200      each infant are displayed in separate panels. Those for the  <<Cm1grp>>=
201      infants in the treatment group are in the lower three rows of  str(Cm1@flist)
202      panels; the control group are in the upper two rows.  Within each  show(Cm1@Omega)
203      group the panels are ordered by increasing initial cognitive  show(Cm1@nc)
204      score.}  show(Cm1@Gp)
205    \label{fig:EarlyData}  diff(Cm1@Gp)/Cm1@nc
 These data are grouped by subject, as is common in longitudinal data.  
 The only grouping factor for random effects is \code{id}.  (The  
 subjects are themselves grouped into a treatment group and a control  
 group but that dichotomy would be modeled with fixed effects, not  
 random effects.)  As an initial model an analyst may fit an  
 \emph{unconditional growth} model~\cite[ch.~4]{Sing:Will:2003}  
 fm1 <- lmer(cog ~ tos + (tos|id), Early)  
 or proceed immediately to a model that allows for the effects of the  
 fm2 <- lmer(cog ~ tos * trt + (tos|id), Early)  
206  @  @
208  The structure of the resulting object is  Model \code{Mm1} has three grouping factors: \code{id} (the student)
209  <<EarlyTrtStr>>=  with 10732 levels, \code{tch} (the teacher) with 1374 levels and
210  str(fm2)  \code{sch} (the school) with 80 levels. The \code{Omega} slot is a
211    list of length three containing two $2\times 2$ symmetric matrices and
212    one $1\times 1$ matrix.
213    <<Mm1grp>>=
214    str(Mm1@flist)
215    show(Mm1@Omega)
216    show(Mm1@nc)
217    show(Mm1@Gp)
218    diff(Mm1@Gp)/Mm1@nc
219  @  @
221  \section{Frechet derivatives}  The last element of the \code{Gp} slot is the dimension of $\bb$.
222  \label{sec:Frechet}  Notice that for model \code{Mm1} the dimension of $\bb$ is 22,998.
223    This is also the order of the symmetric matrix $\bOmega$ although the
224    contents of the matrix are determined by $\btheta$ which has a length
225    of $3+1+3=7$ in this case.
227  We will denote the Frechet derivative of the matrix $\bOmega$ with  Table~\ref{tbl:dims} summarizes some of the dimensions in these examples.
 respect to $\bOmega_i$ by $\der_{\bOmega_i}\bOmega$.  This is a four  
 dimensional array that can be regarded as an indexed set of $q_i^2$  
 matrices, each the same size as $\bOmega$.  These matrices, which we  
 call the $q_i^2$ ``faces'' of the array, are indexed by the rows $r$  
 and the columns $c$, $r,c=1,\dots,q_i$ of $\bOmega_i$.  The $(r,c)$th  
 face, which we write as $\der_{i:r,c}\bOmega$, is the derivative of  
 $\bOmega$ with respect to the $(r,c)$th element of $\bOmega_i$. It is  
 a repeated block/block diagonal matrix where all of the outer diagonal  
 blocks are zero except for the $(i,i)$ diagonal block which is block  
 diagonal in $m_i$ blocks of $\bbe_r\bbe_c\trans$, where $\bbe_j$ is  
 the $j$th column of the identity matrix of size $q_i$.  
 An expression of the form  
 $\tr\left[\left(\der_{\bOmega_i}\bOmega\right)\bM\right]$ where $\bM$  
 is a matrix of the same size as $\bOmega$ evaluates to $q_i^2$ scalars  
 indexed by $r$ and $c$, $r,c=1,\dots,q_i$, which we convert to a  
 $q_i\times q_i$ matrix in the obvious way.  This type of expression  
 can be simplified as  
   = \sum_{j=1}^{m_i}\tr\left[(\der_{\bOmega_i}\bOmega_i)\bM_{i,i,j,j}\right]  
   = \sum_{j=1}^{m_i}\bM_{i,i,j,j}\trans  
 where $\bM_{i,i,j,j}$ is the $j$th inner diagonal block in the $i$th  
 outer diagonal block of $\bM$.  To establish the last equality in  
 (\ref{eq:derOmegai}) note that for any $q_i\times q_i$ matrix $\bA$  
 the entry in row $r$ and column $c$ of  
 $\tr\left[(\der_{\bOmega_i}\bOmega_i)\bA\right]$ is  
   = \tr\left[\bbe_r\bbe_c\trans\bA\right]  
   = \tr\left[\bbe_c\trans\bA\bbe_r\right]  
   = \bbe_c\trans\bA\bbe_r  
   = \bA_{c,r}  
 If $\bM$ is symmetric, as is the case in all the expressions of this  
 form that we consider, then  
 $\tr\left[\left(\der_{\bOmega_i}\bOmega\right)\bM\right]$ will be  
229  Expressions of this form that we will require include  \begin{table}[tb]
230  \begin{equation}    \centering
231    \label{eq:derOmegaInv}    \begin{tabular}{r r r r r r r r r r r}
232    \begin{aligned}      \hline\\
233    \tr\left[\left(\der_{\bOmega_i}\bOmega\right)\left(\bOmega\right)^{-1}\right]      \multicolumn{1}{c}{Name}&\multicolumn{1}{c}{$n$}& \multicolumn{1}{c}{$k$}&
234      &= \sum_{j=1}^{m_i}\left(\bOmega_{i,i,j,j}\right)\inv\\      \multicolumn{1}{c}{$n_1$}&\multicolumn{1}{c}{$q_1$}&
235      &= m_i\bOmega_i\inv      \multicolumn{1}{c}{$n_2$}&\multicolumn{1}{c}{$q_2$}&
236    \end{aligned},      \multicolumn{1}{c}{$n_3$}&\multicolumn{1}{c}{$q_3$}&
237  \end{equation}      \multicolumn{1}{c}{$q$}&\multicolumn{1}{c}{$\#(\btheta)$}\\
238  \begin{equation}      \hline
239    \label{eq:bbDer}      \code{Sm1}&  180&1&   18&2&    & &  & &   36&3\\
240    \tr\left[\left(\der_{\bOmega_i}\bOmega\right) \left(      \code{Cm1}&31022&2& 2410&1& 131&1&  & & 2541&2\\
241        \frac{\widehat{\bb}}{\ryy} \frac{\widehat{\bb}}      \code{Mm1}&24578&3&10732&2&1374&1&80&2&22998&7\\
242        {\ryy} \trans\right)\right] = \bB_i \bB_i\trans      \hline
243  \end{equation}    \end{tabular}
244  where $\bB_i$ is the $q_i\times m_i$ matrix whose $j$ column is    \caption{Summary of various dimensions of model fits used as
245  $\widehat{\bb}_{i,j}/\ryy, j=1,\dots,m_i$ (recall that these vectors      examples.}
246  are in the $p+1$st column of the \code{RZX} slot after the inversion    \label{tbl:dims}
247  step), and $\tr \left[ \left( \der_{\bOmega_i} \bOmega \right)  \end{table}
248    (\bZ\trans\bZ+\bOmega)^{-1} \right]$.  
249    \subsection{Permutation of the random-effects vector}
250  The last term requires evaluation of the  \label{sec:permutation}
251  inner diagonal blocks of  
252  \begin{equation}  For most mixed-effects model fits, the model matrix $\bZ$ for the
253    \label{eq:ZtZinv}  random effects vector $\bb$ is large and sparse.  That is, most of the entries
254    \left(\bZ\trans\bZ +\bOmega\right)^{-1}=\left(\RZZ\trans\RZZ\right)\inv=  in $\bZ$ are zero (by design, not by accident).
255    =\bL\invtrans\bD^{-1/2}\bD^{-\mathsf{T}/2}\bL^{-1}  
256  \end{equation}  Numerical analysts have developed special techniques for representing
257  When $k=1$, $\bL$ is an identity  and manipulating sparse matrices. Of particular importance to us are
258  matrix and the result is simply  techniques for obtaining the left Cholesky factor $\bL(\btheta)$ of
259  \begin{equation*}  the large, sparse, positive-definite, symmetric matrices.  In
260    \tr\left[\left(\der_{\bOmega_i}\bOmega\right)  particular, we want to obtain the Cholesky factorization of
261      \bD^{-1/2}\bD^{-\mathsf{T}/2}\right]=  $\bZ\trans\bZ+\bOmega(\btheta)$ for different values of $\btheta$.
262    \sum_{j=1}^{m_1}\bD_{1:j}^{-1/2}\bD_{1:j}^{-\mathsf{T}/2} .  
263  \end{equation*}  Sparse matrix operations are typically performed in two phases: a
264  When $k>1$ we must evaluate the crossproduct of each of the $m_i$  \emph{symbolic phase} in which the number of non-zero elements in the
265  inner blocks of $q_i$ columns in the $i$th outer block of columns of  result and their positions are determined followed by a \emph{numeric
266  $\bD^{-\mathsf{T}/2}\bL^{-1}=\RZZ\invtrans$, which we do using blocked, sparse    phase} in which the actual numeric values are calculated.  Advanced
267  matrix techniques.  See Appendix~\ref{ssec:InverseBlocks} for details.  sparse Cholesky factorization routines, such as the CHOLMOD library
268    (Davis, 2005) which we use, allow for calculation of a fill-reducing
269  For REML results we also need  permutation of the rows and columns during the symbolic phase.  In
270  \begin{multline}  fact the CHOLMOD code allows for evaluation of both a fill-reducing
271    \label{eq:Vbterm}  permutation and a post-ordering that groups the columns of $\bL$ into
272    \tr\left[\left(\der_{\bOmega_i}\bOmega\right)\vb\right] = \\  blocks with identical row structure, thus allowing for the dense
273    \tr\left[\left(\der_{\bOmega_i}\bOmega\right)  matrix techniques to be used on the blocks or ``super-nodes''.  Such a
274      \RZZ\inv\RZZ\invtrans\right]  decomposition is called a \emph{supernodal} Cholesky factorization.
275    + \tr\left[\left(\der_{\bOmega_i}\bOmega\right)  
276      \RZZ\inv\RZX\RXX\inv\RXX\invtrans\RZX\trans\RZZ\invtrans\right]  Because the number and positions of the nonzeros in $\bOmega(\btheta)$
277  \end{multline}  do not change with $\btheta$ and the nonzeros in $\bOmega(\btheta)$
278  We have just described how to calculate the first term in  are a subset of the nonzeros in $\bZ\trans\bZ$, we only need to
279  (\ref{eq:Vbterm}).  The second term is evaluated as the crossproducts  perform the symbolic phase once and we can do so using $\bZ\trans\bZ$.
280  of the $m_i$ inner blocks of $q_i$ rows in the $i$th outer block of the rows  That is, using $\bZ$ only we can determine the permutation matrix
281  of $-\RZZ\inv\RZX\RXX\inv$, which is calculated and stored in the  $\bP$ for all decompositions of the form
 \code{RZX} slot during the inversion of the factorization, as  
 described in \S\ref{ssec:Inverting}.  
 \section{ECME updates}  
 ECME iterations provide a stable, but only linearly convergent,  
 optimization algorithm for the objective functions $f$ and $f_R$.  We  
 use a moderate number (the default is 20) of them to refine our  
 initial estimates $\bOmega_i^{(0)}$ and bring us into the region of  
 the optimum before switching to Newton iterations on the unconstrained  
 parameter vector $\btheta$.  
 The ECME iterations can be defined in terms of the $\bOmega_i$.  At  
 the $\nu$th iteration, $\bOmega^{(\nu+1)}$ is defined to be the  
 repeated block/block diagonal, symmetric, positive definite matrix  
 that satisfies the $k$ systems of equations  
 for ML estimates or the $k$ systems  
282  \begin{equation}  \begin{equation}
283    \label{eq:theta1R}    \label{eq:Cholesky}
284    \tr\left[\der_{\bOmega_i}\bOmega\left(    \bP\left[\bZ\trans\bZ+\bOmega(\btheta)\right]\bP\trans = \bL(\btheta)\bL(\btheta)\trans
285  \end{equation}  \end{equation}
 for REML.  
287  The results of \S\ref{sec:Frechet} provide  We revise (\ref{eq:lmeGeneral}) by incorporating the permutation to obtain
     \tr \left[ \left( \der_{\bOmega_i} \bOmega \right)  
 for ML and  
     \tr \left[ \left( \der_{\bOmega_i} \bOmega \right)\vb  
 for REML.  
 \subsection{Gradient evaluations}  
 The gradients of (\ref{eq:ProfiledLogLik}) and (\ref{eq:ProfiledLogRestLik})  
 with respect to $\bOmega_i$ are  
   \nabla_{\bOmega_i} f_R&=\tr\left[\der_{\bOmega_i}\bOmega\left(  
 and we have already described how to calculate all those terms.  
 \section{Evaluation of the Hessian}  
 The Hessians of the scalar functions $f$ and $f_R$ are symmetric  
 matrices of second derivatives.  As for the calculation of the  
 gradient, we first evaluate the Hessian with respect to the  
 $Q=\sum_{i=1}^k q_i^2$ dimensional vector  
 where $\bomega_i=\ovec\bOmega_i$ and `$\ovec$' is the vectorization  
 function that produces a column vector from a matrix by concatenating  
 the columns.  That is, we temporarily ignore the fact that the  
 $\bOmega_i$ must be positive definite and symmetric and we consider each of the  
 elements in these matrices separately.  In \S\ref{sec:Unconstrained}  
 we describe an unconstrained parameter vector $\btheta$ of length  
 $\sum_{i=1}^k\binom{q_i+1}{2}$ and the conversion of the gradient and  
 Hessian for $\bomega$ to the corresponding quantities for $\btheta$.  
 For convenience we index the $Q$ elements of $\bomega$ by triplets  
 $i:r,c$ denoting the element at row $r$ and column $c$ of  
 $\bOmega_i$.  \citet{bate:debr:2004} show that the element in row  
 $i:r,c$ and column $j:s,t$ of the Hessian is  
     \left\{\nabla_{\bomega}^2 f\right\}_{i:r,c;j:s,t}=  
 for ML estimates, and  
288  \begin{equation}  \begin{equation}
289    \label{eq:HessianREML}    \label{eq:lmePerm}
290    \begin{aligned}    \by=\bX\bbeta+\bZ\bP\trans\bP\bb+\beps\quad
291      \left\{\nabla_{\bomega}^2 f_R\right\}_{i:r,c;j:s,t}=    \beps\sim\mathcal{N}(\bzer,\sigma^2\bI),
292      &\tr\left[\der_{i:r,c}\left(\der_{j:s,t}\bOmega\right)    \bb\sim\mathcal{N}(\bzer,\sigma^2\bSigma),
293        \left(\vb-\bOmega\inv+    \beps\perp\bb
294  \end{equation}  \end{equation}
 for REML.  
 The matrix $\der_{j:s,t}\bOmega$ is constant so the first term in both  
 (\ref{eq:HessianML}) and (\ref{eq:HessianREML}) is zero.  
297  \section{Unconstrained parameterization}  \subsection{Extent of the sparsity}
298  \label{sec:Unconstrained}  \label{sec:sparsity}
300  The vector  Table~\ref{tbl:sparse} shows the extent of the sparsity of the
301  $\btheta=\left[\btheta_1\trans,\dots,\btheta_k\trans\right]\trans$ is  matrices $\bZ$, $\bZ\trans\bZ$ and $\bL$ in our examples.
 an unconstrained parameterization of $\bOmega$ based on the LDL form  
 of the Cholesky decomposition of the $\bOmega_i$  
   \bOmega_i=\bL_i\bD_i\bL_i\trans .  
 where $\bL_i$ is a $q_i\times q_i$ unit lower triangular matrix and  
 $\bD_i$ is a $q_i\times q_i$ diagonal matrix with positive diagonal  
 elements.  We use the $q_i$ logarithms of the diagonal elements of  
 $\bD_i$, written $\bdelta_i$, and, when $q_i>1$, the row-wise  
 concatenation of the $\binom{q_i}{2}$ elements in the strict lower  
 triangle of $\bL_i$, written $\blambda_i$, as the $\binom{q+1}{2}$  
 dimensional unconstrained parameter  
 Let $g$ be any scalar function of the $\bOmega_i$ such that the  
 $q_i\times q_i$ gradient matrices $\nabla_{\bdelta_i}f$ are symmetric.  
 (Both $f$, the objective for ML estimation, and $f_R$, the objective  
 for REML estimation, are such functions.)  The components  
 $\nabla_{\bdelta_i}g$ and $\nabla_{\blambda_i}g$ of the gradient  
 vector $\nabla_{\btheta_i}g$ can be evaluated from the symmetric  
 gradient matrix $\nabla_{\bOmega_i}g$ and the  
 derivative of the product (\ref{eq:OmegaLDL}).  For example,  
       \frac{\partial \bOmega_i}{\partial \left\{\bdelta_i\right\}_j}\right]\\  
 for $j=1,\dots,q_i$, where $\bR_i$ is the upper Cholesky factor of $\bOmega_i$.  
 The partial derivative with respect to  
 element $t$ of $\blambda_i$, which determines the $r,c$th element of  
 $\bL_i$, is  
303  \section{Acknowledgements}  The matrix $\bL$ is the supernodal representation of the left Cholesky
304  \label{sec:Ack}  factor of $\bP\left(\bZ\trans\bZ+\bOmega\right)\bP\trans$.  Because
305    the fill-reducing permutation $\bP$ has been applied the number of
306  This work was supported by U.S.{} Army Medical Research and Materiel  nonzeros in $\bL$ will generally be less than the number of nonzeros
307  Command under Contract No.{} DAMD17-02-C-0119.  The views, opinions  in the left Cholesky factor of $\bZ\trans\bZ+\bOmega$.  However, when
308  and/or findings contained in this report are those of the authors and  any supernodes of $\bL$ contain more than one column there will be
309  should not be construed as an official Department of the Army  elements above the diagonal of $\bL$ stored and these elements are
310  position, policy or decision unless so designated by other  necessarily zero.  They are stored in the supernodal factorization so
311  documentation.  that the diagonal block for a supernode can be treated as a dense
312    rectangular matrix.
313  \appendix  
314    Generally storing these few elements from above the diagonal is a
315  \section{Sparse matrix formats}  minor expense compared to the speed savings of using dense matrix
316  \label{app:Sparse}  operations and being about to take advantage of accelerated level-3
317    BLAS (Basic Linear Algebra Subroutines) code.  Furthermore they are
318  The basic form of sparse matrix storage, called the \emph{triplet}  never used in computation because the diagonal block is treated as a
319  form, represents the matrix as three arrays of length \code{nz}, which  densely stored lower triangular matrix.  We mention this so the
320  is the number of nonredundant nonzero elements in the matrix.  The  interested reader can understand in \code{Cm1}, for example,
321  \code{i} array contains the row indices, the \code{j} array the column  $\nz(\bZ\trans\bZ)$ is 54 ($=18\times 3$) yet $\nz(\bL)$ is 72
322  indices, and the \code{x} array the values of the nonzero elements.  A  ($=18\times 4$).  In this example there is no fill-in so the number of
323  \emph{column-oriented} triplet form requires that the elements of  nonzeros in $\bL$ is actually 54 but the matrix is stored as 18
324  these arrays be such that \code{j} is nondecreasing.  A sorted,  lower-triangular blocks that are each $2\times 2$.
325  column-oriented triplet form is such that elements in \code{i}  \begin{table}[tb]
326  corresponding to the same \code{j} index are in increasing order. That    \centering
327  is, the arrays are ordered by column and, within column, by row.    \begin{tabular}{r r r r r r r r r r r}
328        \hline\\
329  In column-oriented storage, multiple elements in the same column      \multicolumn{1}{c}{Name}&\multicolumn{1}{c}{$n$}& \multicolumn{1}{c}{$k$}&
330  produce runs in \code{j}.  A \emph{compressed}, column-oriented      \multicolumn{1}{c}{$q$}&
331  storage form replaces \code{j} by \code{p}, an array of pointers to      \multicolumn{1}{c}{$\nz(\bZ)$}&\multicolumn{1}{c}{$\spr(\bZ)$}&
332  the indices (in \code{i} and \code{x}) of the first element in each      \multicolumn{1}{c}{$\nz(\bZ\trans\bZ)$}&\multicolumn{1}{c}{$\spr(\bZ\trans\bZ)$}&
333  column.  If there are $n$ columns in the matrix, this array is of      \multicolumn{1}{c}{$\nz(\bL)$}&\multicolumn{1}{c}{$\spr(\bL)$}\\
334  length $n+1$, with the last element of \code{p} being one greater than      \hline
335  the index of last element in the last column.  Then the differences of      \code{Sm1}&  180&1&   36&  360&0.0556&   54&0.0811&   72&0.1081\\
336  successive elements of \code{p} give the number of nonzeros in each      \code{Cm1}&31022&2&   36&  360&0.0556&   54&0.0811&   72&0.1081\\
337  column.      \hline
338      \end{tabular}
339  In the implementation in the \code{Matrix} package for R, the indices    \caption{Summary of the sparsity of the model matrix $\bZ$, its
340  \code{i}, \code{j}, and \code{p} are \code{0}-based, as in the C      crossproduct matrix $\bZ\trans\bZ$ and the left Cholesky factor
341  language, (i.e.{} the first element of an array has index \code{0})      $\bL$ in the examples.  The notation $\nz{}$ indicates the number of nonzeros in the
342  not \code{1}-based, as in the S language.  Thus the first element of      matrix and $\spr{}$ indicates the sparsity index (the fraction of the elements
343  \code{p} is always zero.      in the matrix that are non-zero). Because $\bZ\trans\bZ$ is
344        symmetric only the nonzeros in the upper triangle are counted and
345        the sparsity index is relative to the total number of elements in
346  \section{Blocked sparse matrix operations}      the upper triangle.  The number of nonzeros and the sparsity index
347  \label{app:blocked}      of $\bL$ are calculated for the supernodal decomposition, which
348        can include some elements that are above the diagonal and hence must
349  We take advantage of the blocked sparse structure of $\bZ\trans\bZ$,      be zero.}
350  $\bOmega$, $\bL$ and $\bD$ in operations involving these matrices.      \label{tbl:sparse}
351  Consider, for example, the operation of solving for $\tRZX$ in the  \end{table}
 system (\ref{eq:RZX}),  
 $\bL\bD^{\mathsf{T}/2}\tRZX=\bZ\trans\tilde{\bX}$.  Let $\bu$ be a  
 column of $\bD^{\mathsf{T}/2}\tRZX$ and $\bv$ be the corresponding  
 column of $\bZ\trans\tilde{\bX}$.  The corresponding column of the  
 result, which is $\bD^{-\mathsf{T}/2}\bu$, is easily evaluated from  
 $\bu$. (Recall that $\bD^{-1/2}$ is block/block diagonal, i.e. block  
 diagonal in $k$ outer blocks of sizes $m_i q_i\times m_i q_i$ each of  
 which is itself block diagonal in $m_i$ blocks of size $q_i\times  
 q_i$, and that $\bD^{-1/2}$ is calculated and stored during the  
 inversion step.)  Both $\bu$ and $\bv$ can be divided into $k$ outer  
 blocks of sizes $m_i q_i$, which we denote $\bu_i$ and $\bv_i$, and  
 each of the outer blocks can be divided into $m_i$ inner blocks of  
 size $q_i$, which we denote $\bu_{i:j}$ and $\bv_{i:j}$.  
 Because $\bL$ is lower triangular, the blocks of $\bv$ can be  
 evaluated in a blockwise forward solve.  That is, the blocked  
 equations are solved in the order  
     \bL_{(1,1)}\bu_1 &= \bv_1\quad\Longrightarrow\quad \bI\bu_1 =  
     \bu_1 = \bv_1\\  
     \bL_{(2,2)}\bu_2 &= \bv_2-\bL_{(2,1)}\bu_1\\  
     \bL_{(k,k)}\bu_k &= \bv_k-\bL_{(k,k-1)}\bu_{k-1}-\dots-\bL_{(k,1)}\bu_1  
 Notice that the solution in the first block does not require any  
 calculation because $\bL_{(1,1)}$ is always the identity.  In many  
 applications $k=1$ and the operation of solving for $\bL\bu=\bv$ can  
 be skipped entirely.  In most other applications the size of $\bu_1$  
 dominates the size of the other components so that being able to set  
 $\bu_1=\bv_1$ represents a tremendous savings in computational effort.  
 We take advantage of the blocked, sparse columns of the outer blocks  
 of $\bL$ in evaluations of the form $\bv_2-\bL_{(2,1)}\bu_{1}$.  That  
 is, we evaluate $\bL_{(2,1)}\bu_{1}$ using the blocks of size  
 $q_2\times q_1$ skipping blocks that we know will be zero as we update  
 the inner blocks of $\bu_2$ in place.  The solutions of systems of the  
 form $\bL_{(i,i)}\bu_i=\tilde{\bv}_i$ where $\tilde{\bv}_i$ is the  
 updated $\bv_i$ are also done in inner blocks taking advantage of the  
 sparsity of $\bL_{(i,i)}^{-1}$.  Recall that in the case of nested grouping  
 factors the $\bL_{(i,i)}$ are all identity matrices and this step is  
354  \subsection{Accumulating diagonal blocks of the inverse}  \section{Likelihood and restricted likelihood}
355  \label{ssec:InverseBlocks}  \label{sec:likelihood}
357  The ECME steps and the gradient evaluation require  In general the \emph{maximum likelihood estimates} of the parameters
358  \begin{equation}  in a statistical model are those values of the parameters that
359    \label{eq:InverseBlocks}  maximize the likelihood function, which is the same numerical value as
360      \tr\left[\der_{\bOmega_i}\bOmega\left(\bZ\trans\bZ+\bOmega\right)^{-1}\right]  the probability density of $\by$ given the parameters but regarded as
361      = \tr\left[\left(\der_{\bOmega_i}\bOmega\right)  a function of the parameters given $\by$, not as a function of $\by$
362        \left(\bD^{-\mathsf{T}/2}\bL\inv\right)\trans  given the parameters.
363        \left(\bD^{-\mathsf{T}/2}\bL\inv\right)\right]  
364    For model (\ref{eq:lmePerm}) the parameters are
365    $\bbeta$, $\sigma^2$ and $\btheta$ (as described in
366    \S\ref{sec:relativePrecision}, $\btheta$ and $\sigma^2$ jointly
367    determine $\bSigma$) so we evaluate the likelihood $L(\bbeta,
368    \sigma^2, \btheta|\by)$ as
369    \begin{equation}
370      \label{eq:Likelihood1}
371      L(\bbeta, \sigma^2, \btheta|\by) = f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)
372    \end{equation}
373    where $f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)$
374    is the marginal probability density for $\by$ given the parameters.
376    Because we will need to write several different marginal and
377    conditional probability densities in this section and expressions like
378    $f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)$ are
379    difficult to read, we will adopt a convention sometimes used in the
380    Bayesian inference literature that a conditional expression in square
381    brackets indicates the probability density of the quantity on the left
382    of the $|$ given the quantities on the right of the $|$.  That is
383    \begin{equation}
384      \label{eq:Bayesnotation}
385      \left[\by|\bbeta,\sigma^2,\btheta\right]=f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)
386    \end{equation}
388    Model (\ref{eq:lmePerm}) specifies the conditional distributions
389    \begin{equation}
390      \label{eq:ycond}
391      \left[\by|\bbeta,\sigma^2,\bb\right]=
392      \frac{\exp\left\{-\|\by-\bX\bbeta-\bZ\bP\trans\bP\bb\|^2/\left(2\sigma^2\right)\right\}}
393      {\left(2\pi\sigma^2\right)^{n/2}}
394    \end{equation}
395    and
396    \begin{equation}
397      \label{eq:bmarg}
398      \begin{split}
399        \left[\bb|\btheta,\sigma^2\right]&=
400        \frac{\exp\left\{-\bb\trans\bSigma^{-1}\bb/2\right\}}
401        {|\bSigma|^{1/2}\left(2\pi\right)^{q/2}}\\
402        &=\frac{|\bOmega|^{1/2}\exp\left\{-\bb\trans\bOmega\bb/\left(2\sigma^2\right)\right\}}
403        {\left(2\pi\sigma^2\right)^{q/2}}
404      \end{split}
405    \end{equation}
406    from which we can derive the marginal distribution
407    \begin{equation}
408      \label{eq:ymarg}
409      \begin{split}
410        \left[\bb|\btheta,\sigma^2\right]&=
411        \int_{\bb}\left[\by|\bbeta,\sigma^2,\bb\right]\left[\bb|\btheta,\sigma^2\right]\,d\bb\\
412        \frac{|\bOmega|^{1/2}}{\left(2\pi\sigma^2\right)^{n/2}}
413        \int_{\bb}
414        \frac{\exp\left\{-\left[\|\by-\bX\bbeta-\bZ\bP\trans\bP\bb\|^2+\bb\trans\bOmega\bb\right]
415            /\left(2\sigma^2\right)\right\}}
416        {\left(2\pi\sigma^2\right)^{q/2}}\,d\bb
417      \end{split}
418  \end{equation}  \end{equation}
 which will be the sum of the crossproducts of the $m_i$ inner blocks  
 of $q_i$ columns in the $i$th outer block of columns of  
 When $i=k=1$ the result is  
 $\sum_{j=1}^{m_1}\bD^{-1/2}_{1:j}\bD^{-\mathsf{T}/2}_{1:j}$, which can  
 be evaluated in a single call to the level-3 BLAS routine  
 \code{dsyrk}.  (Having the 3 dimensional array containing the $m_i$  
 inner blocks of the $i$th outer block of $\bD^{-1/2}$ in the form that  
 allows this to be done in a single, level-3 BLAS call is the reason  
 that we store and invert the upper triangular factors of inner blocks  
 of $\bD$.)  
 When $i=1$ and $k>1$ we initialize the result from the $(1,1)$ block  
 as above, then add the crossproducts of inner blocks of columns in the  
 outer $(2,1)$ block, the outer $(3,1)$ block, and so on.  These blocks  
 of columns are calculated sparsely.  During the initial decomposition  
 we evaluate and store (in blocked, sparse format) $\bL^{-1}_{(i,i)}$  
 for $1<i\le k$.  
420  \bibliography{lme4}  \bibliography{lme4}
421  \end{document}  \end{document}
 \section{Frechet derivatives}  
 In sections that follow we will use the symbol $\der$ for the Frechet  
 derivative.  When used with a vector subscript, such as in  
 $\der_{\theta_i}\bOmega$, it represents a three dimensional array  
 where each face is the same size as $\bOmega$ and the number of faces  
 is the number of elements in $\btheta_i$.  When used with a matrix  
 subscript, such as $\der_{\bOmega_i}\bOmega$ it represents a four  
 dimensional array.  When used without a subscript it indicates the  
 pattern that the expression will take when a specific subscript is  
 In the gradient and ECME calculations the $\der$ symbol occurs in  
 expressions of the form  
 where $\tr$ denotes the trace operator and $\bM$ is an  
 $M\times M$ matrix ($M=\sum_{i=1}^k m_i q_i$).  If we use  
 a vector subscript on $\der$, the expression evaluates to a vector of the same length  
 as the subscript vector.  If we use a matrix subscript, it  
 evaluates to a matrix of the same dimensions as the subscript matrix.  
 Let us consider the specific example of  
     &= \bJ_i\ovec\left(\tr\left[\left(\der_{\bOmega_i}\right)\bOmega^{-1}  
     &= \bJ_i\ovec\left(\sum_{j=1}^{m_i}\bOmega^{-1}_{i,i,j,j}\right)\\  
     &= m_i\bJ_i\ovec\left[\left(\bOmega_i\right)^{-1}\right]  
 \section{Gradient and ECME step evaluations}  
 The gradients of \ref{eq:ProfiledLogLik} and \ref{eq:ProfiledLogRestLik}  
   \nabla f&=\tr\left[\der\bOmega\left(  
   &=\tr\left[\left(\der \bOmega\right)(\bZ\trans\bZ+\bOmega)^{-1}\right]  
   -\tr\left[\left(\der \bOmega\right)\bOmega^{-1}\right]  
   +\tr\left[\left(\der \bOmega\right)\left(\frac{\widehat{\bb}}{\widehat{\sigma}}  
   \nabla f_R&=\tr\left[\der\bOmega\left(  
   &=\tr\left[\left(\der \bOmega\right)\vb\right]  
   -\tr\left[\left(\der \bOmega\right)\bOmega^{-1}\right]  
   +\tr\left[\left(\der \bOmega\right)\left(\frac{\widehat{\bb}}  
 where $\der$ denotes the Frechet derivative.  
 The gradient calculation is closely related to the update step in the  
 ECME algorithm described in \citep{bate:debr:2004}.  This algorithm,  
 which provides a stable way of refining starting estimates  
 $\btheta^{(0)}$ to get into the region of the optimum, determines  
 $\btheta^{(\nu+1)}$ from $\btheta^{(\nu)}$ as the value that satisfies  
 Let us consider each of the terms in (\ref{eq:gradDev}),  
 (\ref{eq:gradDevRest}), (\ref{eq:theta1}). and (\ref{eq:theta1R}) when  
 the derivative is with respect to $\bOmega_i$.  We have already shown that  
   =m_i \bOmega_i^{-1}  
 The Frechet derivative  
 with respect to some parameterization of $\bOmega$ is a three  
 dimensional array of size $M\times M\times \ell$ where  
 $M=\sum_{i=1}^k m_i q_i$ is the size of $\bOmega$ and $\ell$ is the  
 dimension of the parameter vector.  Because each of element of  
 $\btheta$ affects only one $\bOmega_i$, we can evaluate the gradients  
 and the ECME updates by first evaluating the derivatives with respect to  
 the elements of $\bOmega_i$.  For each of the terms in the expressions  
 for the gradients and the ECME update we will form a list of $k$  
 matrices, the $i$th of which is of size $q_i\times q_i$.  For example,  
 is a $q_i\times q_i$ matrix with  
   =m_i \bbe_r\bOmega_i^{-1}\bbe_c\trans  
 in row $r$ and column $c$.  
 The slots include \code{ZtX}, which contains $\bZ\trans\tilde{\bX}$  
 stored densely; \code{XtX}, which contains  
 $\tilde{\bX}\trans\tilde{\bX}$ stored densely; and the triplet of  
 slots \code{ZZp}, \code{ZZi}, and \code{ZZx} that provide the sparse,  
 symmetric, blocked storage for $\bZ\trans\bZ$. As described in  
 Appendix~\ref{app:Sparse}, the column pointers \code{ZZp} and the row  
 indices \code{ZZi} are from the sparse symmetric representation of the  
 pairwise crosstabulation matrix.  The column pointer vector is of  
 length $M+1$ where $M=\sum_{i=1}^k m_i$ is the size of the pairwise  
 crosstabulation.  The length of the row index vector is the number of  
 nonzeros in the upper triangle of the pairwise crosstabulation.  The  
 \code{ZZx} slot is a list of $\binom{k+1}{2}$ 3-dimensional numeric  
 arrays, corresponding to the outer blocks of rows and columns in the  
 upper triangle of $\bZ\trans\bZ$.  The array for the $(i,j)$ block is  
 of size $q_i\times q_j\times\mathtt{nz}_{i,j}$, where  
 $\mathtt{nz}_{i,j}$ is the number of nonzeros in the $(i,j)$ block of  
 the pairwise crosstabulation.  
 The \code{Omega} slot is a list of $k$ matrices of sizes $q_i\times  
 q_i$ that contain the $\bOmega_i$.  Only the upper triangles of these  
 symmetric matrices are stored.  Initial values for these matrices are  
 calculated from the $\bZ_i$ in a separate function call.  
 The $\binom{q_i+1}{2}$ dimensional vector $\btheta_i$ is an  
 unconstrained parameter for $\bOmega_i$.  When $q_i=1$, $\bOmega_i$ is  
 a $1\times 1$ positive definite matrix, which we can consider as a  
 positive scalar, and we set  
 $\btheta_i=\log\left(\left\{\bOmega_i\right\}_{1,1}\right)$.  Details  
 of the parameterization for $q_i>1$, including the  
 $\binom{q_i+1}{2}\times q_i^2$ Jacobian  
 $\bJ_i=\nabla_{\btheta_i}\ovec(\bOmega_i)$ and the  
 $\binom{q_i+1}{2}\times\binom{q_i+1}{2}\times q_i^2$ second derivative  
 array $\nabla_{\btheta_i}^2\ovec(\bOmega_i)$, are given in  
 For $g_i$ a scalar function of $\bOmega_i$ we denote by  
 $\nabla_{\Omega_i}g_i$ the $q_i\times q_i$ gradient matrix with elements  
   \frac{\partial g_i}{\partial\left\{\bOmega_i\right\}_{s,t}}\quad s,t=1,\dots,q_i .  
 Although $\bOmega_i$ is required to be symmetric, we do not impose  
 this restriction on $\nabla_{\Omega_i}g_i$. (However, for all the  
 functions $g_i(\bOmega_i)$ that we will consider this gradient is  
 indeed symmetric.)  
 The $\binom{q_i+1}{2}$ dimensional gradient  
   =\bJ_i \ovec(\nabla_{\bOmega_i}g_i)  
 The $\binom{q_i+1}{2}\times\binom{q_i+1}{2}$ Hessian  
   \nabla_{\btheta_i}^2 g_i(\bOmega_i)  
   = \bJ_i \ovec(\nabla_{\bOmega_i}^2 g_i)\bJ_i\trans  
   \left[\ovec(\nabla_{\bOmega_i} g_i)\right]  
 where the ``square bracket'' multiplication in the last term  
 represents an inner product over the last index in  
 Eventually we want to produce the $Q=\sum_{i=1}^k\binom{q_i+1}{2}$  
 dimensional vector $\tr\left[(\der_{\btheta}\bOmega)(\bM)\right]$ for  
 some matrix $\bM$ of the same size as $\bOmega$.  We obtain the  
 gradient with respect to $\btheta_i$ as  
 and concatenate these vectors to produce  

Removed from v.1210  
changed lines
  Added in v.1211
ViewVC Help
Powered by ViewVC 1.0.0  
Thanks to:
Vienna University of Economics and Business Powered By FusionForge