%% SCM source: /branches/trunk-lme4/inst/doc/Implementation.Rnw
%% Tue Jan 31 00:25:18 2006 UTC, by bates -- "More on implementation"
\documentclass[12pt]{article}
\usepackage{Sweave}
\usepackage{myVignette}
\usepackage[authoryear,round]{natbib}
\bibliographystyle{plainnat}
\DefineVerbatimEnvironment{Sinput}{Verbatim}
{formatcom={\vspace{-2.5ex}},fontshape=sl,
fontfamily=courier,fontseries=b, fontsize=\scriptsize}
\DefineVerbatimEnvironment{Soutput}{Verbatim}
{formatcom={\vspace{-2.5ex}},fontfamily=courier,fontseries=b,%
fontsize=\scriptsize}
%%\VignetteIndexEntry{Implementation Details}
%%\VignetteDepends{Matrix}
%%\VignetteDepends{lattice}
%%\VignetteDepends{lme4}
\begin{document}
\SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,strip.white=TRUE}
\SweaveOpts{prefix=TRUE,prefix.string=figs/Example,include=FALSE}
\setkeys{Gin}{width=\textwidth}
\title{Linear mixed model implementation in lme4}
\author{Douglas Bates\\Department of Statistics\\University of
  Wisconsin -- Madison}
\maketitle
\begin{abstract}
Expressions for the evaluation of the profiled log-likelihood or
profiled log-restricted-likelihood of a linear mixed model, the
gradients and Hessians of these criteria, and update steps for an
ECME algorithm to optimize these criteria are given in
\citet{bate:debr:2004}.  In this paper we generalize those formulae and
describe the representation of mixed-effects models using sparse
matrix methods available in the \code{Matrix} package.
\end{abstract}

<<preliminaries,echo=FALSE,print=FALSE>>=
library(lattice)
library(Matrix)
library(lme4)
options(width = 80, show.signif.stars = FALSE,
        lattice.theme = function() canonical.theme("pdf", color = FALSE))
@
\section{Introduction}
\label{sec:Intro}

General formulae for the evaluation of the profiled log-likelihood and
profiled log-restricted-likelihood in a linear mixed model are given
in \citet{bate:debr:2004}.  Here we describe a more general
formulation of the model using sparse matrix decompositions and the
implementation of these methods in the \code{lmer} function for \RR{}.

In \S\ref{sec:Form} we describe the form and representation of the
model.  The calculation of the criteria to be optimized by the
parameter estimates and related quantities is discussed in
\S\ref{sec:Cholesky}.  Details of the calculation of the ECME step and the
evaluation of the gradients of the criteria are given in
\S\ref{sec:ECME} and those of the Hessian in \S\ref{sec:Hessian}.  In
\S\ref{sec:Unconstrained} we give the details of an unconstrained
parameterization for the model and the transformation of our
results to this parameterization.

\section{Form and representation of the model}
\label{sec:Form}

We consider linear mixed models of the form
\begin{equation}\label{eq:lmeGeneral}
  \by=\bX\bbeta+\bZ\bb+\beps\quad
  \beps\sim\mathcal{N}(\bzer,\sigma^2\bI),\;
  \bb\sim\mathcal{N}(\bzer,\bSigma),\;
  \beps\perp\bb
\end{equation}
where $\by$ is the $n$-dimensional response vector, $\bX$ is an
$n\times p$ model matrix for the $p$ dimensional fixed-effects vector
$\bbeta$, $\bZ$ is the $n\times q$ model matrix for the $q$
dimensional random-effects vector $\bb$, which has a Gaussian
distribution with mean $\bzer$ and variance-covariance matrix
$\bSigma$, and $\beps$ is the random noise assumed to have a spherical
Gaussian distribution.  The symbol $\perp$ indicates independence of
random variables.

We will assume that $\bX$ has full column rank and that $\bSigma$ is
positive definite.

\subsection{Structure of the variance-covariance matrix}
\label{sec:sigmaStructure}

Components of the random-effects vector $\bb$ and portions of its
variance-covariance matrix $\bSigma$ are associated with $k$ grouping
factors $\bbf_i, i=1,\dots,k$, each of length $n$, and with the $n_i$
levels of the $i$th grouping factor.  In general there are $q_i$
components of $\bb$ associated with each of the $n_i$ levels of the
grouping factor $\bbf_i, i = 1,\dots,k$.  Thus
\begin{equation}\label{eq:qdef}
  q = \sum_{i=1}^k n_i q_i .
\end{equation}

We assume that the components of $\bb$ and the rows and columns of
$\bSigma$ are ordered according to the $k$ grouping factors and,
within the block for the $i$th grouping factor, according to the $n_i$
levels of the grouping factor.

Random effects associated with different grouping factors are
independent.  This implies that $\bSigma$ is block-diagonal with $k$
diagonal blocks of orders $n_i q_i, i=1,\dots,k$.

Random effects associated with different levels of the same grouping
factor are independent.  This implies that the $i$th (outer) diagonal
block of $\bSigma$ is itself block diagonal in $n_i$ blocks of order $q_i$.
We say that the structure of $\bSigma$ is block/block diagonal.

Finally, the variance-covariance matrix within each of the
$q_i$-dimensional subvectors of $\bb$ associated with one of the $n_i$
levels of grouping factor $\bbf_i, i=1,\dots,k$ is a constant (but
unknown) positive-definite symmetric $q_i\times q_i$ matrix
$\bSigma_i,i=1,\dots,k$.  This implies that each of the $n_i$ inner
diagonal blocks of order $q_i$ is a copy of $\bSigma_i$.  We say that
$\bSigma$ has a \emph{repeated block/block diagonal} structure.
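As a toy illustration (not code from the \code{lme4} sources; the
dimensions here are invented), such a repeated block/block diagonal
$\bSigma$ can be assembled with the \code{Matrix} package:

```r
## Invented dimensions: k = 2 grouping factors with n = c(3, 2) levels
## and q = c(2, 1) random effects per level, so q = 3*2 + 2*1 = 8.
library(Matrix)
Sigma1 <- matrix(c(4, 1, 1, 2), 2, 2)   # q_1 x q_1, positive definite
Sigma2 <- matrix(9, 1, 1)               # q_2 x q_2
## Each outer diagonal block is block diagonal in n_i copies of Sigma_i
Sigma <- bdiag(kronecker(Diagonal(3), Sigma1),
               kronecker(Diagonal(2), Sigma2))
dim(Sigma)                              # 8 x 8
```

The \code{kronecker} call replicates $\bSigma_i$ along the diagonal of
the $i$th outer block, and \code{bdiag} stacks the outer blocks.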

\subsection{The relative precision matrix}
\label{sec:relativePrecision}

Many of the computational formulae are more conveniently expressed in
terms of $\bSigma^{-1}$, which is called the
\emph{precision} matrix of the random effects, instead of $\bSigma$.
In fact, the formulae are most conveniently expressed in terms of the
\emph{relative precision matrix} $\sigma^2\bSigma^{-1}$ which we write
as $\bOmega$.  That is,
\begin{equation}\label{eq:relPrec}
  \bOmega = \sigma^2\bSigma^{-1} .
\end{equation}

This is called the ``relative'' precision because it is the precision
of $\bb$ ($\bSigma^{-1}$) relative to the precision of $\beps$
($\sigma^{-2}\bI$).

It is easy to establish that $\bOmega$ will have a repeated
block/block diagonal structure like that of $\bSigma$.  That is,
$\bOmega$ consists of $k$ outer diagonal blocks of sizes $n_i q_i, i = 1,\dots,k$ and the $i$th outer diagonal block is itself block diagonal
with $n_i$ inner blocks of size $q_i\times q_i$.  Furthermore, each of
the inner diagonal blocks in the $i$th outer block is a copy of the
$q_i\times q_i$ positive-definite, symmetric matrix $\bOmega_i$.
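A minimal numerical sketch (with invented values of $\bSigma_i$ and
$\sigma^2$) shows that each $\bOmega_i=\sigma^2\bSigma_i^{-1}$ is again
symmetric and positive definite:

```r
## Invented numbers: one 2 x 2 within-level covariance Sigma_i
## and sigma^2 = 0.25; Omega_i = sigma^2 * Sigma_i^{-1}.
Sigma_i <- matrix(c(4, 1, 1, 2), 2, 2)
sigma2  <- 0.25
Omega_i <- sigma2 * solve(Sigma_i)
## Omega_i inherits symmetry and positive definiteness from Sigma_i
all.equal(Omega_i, t(Omega_i))
eigen(Omega_i, only.values = TRUE)$values > 0
```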

Because $\bOmega$ has a repeated block/block structure we can define
the entire matrix by specifying the symmetric matrices $\bOmega_i, i = 1,\dots,k$ and, because of the symmetry, $\bOmega_i$ has at most
$q_i(q_i+1)/2$ distinct elements.  We will write $\btheta$ for a
parameter vector of length at most $\sum_{i=1}^{k}q_i(q_i+1)/2$ that
determines $\bOmega$.  For example, we could use the non-redundant
elements in the $\bOmega_i$ as the components of $\btheta$.  In fact
we use a different, but equivalent, parameterization for reasons to be
discussed later.

We need only store the matrices $\bOmega_i,i=1,\dots,k$ and the
number of levels of the grouping factors to be able to create $\bOmega$.
The matrices $\bOmega_i$ are stored in the \code{Omega} slot of an
object of class \code{"lmer"}. The values of $k$ and $n_i, i=1,\dots,k$ can be determined from the list of the grouping factors
themselves, stored as the \code{flist} slot, or from the dimensions
$q_i,i=1,\dots,k$, stored in the \code{nc} slot, and the group
pointers, stored in the \code{Gp} slot.  Successive differences of the
group pointers are the total number of components of $\bb$ associated
with the $i$th grouping factor.  That is, these differences are $n_i q_i,i=1,\dots,k$.  The first element of the \code{Gp} slot is always 0.
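Using the dimensions of model \code{Mm1} from the examples below, the
relationship among the group pointers, $n_i$ and $q_i$ can be sketched
directly (the vector \code{Gp} here is computed by hand rather than
extracted from a fitted model):

```r
## Dimensions of Mm1: k = 3 grouping factors
n_i <- c(10732, 1374, 80)        # levels per grouping factor
q_i <- c(2, 1, 2)                # random effects per level
Gp  <- c(0, cumsum(n_i * q_i))   # group pointers; first element always 0
Gp                               # 0 21464 22838 22998
diff(Gp)                         # n_i * q_i for each grouping factor
diff(Gp) / q_i                   # recovers n_i
```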

\subsection{Examples}
\label{sec:OmegaExamp}

Consider the fitted models
<<fmi>>=
Sm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
data(Chem97, package = "mlmRev")
Cm1 <- lmer(score ~ gcsescore + (1|school) + (1|lea), Chem97,
            control = list(niterEM = 0, gradient = FALSE))
data(star, package = "mlmRev")
Mm1 <- lmer(math ~ gr+sx*eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
            control = list(niterEM = 0, gradient = FALSE))
@

Model \code{Sm1} has a single grouping factor with $18$ levels.  The
\code{Omega} slot is a list of length one containing a $2\times 2$
symmetric matrix.
<<Sm1grp>>=
str(Sm1@flist)
show(Sm1@Omega)
show(Sm1@nc)
show(Sm1@Gp)
diff(Sm1@Gp)/Sm1@nc
@

Model \code{Cm1} has two grouping factors: the \code{school} factor
with 2410 levels and the \code{lea} factor (local education authority,
similar to a school district in the U.S.A.) with 131 levels.  It
happens that the \code{school} factor is nested within the \code{lea}
factor, a property that we discuss below. The \code{Omega} slot is a
list of length two containing two $1\times 1$ symmetric matrices.
<<Cm1grp>>=
str(Cm1@flist)
show(Cm1@Omega)
show(Cm1@nc)
show(Cm1@Gp)
diff(Cm1@Gp)/Cm1@nc
@

Model \code{Mm1} has three grouping factors: \code{id} (the student)
with 10732 levels, \code{tch} (the teacher) with 1374 levels and
\code{sch} (the school) with 80 levels. The \code{Omega} slot is a
list of length three containing two $2\times 2$ symmetric matrices and
one $1\times 1$ matrix.
<<Mm1grp>>=
str(Mm1@flist)
show(Mm1@Omega)
show(Mm1@nc)
show(Mm1@Gp)
diff(Mm1@Gp)/Mm1@nc
@

The last element of the \code{Gp} slot is the dimension of $\bb$.
Notice that for model \code{Mm1} the dimension of $\bb$ is 22,998.
This is also the order of the symmetric matrix $\bOmega$ although the
contents of the matrix are determined by $\btheta$ which has a length
of $3+1+3=7$ in this case.
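This count can be verified from the $q_i$ alone (a restatement of the
arithmetic in the text):

```r
## Sizes of the Omega_i blocks in Mm1
q_i <- c(2, 1, 2)
sum(q_i * (q_i + 1) / 2)   # 3 + 1 + 3 = 7 parameters in theta
```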

Table~\ref{tbl:dims} summarizes some of the dimensions in these examples.

\begin{table}[tb]
\centering
\begin{tabular}{r r r r r r r r r r r}
\hline\\
\multicolumn{1}{c}{Name}&\multicolumn{1}{c}{$n$}& \multicolumn{1}{c}{$k$}&
\multicolumn{1}{c}{$n_1$}&\multicolumn{1}{c}{$q_1$}&
\multicolumn{1}{c}{$n_2$}&\multicolumn{1}{c}{$q_2$}&
\multicolumn{1}{c}{$n_3$}&\multicolumn{1}{c}{$q_3$}&
\multicolumn{1}{c}{$q$}&\multicolumn{1}{c}{$\#(\btheta)$}\\
\hline
\code{Sm1}&  180&1&   18&2&    & &  & &   36&3\\
\code{Cm1}&31022&2& 2410&1& 131&1&  & & 2541&2\\
\code{Mm1}&24578&3&10732&2&1374&1&80&2&22998&7\\
\hline
\end{tabular}
\caption{Summary of various dimensions of model fits used as
examples.}
\label{tbl:dims}
\end{table}

\subsection{Permutation of the random-effects vector}
\label{sec:permutation}

For most mixed-effects model fits, the model matrix $\bZ$ for the
random effects vector $\bb$ is large and sparse.  That is, most of the entries
in $\bZ$ are zero (by design, not by accident).
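For instance, with a toy grouping factor (invented here, not one of the
fitted models), the model matrix for a random intercept has exactly one
nonzero per row:

```r
## Toy example: 8 observations grouped by a factor with 3 levels;
## Z for a random intercept is the 8 x 3 indicator matrix.
library(Matrix)
f <- factor(c("a", "a", "b", "c", "b", "a", "c", "b"))
Z <- sparseMatrix(i = seq_along(f), j = as.integer(f), x = 1)
nnzero(Z)          # 8 nonzeros out of 24 entries
crossprod(Z)       # Z'Z is diagonal: observation counts per level
```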

Numerical analysts have developed special techniques for representing
and manipulating sparse matrices. Of particular importance to us are
techniques for obtaining the left Cholesky factor $\bL(\btheta)$ of
a large, sparse, positive-definite, symmetric matrix.  In
particular, we want to obtain the Cholesky factorization of
$\bZ\trans\bZ+\bOmega(\btheta)$ for different values of $\btheta$.

Sparse matrix operations are typically performed in two phases: a
\emph{symbolic phase} in which the number of non-zero elements in the
result and their positions are determined followed by a \emph{numeric
phase} in which the actual numeric values are calculated.  Advanced
sparse Cholesky factorization routines, such as the CHOLMOD library
(Davis, 2005) which we use, allow for calculation of a fill-reducing
permutation of the rows and columns during the symbolic phase.  In
fact the CHOLMOD code allows for evaluation of both a fill-reducing
permutation and a post-ordering that groups the columns of $\bL$ into
blocks with identical row structure, thus allowing for the dense
matrix techniques to be used on the blocks or ``super-nodes''.  Such a
decomposition is called a \emph{supernodal} Cholesky factorization.

Because the number and positions of the nonzeros in $\bOmega(\btheta)$
do not change with $\btheta$ and the nonzeros in $\bOmega(\btheta)$
are a subset of the nonzeros in $\bZ\trans\bZ$, we only need to
perform the symbolic phase once and we can do so using $\bZ\trans\bZ$.
That is, using $\bZ$ only we can determine the permutation matrix
$\bP$ for all decompositions of the form
\begin{equation}\label{eq:Cholesky}
  \bP\left[\bZ\trans\bZ+\bOmega(\btheta)\right]\bP\trans =
  \bL(\btheta)\bL(\btheta)\trans .
\end{equation}
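The decomposition in (\ref{eq:Cholesky}) can be sketched with the
\code{Matrix} package interface to CHOLMOD on a toy matrix standing in
for $\bZ\trans\bZ+\bOmega(\btheta)$ (\code{perm = TRUE} requests the
fill-reducing permutation and \code{super = TRUE} the supernodal form):

```r
## Toy sparse, symmetric, positive-definite (diagonally dominant) matrix
library(Matrix)
A <- bandSparse(6, k = c(0, 1),
                diagonals = list(rep(4, 6), rep(1, 5)),
                symmetric = TRUE)
ch <- Cholesky(A, perm = TRUE, super = TRUE, LDL = FALSE)
e  <- expand(ch)   # list with sparse factor L and permutation P
## Check the identity P A P' = L L'
max(abs(e$P %*% A %*% t(e$P) - tcrossprod(e$L)))
```

Here \code{expand(ch)} returns the permutation $\bP$ and factor $\bL$,
so the identity can be checked numerically.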

We revise (\ref{eq:lmeGeneral}) by incorporating the permutation to obtain
\begin{equation}\label{eq:lmePerm}
  \by=\bX\bbeta+\bZ\bP\trans\bP\bb+\beps\quad
  \beps\sim\mathcal{N}(\bzer,\sigma^2\bI),\;
  \bb\sim\mathcal{N}(\bzer,\bSigma),\;
  \beps\perp\bb
\end{equation}

\subsection{Extent of the sparsity}
\label{sec:sparsity}

Table~\ref{tbl:sparse} shows the extent of the sparsity of the
matrices $\bZ$, $\bZ\trans\bZ$ and $\bL$ in our examples.

The matrix $\bL$ is the supernodal representation of the left Cholesky
factor of $\bP\left(\bZ\trans\bZ+\bOmega\right)\bP\trans$.  Because
the fill-reducing permutation $\bP$ has been applied the number of
nonzeros in $\bL$ will generally be less than the number of nonzeros
in the left Cholesky factor of $\bZ\trans\bZ+\bOmega$.  However, when
a supernode of $\bL$ contains more than one column, some elements
above the diagonal of $\bL$ are stored, and these elements are
necessarily zero.  They are stored in the supernodal factorization so
that the diagonal block for a supernode can be treated as a dense
rectangular matrix.

Generally, storing these few elements from above the diagonal is a
minor expense compared to the speed savings of using dense-matrix
BLAS (Basic Linear Algebra Subprograms) code.  Furthermore, they are
never used in computation because the diagonal block is treated as a
densely stored lower triangular matrix.  We mention this so the
interested reader can understand why, in \code{Sm1}, for example,
$\nz(\bZ\trans\bZ)$ is 54 ($=18\times 3$) yet $\nz(\bL)$ is 72
($=18\times 4$).  In this example there is no fill-in, so the number of
nonzeros in $\bL$ is actually 54, but the matrix is stored as 18
dense $2\times 2$ diagonal blocks.
\begin{table}[tb]
\centering
\begin{tabular}{r r r r r r r r r r r}
\hline\\
\multicolumn{1}{c}{Name}&\multicolumn{1}{c}{$n$}& \multicolumn{1}{c}{$k$}&
\multicolumn{1}{c}{$q$}&
\multicolumn{1}{c}{$\nz(\bZ)$}&\multicolumn{1}{c}{$\spr(\bZ)$}&
\multicolumn{1}{c}{$\nz(\bZ\trans\bZ)$}&\multicolumn{1}{c}{$\spr(\bZ\trans\bZ)$}&
\multicolumn{1}{c}{$\nz(\bL)$}&\multicolumn{1}{c}{$\spr(\bL)$}\\
\hline
\code{Sm1}&  180&1&   36&  360&0.0556&   54&0.0811&   72&0.1081\\
\hline
\end{tabular}
\caption{Summary of the sparsity of the model matrix $\bZ$, its
crossproduct matrix $\bZ\trans\bZ$ and the left Cholesky factor
$\bL$ in the examples.  The notation $\nz{}$ indicates the number of nonzeros in the
matrix and $\spr{}$ indicates the sparsity index (the fraction of the elements
in the matrix that are non-zero). Because $\bZ\trans\bZ$ is
symmetric only the nonzeros in the upper triangle are counted and
the sparsity index is relative to the total number of elements in
the upper triangle.  The number of nonzeros and the sparsity index
of $\bL$ are calculated for the supernodal decomposition, which
can include some elements that are above the diagonal and hence must
be zero.}
\label{tbl:sparse}
\end{table}

\section{Likelihood and restricted likelihood}
\label{sec:likelihood}

In general the \emph{maximum likelihood estimates} of the parameters
in a statistical model are those values of the parameters that
maximize the likelihood function.  The likelihood has the same
numerical value as the probability density of $\by$ given the
parameters, but it is regarded as a function of the parameters given
$\by$, not as a function of $\by$ given the parameters.

For model (\ref{eq:lmePerm}) the parameters are
$\bbeta$, $\sigma^2$ and $\btheta$ (as described in
\S\ref{sec:relativePrecision}, $\btheta$ and $\sigma^2$ jointly
determine $\bSigma$) so we evaluate the likelihood
$L(\bbeta, \sigma^2, \btheta|\by)$ as
\begin{equation}\label{eq:Likelihood1}
  L(\bbeta, \sigma^2, \btheta|\by) =
  f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)
\end{equation}
where $f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)$
is the marginal probability density for $\by$ given the parameters.

Because we will need to write several different marginal and
conditional probability densities in this section and expressions like
$f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta)$ are
difficult to read, we will adopt a convention sometimes used in the
Bayesian inference literature that a conditional expression in square
brackets indicates the probability density of the quantity on the left
of the $|$ given the quantities on the right of the $|$.  That is,
\begin{equation}\label{eq:Bayesnotation}
  \left[\by|\bbeta,\sigma^2,\btheta\right]=
  f_{\by|\bbeta,\sigma^2,\btheta}(\by|\bbeta,\sigma^2,\btheta) .
\end{equation}

Model (\ref{eq:lmePerm}) specifies the conditional distributions
\begin{equation}\label{eq:ycond}
  \left[\by|\bbeta,\sigma^2,\bb\right]=
  \frac{\exp\left\{-\|\by-\bX\bbeta-\bZ\bP\trans\bP\bb\|^2/\left(2\sigma^2\right)\right\}}
       {\left(2\pi\sigma^2\right)^{n/2}}
\end{equation}
and
\begin{equation}\label{eq:bmarg}
  \begin{split}
    \left[\bb|\btheta,\sigma^2\right]&=
    \frac{\exp\left\{-\bb\trans\bSigma^{-1}\bb/2\right\}}
         {|\bSigma|^{1/2}\left(2\pi\right)^{q/2}}\\
    &=\frac{|\bOmega|^{1/2}\exp\left\{-\bb\trans\bOmega\bb/\left(2\sigma^2\right)\right\}}
           {\left(2\pi\sigma^2\right)^{q/2}}
  \end{split}
\end{equation}
from which we can derive the marginal distribution
\begin{equation}\label{eq:ymarg}
  \begin{split}
    \left[\by|\bbeta,\sigma^2,\btheta\right]&=
    \int_{\bb}\left[\by|\bbeta,\sigma^2,\bb\right]
    \left[\bb|\btheta,\sigma^2\right]\,d\bb\\
    &=\frac{|\bOmega|^{1/2}}{\left(2\pi\sigma^2\right)^{n/2}}
    \int_{\bb}
    \frac{\exp\left\{-\left[\|\by-\bX\bbeta-\bZ\bP\trans\bP\bb\|^2+\bb\trans\bOmega\bb\right]
        /\left(2\sigma^2\right)\right\}}
         {\left(2\pi\sigma^2\right)^{q/2}}\,d\bb
  \end{split}
\end{equation}

\bibliography{lme4}
\end{document}