% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
\documentclass[english, a4paper, 10pt, headings=small, DIV11]{scrartcl}
\usepackage{longtable}
\SweaveInput{vignettes.defs}
\SweaveOpts{prefix.string=introduction-fig-}
\hypersetup{pdftitle={hyperSpec Introduction},
pdfauthor={C. Beleites},
pdfsubject={Introduction on the R package hyperSpec},
pdfkeywords={hyperSpec}}
% \VignetteIndexEntry{hyperSpec Introduction}
% \VignetteKeywords{hyperspec}
% \VignettePackage{hyperSpec}
% \VignetteDepends{MASS}
\usepackage{makeidx}
\makeindex
\begin{document}
\title{\phy Introduction }
\maketitle
\warnbox{Reproducing the Examples in this Vignette}{
All spectra used in this manual are installed automatically with \phy.
Note that some definitions are executed in \texttt{vignette.defs}.
}
\tableofcontents
\warnbox[blue!50!black]{Suggested Packages}{
To build this vignette, some packages are suggested but not strictly needed:
\begin{labeling}{latticeExtra}
<>=
check.req.pkg ("pls", special = list (msc = function (x) {texterrormsg ("msc", "pls"); x}))
check.req.pkg ("baseline",
special = list (
baseline = function (x) {texterrormsg ("baseline", "baseline"); x},
getCorrected = function (x) {texterrormsg ("getCorrected", "baseline"); x}
))
check.req.pkg ("ggplot2", donothing = "")
check.req.pkg ("compiler", donothing = "")
check.req.pkg ("inline", donothing = "")
@
%
\end{labeling}
}
\section{Introduction}
\phy is a R package that allows convenient handling of \index{hyperspectral data sets}hyperspectral
data sets, \ie data sets combining spectra with further data on a per-spectrum basis. The spectra can
be anything that is recorded over a common discretized axis.
This vignette gives an introduction on basic working techniques using the R package \phy. This is
done mostly from a spectroscopic point of view: rather than going through the functions provided by
\phy, it is organized in spectroscopic tasks. However, the functions discussed are printed on the
margin for a quick overview.
\phy comes with five data sets,
\begin{labeling}{wavelength: }
\item [\Robject{chondro}] \index{data sets!chondro}a Raman map of chondrocytes in cartilage,
\item [\Robject{flu}] \index{data sets!flu}a set of fluorescence spectra of a calibration series, and
\item [\Robject{laser}] \index{data sets!laser}a time series of an unstable laser emission
\item [\Robject{paracetamol}] \index{data sets!paracetamol}a Raman spectrum of paracetamol
(acetaminophene) ranging from 100 to \rcm{3200} with overlapping wavelength ranges.
\item [\Robject{barbiturates}] \index{data sets!barbiturates}GC-MS spectra with differing wavelength
axes as a list of \Sexpr{length (barbiturates)} \chy objects.
\end{labeling}
In this vignette, the data sets are used to illustrate appropriate procedures for different tasks and
different spectra. In addition, the first three data sets are accompanied by their own vignettes
showing exemplary work flows for the respective data type.
This document describes how to accomplish typical tasks in the analysis of spectra. It does not give
a complete reference on particular functions. It is therefore recommended to look up the methods in
R's help system using \Rcode{?~command}.
A complete list of the functions available in \phy is given in
appendix~\ref{tab:functions}~(p.~\pageref{tab:functions}).
\subsection{Notation and Terms}
Throughout the documentation of the package, the following terms are used:
\begin{labeling}{wavelength: }
\item [wavelength:] \index{wavelength}spectral abscissa\\
frequency, wavenumbers, chemical shift, Raman shift,
$\frac{m}{z}$, etc.
\item [intensity:] \index{intensity}spectral ordinate\\
transmission, absorbance, $\frac{e^{-}}{s}$, intensity, \textellipsis
\item [extra data:] \index{extra data}further information/data belonging to each spectrum\\
spatial information (spectral images, maps, or profiles), temporal information
(kinetics, time series), concentrations (calibration series), class membership information, etc.\\
\chy object may contain arbitrary numbers of extra data columns.
\end{labeling}
In R, slots of a S4 class are accessed by the \index{\Rcommand{"@} operator}\Rcommand{@} operator. In
this vignette, the notation \Rcode{@xxx} will thus mean \emph{``slot xxx of an object''}. Likewise,
named elements of a \Rclass{list} and columns of a \df are accessed by the \index{\Rcommand{\$}
operator}\Rcommand{\$} operator, and \Rcode{\$xxx} will be used for \emph{``column xxx''}, and as an
abbreviation for \emph{``column xxx of the \df in slot data of the object''} (the structure of \chy
objects is discussed in section~\ref{sec:structure-chy}, p.~\pageref{sec:structure-chy}).
\section{Remarks on R}
\subsection{Generic Functions}
\emph{\index{Generic Functions}Generic Functions} are functions that apply to a wide range of data
types or classes, \eg \Rmethod{plot}, \Rmethod{print}, mathematical operators, etc. These functions
can be implemented in a specialized way by each class. \Rclass{hyperSpec} implements with a variety
of such functions, see appendix~\ref{tab:functions} (p.~\pageref{tab:functions}).
\subsection{Functionality Can be Extended at Runtime}
\R's concept of functions offers much flexibility. Functions may be added or changed by the user in
his \emph{workspace} at any time. This is also true for methods belonging to a certain class.
Neither restart of R nor reloading of the package or anything the like is needed. If the original
function resides in a namespace (as it is the case for all functions in \phy), the original function
is not deleted. It is just masked by the user's new function but stays accessible via the \Rcode{::}
operator.
The same is true for ``normal'' variables: You may create changed copies of the example data sets,
work with these and then ``reset'' to \phy's version of the data set by removing the object in your
workspace.
This offers the opportunity of easily writing specialized functions that are adapted to specific
tasks. \phy's vignettes use this to set up special versions of the lattice graphics functions that
are already wrapped in \Rfunction{print} (see also
\href{http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-do-lattice_002ftrellis-graphics-not-work_003f}{R
FAQ: Why do lattice/trellis graphics not work?}) and allow the code in the code chunks of the
vignettes to be exactly what one would type during an interactive \R session. For the code, check the
\texttt{vignettes.defs} file accompanying all \phy vignettes.
\subsection{Validity Checking}
S4 classes have a mechanism to define and enforce that the data actually stored in the object is
appropriate for this class. In other words, there is a mechanism of \emph{\index{validity checking}validity checking}.
The functions provided by \phy check the validity of \Rclass{hyperSpec} objects at the beginning, and
-- if the validity could be broken by inappropriate arguments -- also before leaving the
function. \index{validObject|texttt}\index{chk.hy|texttt} \mFun{validObject, chk.hy}
It is highly recommended to use validity checking also for user-defined functions. In addition,
non-generic functions should first ensure that the argument actually is a \chy object. The two tasks
are accomplished by:
<>=
chk.hy (object)
validObject (object)
@
%
The first line checks whether \Robject{object} is a \chy object, the second checks its validity.
Both functions return \Rcode{TRUE} if the checks succeed, otherwise they raise an error and stop.
\subsection{Special Function Names}
\label{sec:spec-funct-names}
\subsubsection{The Names of Operators}
\label{sec:names-operators}
\index{operators}Operators such as \verb/+/, \verb+-+,\verb+*+, \verb+%%+, etc. are in fact functions
in R. Thus they
can be handed over as arguments to other functions (particularly to the vectorization functions
\Rfunction{*apply}, \Rfunction{sweep}, etc.). In this case the name of the function must be quoted:
\verb+`-`+ is the recommended style (although \verb+"-"+ will often work as well), \eg:
<>=
sweep (flu, 2, mean, `-`)
@
%
These functions can also be called in a more function-like style (prefix notation):
<<>>=
`+` (3, 5)
@
%
\subsubsection{Assignment Functions}
\label{sec:assignment-functions}
\index{assignment functions}
\index{<-@see assignment functions}
R allows the definition of functions that do an assignment (set some part of the object), such as:
<>=
wl (flu) <- new.wavelength.values
@ The actual name of the function is \verb+wl<-+ and must be quoted in order to avoid confusion with
an assignment to variable \verb+wl+: \verb+`wl<-`+.
\section{Loading and the package and configuration}
\label{sec:loading-package}
To \index{loading}load \phy, use\nopagebreak
<>=
library ("hyperSpec")
@
%
\index{options|textbf}
\index{options!debuglevel}
\index{options!gc}
\index{options!log}
The global behaviour of \phy can be configured via options. The values of the options are
retrieved with \Rfunction{hy.getOptions} and \Rfunction{hy.getOption}, and changed with
\Rfunction{hy.setOptions}. Table~\ref{tab:options} gives an overview.
\begin{table}[tb]
\begin{small}
\begin{tabular}{p{.1\textwidth}p{.15\textwidth}p{.45\textwidth}p{.2\textwidth}}
\hline
\textbf{name} & \textbf{default value} (range) & \textbf{description} & \textbf{used by} \\
\cmidrule(lr){1-1} \cmidrule(lr){2-2} \cmidrule(lr){3-3} \cmidrule(lr){4-4}
debuglevel & 0 (1L 2L) & amount of debugging information produced & \Rfunction{spc.identify}, \Rfunction{map.identify} \\
gc & FALSE & triggers frequent calling of \Rcode{gc ()} & \Rfunction{read.ENVI},\newline\Rcode{new ("hyperSpec")} \\
log & TRUE & automatically create logbook entries & \Rfunction{logbook} \\
\hline
%
\end{tabular}
\end{small}
\caption{\phy options. Please refer to the documentation of the respective functions for details about the effect of the options.}
\label{tab:options}
\end{table}
<>=
stopifnot (all (names (hy.getOptions(TRUE)) %in% c ("log", "debuglevel", "gc")))
@
\section{The structure of \Rclass{hyperSpec} objects}
\label{sec:structure-chy}
\Rclass{hyperSpec} is a S4 (or new-style) class. Four slots contain the parts of the object:
\begin{labeling}{@wavelength }
\item [{\Robject{@wavelength}}] containing a numeric vector with the wavelength axis of the spectra.
\item [{\Robject{@data}}] a \df with the spectra and all further information belonging to the spectra
\item [{\Robject{@label}}] a list with appropriate labels (particularly for axis annotations)
\item [{\Robject{@log}}] a \df keeping track of what is done with the object
\end{labeling}
While the parts of the \chy object can be accessed directly, it is good practice to use the functions
provided by \phy to handle the objects rather than accessing the slots directly
(tab.~\ref{tab:getset}). This also ensures that proper (\emph{valid}) objects are returned. In some
cases, however, direct access to the slots can considerably speed up calculations, see section
\ref{sec:speed-considerations} (p.~\pageref{sec:speed-considerations}).
\begin{table}[t]
\centering
\begin{tabular}{>{\Robject}lll}
\toprule
\textbf{slot} & \textbf{get} & \textbf{set} \\
\cmidrule(lr){1-1} \cmidrule(lr){2-2} \cmidrule(lr){3-3}
@wavelength & \verb+wl+ & wl<- \\
@data & \verb+[+, \verb+[[+, \verb+$+, \verb+as.data.frame+, \verb+as.long.df+, \ldots & \verb+[<-+, \verb+[[<-+, \verb+$<-+ \\ %] emacs font lock gets confused otherwise
@label & \verb+labels+ & \verb+labels<-+ \\
@log & \verb+logbook+ & \verb+logentry+ \\
\bottomrule
\end{tabular}
\caption{Get and set functions for the slots of \chy objects}
\label{tab:getset}
\end{table}
Most of the data is stored in \Robject{@data}. This \df has one special column, \Rcode{\$spc}. It is
the column that actually contains the spectra. The spectra are stored in a matrix inside this column,
as illustrated in figure~\ref{fig:structure}. Even if there are no spectra, \Rcode{\$spc} must still
be present. It is then a matrix with zero columns.
\begin{figure}[bt]
\noindent \centering
\includegraphics{strukturhyperspec}
\caption{\label{fig:structure}The structure of the data in a \phy object.}
\end{figure}
Slot \Rcode{@label} contains an element for each of the columns in \Rcode{@data} plus one holding the
label for the wavelength axis, \Rcode{.wavelength}. They are accessed by their names which must be
the same for columns of \Rcode{@data} and the list elements. The elements of the list may be anything
suitable for axis annotations, \ie they should be either character strings or expressions for
``pretty'' axis annotations (see \eg figure~\ref{fig-spcloess} on page~\pageref{fig-spcloess}). To
get familiar with expressions for axis annotation, see \verb+? plotmath+ and \verb+demo (plotmath)+.
\section{Functions provided by \phy}
\label{sec:funct-prov-phy}
Table~\ref{tab:functions}~(p.~\pageref{tab:functions}) in the appendix gives an overview of the
functions implemented by \phy.
\section{Obtaining Basic Information about \phy Objects}
As \mFun{print, show, summary} usual, the \Rmethod{print} and \Rmethod{show} methods display
information about the object, and \Rmethod{summary} yields some additional details about the data
handling done so far:
<>=
chondro
summary (chondro)
@
%
The\mFun{nrow, ncol, nwl, dim} data set \Robject{chondro} consists of \Sexpr{nrow (chondro)} spectra
with \Sexpr{nwl (chondro)} data points each, and \Sexpr{ncol (chondro)} data columns: two for the
spatial information, one factor with the results of a cluster analysis plus \Robject{\$spc}. These
information can be directly obtained by
<>=
nrow (chondro)
nwl (chondro)
ncol (chondro)
dim (chondro)
@
The names of the columns in \Rcode{@data} are accessed by\mFun{colnames, rownames, dimnames, wl}
<>=
colnames (chondro)
@
Likewise, \Rmethod{rownames} returns the names assigned to the spectra, and \Rmethod{dimnames} yields
a list of these three vectors (including also the column names of \Robject{\$spc}). The column names
of the spectra matrix contain the wavelengths as character, while \Rfunction{wl} (see
section~\ref{sec:wavel-axis-conv}, p.~\pageref{sec:wavel-axis-conv}) yields the numeric vector of
wavelengths.
Extra\mFun{colnames<-, rownames<-} data column names and rownames of the object may be set by
\Rfunction{colnames<-} and \Rfunction{rownames<-}, respectively. \Rfunction{colnames<-} renames the
labels as well.
\section{Creating a \Rclass{hyperSpec} Object, Data Import and Export}
\label{sec:create}
\phy comes with filters for a variety of file formats. These are discussed in detail in a separate
vignette accessible via \verb+vignette ("fileio")+.
\subsection{Creating a \chy Object from Spectra Matrix
and Wavelength Vector}
If the data is in R's workspace, a \Rclass{hyperSpec} object is created by:\\
<>=
spc <- new ("hyperSpec", spc = spectra.matrix, wavelength = wavelength.vector, data = extra.data)
@
%
The most frequently needed arguments are:
\begin{labeling}{\Rcode{wavelength}}
\item [{\Rcode{spc}}] the spectra matrix
\item [{\Rcode{wavelength}}] the wavelength axis vector
\item [{\Rcode{data}}] the extra data (can already contain the spectra matrix in column \Rcode{\$spc})
\item [{\Rcode{label}}] a list with the proper labels. Do not forget the wavelength axis label in
\Rcode{\$.wavelength} and the spectral intensity axis label in \Rcode{\$spc}.
\end{labeling}
More information about converting existing data into \chy objects can be found in \Rcode{vignette ("fileio")}.
\subsection{Creating Random Spectra}
If \Rpackage{mvtnorm} is available, multivariate normally distributed spectra can be generated from
mean and covariance matrix using \Rfunction{rmmvnorm}\mFun{rmmvnorm} (fig.~\ref{fig:sim:spc}). Note
that the \phy function's name has an additional ``m'': it already takes care of multiple groups.
Mean spectra and pooled covariance matrix can be calculated using
\Rfunction{pooled.cov}\mFun{pooled.cov}:
<<>>=
pcov <- pooled.cov (chondro, chondro$clusters)
rnd <- rmmvnorm (rep (10, 3), mean = pcov$mean, sigma = pcov$COV)
@
<>=
cluster.cols <- c ("dark blue", "orange", "#C02020")
plot (rnd, col = cluster.cols [rnd$.group])
@
\begin{figure}[t]
\subfloat[\label{fig:sim:spc} \Rfunction{rmmvnorm}]{\includegraphics[width=.66\linewidth]{introduction-fig--simspc}}
\subfloat[\label{fig:sim:lda} LDA of simulated spectra. Crosses mark real spectra.]{\includegraphics[width=.33\linewidth]{introduction-fig--simlda}}
\caption{Multivariate normally distributed random spectra.}
\label{fig:sim}
\end{figure}
fig.~\ref{fig:sim:lda} shows the linear discriminant analysis (LDA) scores of such simulated specta in comparison to the real spectra in the \Robject{chondro} object:
@
<>=
require ("MASS")
rnd <- rmmvnorm (rep (200, 3), mean = pcov$mean, sigma = pcov$COV)
lda <- lda (clusters ~ spc, rnd)
pred.chondro <- predict (lda, chondro)
pred.sim <- predict (lda)
@
<>=
colors <- c("#00008040", "#FFA50040", "#C0202040")
plot (pred.chondro$x, col = colors [chondro$clusters], pch = 3)
points (pred.sim$x, col = colors [rnd$clusters], pch = 20, cex = 0.5)
@
If individual covariance matrices should be used for each group, \Rfunarg{sigma} should be an array
with the 3rd dimension corresponding to the group.
\section{The Logbook}
\label{sec:logbook}
\warnbox{Deprecated}{The logbook is now DEPRECATED and the functionality will be removed in the future.\\
This feature has never seen much use, but slows down \phy considerably.}
Slot \Rcode{@log} of \chy objects is intended to keep track of the history of the object. This
logbook part of the output of the \Rfunction{summary}, and can also be retrieved by
\Rfunction{logbook}.
<>=
logbook (flu)
@
New entries can be created manually by calling \Rfunction{logentry}\mFun{\Rfunction{logentry},
\Rfunction{logbook}}:
<<>>=
tmp <- logentry (flu, short = "test", long = "This could also be a list of parameters")
logbook (tmp)
@
In addition, \phy by default logs automatically all changes to the object:
<<>>=
tmp <- tmp [1:3]
logbook (tmp)
@
%
The automatic logging mechanism can only log function calls and parameters (as opposed to the
intention of the function call). \phy functions that return a changed object allow to use more
meaningful short descriptions: they are assigned via the argument \Rfunarg{short}:
<<>>=
tmp <- sweep (tmp, 2, mean, short = "centering")
logbook (tmp)
@
\index{options!log} Automatic logging may be turned off by
<>=
hy.setOptions (log = FALSE)
@
%
This can help optimizing program execution speed, see section~\ref{sec:speed-considerations}
(p.~\ref{sec:speed-considerations}).
<>=
rm (tmp)
@
\section{Access to the data}
\label{sec:access-parts}
The main functions to retrieve the data of a \chy object are \Rfunction{[]}\mFun{\Rfunction{[]},
\Rfunction{[[]]}} and \Rfunction{[[]]}.
The difference between these functions is that \Rfunction{[]} returns a \chy object, whereas the
result of \Rfunction{[[]]} is a \df if extra data columns were selected or otherwise the
spectra matrix. Single extra data columns may be retrieved by \Rfunction{\$}\mFun{\Rfunction{\$}}.
In order to change data, use \Rfunction{[]<-}, \Rfunction{[[]]<-}, and
\Rfunction{\$<-}\mFun{\Rfunction{[]<-}, \Rfunction{[[]]<-}, \Rfunction{\$<-}}
(see~\ref{sec:square-brack-replace} and \ref{sec:accessing-extra-data}).
\subsection{Access Functions and Abbreviations for Parts of the \chy Object's Data}
\label{sec:fast-access-parts}
\mFun{\Rfunction{[] [[]] \$. \$.. []<- [[]]<- \$<-}}
\Rclass{hyperSpec} comes with three abbreviation functions for easy access to the data:
\begin{labeling}{\Robject{x} \Rfunction{[[\Rfunarg{i}, , \Rfunarg{l}]] <-} }
\item [\Robject{x} \Rfunction{[[]]}] returns the spectra matrix (\Rcode{x\$spc}).
\item [\Robject{x} \Rfunction{[[\Rfunarg{i}, , \Rfunarg{l}]]}] the cut spectra matrix is returned if
wavelengths are specified in \Rfunarg{l}.
\item [\Robject{x} \Rfunction{[[\Rfunarg{i}, \Rfunarg{j}, \Rfunarg{l}]]}] If data columns are
selected (second index), the result is a \df.
\item [\Robject{x} \Rfunction{[[\Rfunarg{i}, , \Rfunarg{l}]] <-}] Also, parts of the spectra matrix
can be set (only indices for spectra and wavelength are allowed for this function).
\item [\Robject{x} \Rfunction{[\Rfunarg{i}, \Rfunarg{j}] <-}] sets parts of \Rcode{x@data}.
\item [\Robject{x} \Rfunction{\$.}] returns the complete \df \Rcode{x@data}, with the
spectra in column \Rcode{\$spc}.
\item [\Robject{x} \Rfunction{\$..}] returns the extra data (\Rcode{x@data} without \Rcode{x\$spc}).
\item [\Robject{x} \Rfunction{\$.. <-}] sets the extra data (\Rcode{x@data} without
\Rcode{x\$spc}). The columns must match exactly in this case.
\end{labeling}
\subsection{Selecting and Deleting Spectra}
\label{sec:select-delete-spectra}
The extraction function \Rfunction{[]} takes the spectra as first argument (For detailed help: see
\verb+? `[`+). It may be a vector giving the indices of the spectra to extract (select), a vector
with negative indices indicating which spectra should be deleted, or a logical. Note that a matrix
given to \Rfunction{[]} will be treated as a vector.
<>=
plot (flu, col = "gray")
plot (flu [1 : 3], add = TRUE)
@
<>=
plot (flu, col = "gray")
plot (flu [-3], add = TRUE)
@
<>=
plot (flu, col = "gray")
plot (flu [flu$c > 0.2], add = TRUE)
@
\subsubsection{Random Samples}
\label{sec:random-samples}
A random subset of spectra is conveniently selected by \Rfunction{sample}
\mFun{\Rfunction{sample}}:
<>=
sample (chondro, 3)
@
If appropriate indices into the spectra are needed instead, use \Rfunction{isample}\mFun{\Rfunction{isample}}:
<>=
isample (chondro, 3)
@
\subsubsection{Sequences}
\label{sec:seq}
Sequences of every n\textsuperscript{th} spectrum or the like can be retrieved with \Rfunction{seq}\mFun{\Rfunction{seq}}:
<>=
seq (chondro, length.out = 3, index = TRUE)
seq (chondro, by = 100)
@
Here, indices may be requested using \Rcode{index = TRUE}.
\subsection{Selecting Extra Data Columns }
\label{sec:accessing-extra-data}
The second argument of the extraction functions \Rfunction{[]} and \Rfunction{[[]]} specifies the
(extra) data columns. They can be given like any column specification for a \df, \ie
numeric, logical, or by a vector of the column names:
<>=
colnames (chondro)
chondro [[1 : 3, 1]]
chondro [[1 : 3, -4]]
chondro [[1 : 3, "x"]]
chondro [[1 : 3, c (TRUE, FALSE)]] # note the recycling!
@
To select one column, the \Rfunction{\$} operator is more convenient\mFun{\Rfunction{\$}}:
<>=
flu$c
@
\chy supports command line completion for the \Rcode{\$} operator.
The extra data may also be set this way\mFun{\Rfunction{\$<-}}:
<>=
flu$n <- list (1 : 6, label = "sample no.")
@
This function will append new columns, if necessary.
\subsection{More on the \Rfunction{[[]]} and \Rfunction{[[]]<-} Operators: Accessing Single Elements of the Spectra Matrix}
\label{sec:square-brack-replace}
\Rfunction{[[]]} works mostly analogous to []. In addition, however, these two functions also accept
index matrices of size $n \times 2$. In this case, a vector of values from the spectra matrix is
returned.
<>=
indexmatrix <- matrix (c (1 : 3, 1 : 3), ncol = 2)
indexmatrix
chondro [[indexmatrix, wl.index = TRUE]]
diag (chondro [[1 : 3, , min ~ min + 2i]])
@
\Rfunction{[[]]<-} also accepts index matrices of size $n \times 2$.
<>=
indexmatrix <- matrix (c (1 : 3, 1 : 3), ncol = 2)
indexmatrix
chondro [[indexmatrix, wl.index = TRUE]]
diag (chondro [[1 : 3, , min ~ min + 2i]])
@
\subsection{Wavelengths}
\label{sec:wavelength-axis}
\index{wavelength!formula notation}
\index{wavelength!conversion to index}
\index{wavelength indices!conversion to wavelength}
\subsubsection{Converting Wavelengths to Indices and vice versa}
\label{sec:wavelength-indices}
Spectra\mFun{\Rfunction{wl2i} \Rfunction{i2wl}} in \phy have always discretized wavelength axes, they
are stored in a matrix with each column corresponding to one wavelength. \phy provides two functions
to convert the respective column indices into wavelengths and vice versa: \Rfunction{i2wl} and
\Rfunction{wl2i}.
If the wavelengths are given as a numeric vector, they are each converted to the corresponding
wavelength. In addition there is a more sophisticated possibility of specifying wavelength ranges
using a \emph{formula}. The basic syntax is \emph{start~}$\sim$~\emph{end.} This yields a vector
\emph{index of start }\Rcode{:}\emph{ index of end.}
The result of the formula conversion differs from the numeric vector conversion in three ways:
\begin{itemize}
\item The colon operator for constructing vectors accepts only integer numbers, the tilde (for
formulas) does not have this restriction.
\item If the vector does not take into account the spectral resolution, one may get only every
$n$\textsuperscript{th} point or repetitions of the same index:\\
<>=
wl2i (flu, 405 : 410)
@
<>=
wl2i (flu, 405 ~ 410)
@
<>=
wl2i (chondro, 1000 : 1010)
@
<>=
wl2i (chondro, 1000 ~ 1010)
@
\item If the object's wavelength axis is not ordered, the formula approach will give weird
results. In that (probably rare) case, use \Rfunction{orderwl} first to obtain an object with
ordered wavelength axis.
\end{itemize}
\emph{start} and \emph{end} may contain the special variables \Robject{min} and \Robject{max} that
correspond to the lowest and highest wavelengths of the object:
<>=
wl2i (flu, min ~ 410)
@
Often, specifications like \emph{wavelength \textpm $n$\ data points} are needed. They can be given
using complex numbers in the formula. The imaginary part is added to the index calculated from the
wavelength in the real part:
<>=
wl2i (flu, 450 - 2i ~ 450 + 2i)
wl2i (flu, max - 2i ~ max)
@
To specify several wavelength ranges, use a list containing the formulas
and vectors\footnote{Formulas are combined to a list by \Rfunction{c}.}:
<>=
wl2i (flu, c (min ~ 406.5, max - 2i ~ max))
@
This mechanism also works for the wavelength arguments of \Rfunction{[]}, \Rfunction{[[]]}, and
\Rfunction{plotspc}.
\subsubsection{Selecting Wavelength Ranges}
Wavelength ranges can easily be selected using \Rfunction{[]}'s third argument:
<>=
plot (paracetamol [,, 2800 ~ 3200])
@
\\
By default, the values given are treated as wavelengths. If they are indices into the columns of the
spectra matrix, use \Rcode{wl.index = TRUE}:
<>=
plot (paracetamol [,, 2800 : 3200, wl.index = TRUE])
@
\\
Section~\ref{sec:wavelength-indices}~(p.~\pageref{sec:wavelength-indices}) details into the different
possibilities of specifying wavelengths.
\subsubsection{Deleting Wavelength Ranges}
\label{sec:del-wavelengths}
Deleting wavelength ranges may be accomplished using negative index vectors together with \Rcode{wl.index = TRUE}.
<>=
plot (paracetamol [,, -(500 : 1000), wl.index = TRUE])
@
\\
However, this mechanism works only if the proper indices are known.
If the range to be cut out is rather known in the units of the wavelength axis, it is easier to
select the remainder of the spectrum instead. To delete the spectral range from 1750 to
2800\,cm\textsuperscript{-1} of the paracetamol spectrum one can thus use:
<>=
plot (paracetamol [,, c (min ~ 1750, 2800 ~ max)])
@
\\
(It is possible to produce a plot of this data where the cut range is actually omitted and the
wavelength axis is optionally cut in order to save space. For details see the ``plotting'' vignette).
\subsubsection{Changing the Wavelength Axis}
\label{sec:wavel-axis-conv}
\index{wavelength!conversion} Sometimes wavelength axes need to be transformed, \eg converting from
wavelengths to frequencies. In this case, retrieve the wavelength axis vector with
\Rfunction{wl}\mFun{\Rfunction{wl}, \Rfunction{wl<-}}, convert each value of the resulting vector and
assign the result with \Rfunction{wl<-}. Also the label of the wavelength axis may need to be
adjusted.
As an example, convert the wavelength axis of \Robject{laser} to frequencies. As the wavelengths are
in nanometers, and the frequencies are easiest expressed in terahertz, an additional conversion
factor of 1000 is needed:
<>=
laser
wavelengths <- wl (laser)
frequencies <- 2.998e8 / wavelengths / 1000
wl (laser) <- frequencies
labels (laser, ".wavelength") <- "f / THz"
laser
rm (laser)
@
There are other possibilities of invoking \Rfunction{wl<-} including the new label, \eg
<<>>=
wl (laser, "f / THz") <- frequencies
@
and
<<>>=
wl (laser) <- list (wl = frequencies, label = "f / THz")
@
see \verb+?`wl<-`+ for more information.
\subsubsection{Ordering the Wavelength Axis}
\label{sec:orderwl}
If the wavelength axis of an object needs reordering (\eg after \Rfunction{collapse}), \Rfunction{orderwl}\mFun{\Rfunction{orderwl}} can be used:
<>=
barb <- collapse (barbiturates [1 : 3])
wl (barb)
barb <- orderwl (barb)
wl (barb)
@
\subsection{Conversion to Long- and Wide-Format \df{}s}
\label{sec:conv-long-form}
\mFun{\Rfunction{as.data.frame}}\Rfunction{as.data.frame} extracts the \Rcode{@data} slot as a \df:
<<>>=
flu <- flu [,,400 ~ 407] # make a small and handy version of the flu data set
as.data.frame (flu)
colnames (as.data.frame (flu))
as.data.frame (flu) $ spc
@
Note that the spectra matrix is still a matrix inside column \Rcode{\$spc}.
\Rfunction{as.data.frame} and the abbreviations \mFun{\Rfunction{\$.},
\Rfunction{\$..}}\Rfunction{\$.} and \Rfunction{\$..} retrieve the usual wide format \df{}s:
<<>>=
flu$.
flu$..
@
If another subset of colums needs to be extracted, use \mFun{\Rfunction{[[]]}}\Rfunction{[[]]}:
<<>>=
flu [[, c ("c", "spc")]]
@
%
This can be combined with extracting certain spectra and wavelengths, see below in subsection
``\nameref{sec:conv-matrix}'' on page \pageref{sec:conv-matrix}.
The transpose of a wide format \df can be obtained by
\mFun{\Rfunction{as.t.df}}\Rfunction{as.t.df}. For further examples, see the discussion of
\Rpackage{ggplot2} in \Rcode{vignette ("plotting")}.
<<>>=
as.t.df (apply (flu, 2, mean_pm_sd))
@
Some functions need the data being an \emph{unstacked} or \emph{long-format}
\df. \mFun{\Rfunction{as.long.df}}\Rfunction{as.long.df} is the appropriate conversion function.
<<>>=
head (as.long.df (flu), 20)
@
\subsection{Conversion to Matrix}
\label{sec:conv-matrix}
\mFun{\Rfunction{as.matrix}, \Rfunction{[[]]}} The spectra matrix is extracted by
\Rfunction{as.matrix}, the convenient abbreviation is \Rfunction{[[]]}:
<<>>=
flu [[]]
class (flu [[]])
@ \Rfunction{[[]]} takes the same arguments as \Rfunction{[]}, and can be used to extract a matrix
containing parts of the spectra matrix:
<<>>=
flu [[1:3,, 406 ~ 407]]
@
If indices for the columns to extract are given, a \df is returned instead of a matrix:
<<>>=
flu [[1:3, c ("file", "spc"), 406 ~ 407]]
@
<<>>=
rm (flu)
@
\section{Combining and Decomposing \chy Objects}
\label{sec:combine}
\subsection{Binding Objects together}
\label{sec:bind}
\Rclass{hyperspec} \mFun{\Rfunction{cbind} \Rfunction{rbind}} Objects can be bound together, either
by columns (\Rfunction{cbind}) to append a new spectral range or by row (\Rfunction{rbind}) to append
new spectra:
<>=
dim (flu)
dim (cbind (flu, flu))
dim (rbind (flu, flu))
@
There is also a more general function, \Rfunction{bind}, taking the direction
(\Rcode{\textquotedbl{}r\textquotedbl{}} or \Rcode{\textquotedbl{}c\textquotedbl{}}) as first
argument followed by the objects to bind either in separate arguments or in a list.
As usual for \Rfunction{rbind} and \Rfunction{cbind}, the objects that should be bound together must
have the same rows and columns, respectively.
For binding row-wise (\Rfunction{rbind}), \Rfunction{collapse}\mFun{\Rfunction{collapse}} is
more flexible but also faster.
\subsection{Binding Objects that do not Share the Same Extra Data and/or Wavelength Axis}
\label{sec:collapse}
\Rfunction{collapse} \mFun{\Rfunction{collapse}} combines objects that should be bound together by
row, but they do not share the columns and/or spectral range. The resulting object has all columns
from all input objects, and all wavelengths from the input objects. If an input object does not have
a particular column or wavelength, its value in the resulting object is \Rcode{NA}.
The \Robject{barbiturates} data is a list of \Sexpr{length (barbiturates)} \chy objects, each
containing one mass spectrum. The spectra have between \Sexpr{(min (sapply (barbiturates, nwl)))} and
\Sexpr{(max (sapply (barbiturates, nwl)))} data points each.
<>=
barb <- collapse (barbiturates)
wl (barb) [1 : 25]
@
The resulting object does not have an ordered wavelength axis. This can be obtained in a second step:
<>=
barb <- orderwl (barb)
barb [[1:3, , min ~ min + 10i]]
@
\subsection{Binding Objects that do not Share the Same Spectra}
\label{sec:merge}
\Rfunction{merge} \mFun{\Rfunction{merge}} adds a new spectral range (like \Rfunction{cbind}), but
works also if spectra are missing in one of the objects. The arguments \Rfunarg{by}, \Rfunarg{by.x},
and \Rfunarg{by.y} specify which columns should be used to decide which spectra are the same. The
arguments \Rfunarg{all}, \Rfunarg{all.x}, and \Rfunarg{all.y} determine whether spectra should be
kept for the result set if they appear in only one of the objects. For details, see also the help on
the base function \Rpackage{merge}.
As an example, let's construct a version of the \Robject{chondro} data like being taken as two maps
with different spectral ranges. In each data set, some spectra are missing.
<>=
chondro.low <- sample (chondro [,, 600 ~ 1200], 700)
nrow (chondro.low)
chondro.high <- sample (chondro [,, 1400 ~ 1800], 700)
nrow (chondro.high)
@
As all extra data columns are the same, no special declarations are needed for merging the data:
<>=
chondro.merged <- merge (chondro.low, chondro.high)
nrow (chondro.merged)
@
%
By default, the result consists of only those spectra, where \emph{both} spectral ranges were
available. To keep all spectra replacing missing parts by \Rcode{NA} (see fig.~\ref{fig:merge}):
<<>>=
chondro.merged <- merge (chondro.low, chondro.high, all = TRUE)
nrow (chondro.merged)
@
\begin{figure}
\setkeys{Gin}{width = .66\textwidth}
\subfloat[\label{fig-merge-map}]{
<>=
print (levelplot (spc ~ x * y | as.factor (paste (.wavelength, " 1/cm")),
chondro.merged [,,c(1000, 1650)],
aspect = "iso", col.regions = matlab.palette ()))
@
}
\setkeys{Gin}{width = .33\textwidth}
\subfloat[\label{fig-merge-mat}]{
<>=
png ("introduction-fig--merged.png", width = 500, height = 425, res=100)
plot (chondro.merged [1 : 100], "mat")
dev.off()
rm (chondro)
@
\includegraphics[width = .33\textwidth]{introduction-fig--merged}
}
\caption[]{\label{fig:merge}
\subref{fig-merge-map} For both spectral ranges some spectra are missing.
\subref{fig-merge-mat} The missing parts of the spectra are filled with \Rcode{NA}.}
\end{figure}
<<>>=
merged <- merge (chondro [1:7,, 610 ~ 620], chondro [5:10,, 615 ~ 625], all = TRUE)
merged$.
@
%
If the spectra overlap, the result will have both data points. In the example here one could easily
delete duplicate wavelengths. For real data, however, the duplicated wavelength will hardly ever
contain the same values. The appropriate method to deal with this situation depends on the data at
hand, but it will usually be some kind of spectral interpolation.
One possibility is removing duplicated wavelengths by using the mean intensity. This can conveniently
be done by using \Rfunction{approx} using \Rcode{method = "constant"}. For duplicated wavelengths,
the intensities will be combined by the tie function. This already defaults to the mean, but we need
\Rcode{na.rm = TRUE}.
Thus, the function to calculate the new spectral intensities is
<>=
approxfun <- function (y, wl, new.wl){
approx (wl, y, new.wl, method = "constant",
ties = function (x) mean (x, na.rm = TRUE)
)$y
}
@
%
which can be applied to the spectra:
<<>>=
merged <- apply (merged, 1, approxfun,
wl = wl (merged), new.wl = unique (wl (merged)),
new.wavelength = "new.wl")
merged$.
@
\subsection{Matrix Multiplication}
\label{sec:matmult}
Two \chy objects can be matrix multiplied by \Rfunction{\%*\%}\mFun{\Rfunction{\%*\%}}. For an
example, see the principal component analysis below (section~\ref{sec:pca} on page
\pageref{sec:pca}).
\subsection{Decomposition}
\label{sec:decomposition}
Matrix decompositions are common operations during chemometric data analysis. The results, \eg of a
principal component analysis are two matrices, the so-called scores and loadings. The results can
have either the same number of rows as the spectra matrix they were calculated from (scores-like), or
they have as many wavelengths as the spectra (loadings-like).
Both types of result objects can be ``re-imported'' into \chy objects with function
\Rfunction{decomposition}\mFun{\Rfunction{decomposition}}. A scores-like object retains all
per-spectrum information (\ie the extra data) while the spectra matrix and wavelength vector are
replaced. A loadings-like object retains the wavelength information, while extra data is deleted
(set to \Rcode{NA}) unless the value is constant for all spectra.
A demonstration can be found in the principal component analysis example (section~\ref{sec:pca}) on
page \pageref{sec:pca}.
\section{Plotting}
\phy offers a variety of possibilities to plot spectra, spectral maps, the spectra matrix, time
series, depth profiles, etc.. This all is discussed in a separate document: see
\verb+vignette ("plotting")+.
\section{Spectral (Pre)processing}
\subsection{Cutting the Spectral Range}\mFun{\Rfunction{[]} \Rfunction{[[]]}}
The extraction functions \Rfunction{[]} and \Rfunction{[[]]} can be used to cut the spectra: Their
third argument takes wavelength specifications as discussed above and also logicals (i.e. vectors
specifying with \Rcode{TRUE}/\Rcode{FALSE} for each column of \Rcode{\$spc}
whether it should be included or not.\\
\Rfunction{[]} returns a \Rclass{hyperSpec} object, \Rfunction{[[]]} the spectra matrix \Rcode{\$spc}
(or the \df \Rcode{@data} if in addition data columns were specified) only.
<>=
flu [,, min ~ 408.5]
flu [[,, c (min ~ min + 2i, max - 2i ~ max)]]
@
\subsection{Shifting Spectra}
\label{sec:shifting-spectra}
Sometimes, spectra need to be aligned along the spectral axis.
In general, two options are available for shifting spectra along the wavelength axis.
\begin{enumerate}
\item The wavelength axis can be shifted, while the intensities stay unaffected.
\item the spectra are interpolated onto a new wavelength axis, while the nominal wavelengths stay.
\end{enumerate}
The first method is very straightforward (see fig~\ref{fig:shift:wl}):
<<>>=
tmp <- chondro
wl (tmp) <- wl (tmp) - 10
@
<>=
plot (chondro [135])
plot (tmp [135,,], add = TRUE, col = "red")
@
but it cannot be used if each spectrum (or groups of spectra) are shifted individually.
In that case, interpolation is needed. R offers many possibilities to interpolate (\eg
\Rfunction{approx} for constant / linear approximation, \Rfunction{spline} for spline interpolation,
\Rfunction{loess} can be used to obtain smoothed approximations, etc.). The appropriate interpolation
strategy will depend on the spectra, and \phy therefore leaves it up to the user to select a sensible
interpolation function.
As an example, we will use natural splines to do the interpolation. It is convenient to set it up as
a function:
<>=
interpolate <- function (spc, shift, wl){
spline (wl + shift, spc, xout = wl, method = "natural")$y
}
@
This function can now be applied to a set of spectra (see fig~\ref{fig:shift:interp}):
<<>>=
tmp <- apply (chondro, 1, interpolate, shift = -10, wl = wl (chondro))
@
<>=
plot (chondro [135])
plot (tmp [135], add = TRUE, col = "red")
@
<>=
tmp <- chondro [135,, 990 ~ 1010]
plot (tmp, lines.args = list (type = "b", pch = 19, cex = 0.5))
wl (tmp) <- wl(tmp) - 0.5
plot (tmp, lines.args = list (type = "b", pch = 19, cex = 0.5), add = TRUE, col = "red")
tmp <- chondro [135]
tmp <- apply (tmp, 1, function (x, wl, shift)
spline (wl + shift, x, xout = wl)$y,
wl = wl (tmp), shift = -0.5)
plot (tmp, lines.args = list (type = "b", pch = 19, cex = 0.5), add = TRUE, col = "blue")
@
If different spectra need to be offset by different shift, use a loop\footnote{\Rfunction{sweep}
cannot be used here, and while there is the possibility to use sapply or mapply, they are not
faster than the for loop in this case. Make sure to work on a copy of the spectra matrix, as that
is much faster than row-wise extracting and changing the spectra by \Rfunction{[[} and
\Rfunction{[[<-}.}
<<>>=
shifts <- rnorm (nrow (chondro))
tmp <- chondro [[]]
for (i in seq_len (nrow (chondro)))
tmp [i, ] <- interpolate (tmp [i, ], shifts [i], wl = wl (chondro))
chondro [[]] <- tmp
@
\begin{figure}[t]
\subfloat[\label{fig:shift:wl} \Rfunction{wl<-}]{\includegraphics[width=.33\linewidth]{introduction-fig--shift-wl}}
\subfloat[\label{fig:shift:interp} interpolation]{\includegraphics[width=.33\linewidth]{introduction-fig--shift-interp}}
\subfloat[\label{fig:shift:untsch} ]{\includegraphics[width=.33\linewidth]{introduction-fig--shift-untsch}}
\caption[]{Shifting the Spectra along the Wavelength Axis. \subref{fig:shift:wl} Changing the
wavelength values. \subref{fig:shift:interp} Interpolation. \subref{fig:shift:untsch} Detail view
of the phenylalanine band: shifting by \Rfunction{wl<-} (red) does not affect the intensities,
while the spectrum is slightly changed by interpolations (blue).}
\label{fig:shift}
\end{figure}
\subsubsection{Calculating the Shift}
\label{sec:calculating-shift}
Often, the shift in the spectra is determined by aligning a particular signal. This strategy works
best with spectrally oversampled data that allows accurate determination of the signal position.
For the \Robject{chondro} data, let's use the maximum of the phenylalanine band between 990 and
\rcm{1020}. As just the very maximum is too coarse, we'll use the maximum of a square polynomial
fitted to the maximum and its two neighbours.
<<>>=
find.max <- function (y, x){
pos <- which.max (y) + (-1:1)
X <- x [pos] - x [pos [2]]
Y <- y [pos] - y [pos [2]]
X <- cbind (1, X, X^2)
coef <- qr.solve (X, Y)
- coef [2] / coef [3] / 2 + x [pos [2]]
}
bandpos <- apply (chondro [[,, 990 ~ 1020]], 1, find.max, wl (chondro [,, 990 ~ 1020]))
refpos <- find.max (colMeans (chondro[[,, 990 ~ 1020]]), wl (chondro [,, 990 ~ 1020]))
shift1 <- refpos - bandpos
@
A second possibility is to optimize the shift. For this strategy, the spectra must be sufficiently
similar, while low spectral resolution is compensated by using larger spectral windows.
<<>>=
chondro <- chondro - spc.fit.poly.below (chondro [,,min+3i ~ max - 3i], chondro)
chondro <- sweep (chondro, 1, rowMeans (chondro [[]], na.rm = TRUE), "/")
@
<<>>=
targetfn <- function (shift, wl, spc, targetspc){
error <- spline (wl + shift, spc, xout = wl)$y - targetspc
sum (error^2)
}
shift2 <- numeric (nrow (chondro))
tmp <- chondro [[]]
target <- colMeans (chondro [[]])
for (i in 1 : nrow (chondro))
shift2 [i] <- unlist (optimize (targetfn, interval = c (-5, 5), wl = chondro@wavelength,
spc = tmp[i,], targetspc = target)$minimum)
@
Figure~\ref{fig:fit-shift} shows that the second correction method works better for the chondrocyte
data. This was expected, as the spectra are hardly or not oversampled, but are very similar to each
other.
\begin{figure}[t]
\centering
<>=
df <- data.frame (shift = c (shifts, shifts + shift1, shifts + shift2),
method = rep (c ("original", "find maximum", "interpolation"),
each = nrow (chondro)))
plot (histogram (~ shift | method, data = df, breaks = do.breaks(range (df$shift), 25),
layout = c (3,1)))
@
\caption{The shifts used to disturb the chondrocyte data (original), and the remaining shift after
correction with the two methods discussed here.}
\label{fig:fit-shift}
\end{figure}
\subsection{Removing Bad Data}
\subsubsection{Bad Spectra}
\label{sec:bad-spectra}
Occasionally, one may want to remove spectra because of too low or too high signal.
E.g. for infrared spectra one may state that the absorbance maximum should be, say, between 0.1 and
1. \chy's comparison operators return a logical matrix of the size of the spectra that is suitable
for later indexing:
<>=
ir.spc <- chondro / 1500 ## fake IR data
high.int <- apply (ir.spc > 1, 1, any) # any point above 1 is bad
low.int <- apply (ir.spc, 1, max) < 0.1 # the maximum should be at least 0.1
ir.spc <- ir.spc [! high.int & ! low.int]
@
\subsubsection{Removing Spectra outside mean $\pm$ $n$ sd}
<<>>=
mean_sd_filter <- function (x, n = 5) {
x <- x - mean (x)
s <- n * sd (x)
(x <= s) & (x > -s)
}
OK <- apply (chondro [[]], 2, mean_sd_filter, n = 4) # logical matrix
spc.OK <- chondro [apply (OK, 1, all)]
@
<>=
plot (chondro [! apply (OK, 1, all)])
i <- which (! OK, arr.ind = TRUE)
points (wl (chondro) [i [,2]], chondro[[!OK]], pch = 19, col = "red", cex = 0.5)
@
\begin{figure}[tb]
\subfloat[mean $\pm$ sd filter]{\includegraphics[width=0.49\textwidth]{introduction-fig--filter}}
\subfloat[remove bad points]{\includegraphics[width=0.49\textwidth]{introduction-fig--bad}}
\caption{filtering data}
\label{fig:filter}
\end{figure}
\subsubsection{Bad Data Points}
\label{sec:bad-data-points}
Assume the data contains once in a while a detector readout of 0:
<<>>=
spc <- chondro [1 : 3,, min ~ min + 15i]
spc [[cbind (1:3, sample (nwl (spc), 3)), wl.index = TRUE]] <- 0
spc [[]]
@
We can set these points to \Rcode{NA}, again using that the comparison returns a suitable logical matrix:
<<>>=
spc [[spc < 1e-4]] <- NA
spc [[]]
@
Depending on the type of analysis, one may wants to replace the \Rcode{NA}s by interpolating the
neighbour values. So far, \phy provides three functions that can interpolate the \Rcode{NA}s:
\mFun{\Rfunction{spc.NA.linapprox}, \Rfunction{spc.loess}, \Rfunction{spc.bin}}:
\Rfunction{spc.NA.linapprox}, \Rfunction{spc.loess}, and \Rfunction{spc.bin} with \Rcode{na.rm =
TRUE} (the latter two are discussed below).
<<>>=
spc.corrected <- spc.NA.linapprox (spc)
spc.corrected [[]]
@
<>=
spc [[is.na (spc)]] <- 0
plot (spc)
spc [[spc < 1e-4]] <- NA
plot (spc.NA.linapprox (spc), add = TRUE, col = "blue", lines.args = list (type = "b", pch = 19, cex = 0.5))
@
\subsubsection{Spikes in Raman Spectra}
\label{sec:spikes-raman-spectra}
...coming soon...
\subsection{Smoothing Interpolation}
\mFun{\Rfunction{spc.bin} \Rfunction{spc.loess}} Spectra acquired by grating instruments are
frequently interpolated onto a new wavelength axis, \eg because the unequal data point spacing should
be removed. Also, the spectra can be smoothed: reducing the spectral resolution allows to increase
the signal to noise ratio. For chemometric data analysis reducing the number of data points per
spectrum may be crucial as it reduces the dimensionality of the data.
\phy provides two functions to do so: \Rfunction{spc.bin} and \Rfunction{spc.loess}.
\Rfunction{spc.bin} bins the spectral axis by averaging every \Rfunarg{by} data points.
<>=
plot (paracetamol, wl.range = c (300 ~ 1800, 2800 ~ max), xoffset = 850)
p <- spc.loess (paracetamol, c(seq (300, 1800, 2), seq (2850, 3150, 2)))
plot (p, wl.range = c (300 ~ 1800, 2800 ~ max), xoffset = 850, col = "red", add = TRUE)
b <- spc.bin (paracetamol, 4)
plot (b, wl.range = c (300 ~ 1800, 2800 ~ max), xoffset = 850,
lines.args = list (pch = 20, cex = .3, type = "p"), col = "blue", add = TRUE)
@
<>=
plot (paracetamol [, , 1600 ~ 1670])
plot (p [, , 1600 ~ 1670], col = "red", add = TRUE)
plot (b [, , 1600 ~ 1670], col = "blue", add = TRUE)
@
%
\begin{figure}
\includegraphics[width=0.33\textwidth]{introduction-fig--fig-loess}
\includegraphics[width=0.33\textwidth]{introduction-fig--fig-loess-kl}
\caption{\label{fig-spcloess}Smoothing interpolation by \Rfunction{spc.loess} with new data point
spacing of 2 cm\textsuperscript{-1} (red) and \Rfunction{spc.bin} (blue). The magnification on the
right shows how interpolation may cause a loss in signal height.}
\end{figure}
\Rfunction{spc.loess} applies R's \Rfunction{loess} function for spectral interpolation. Figure
\ref{fig-spcloess} shows the result of interpolating from 300 to 1800 and 2850 to 3150
cm\textsuperscript{-1} with 2 cm\textsuperscript{-1} data point distance. This corresponds to a
spectral resolution of about 4 cm\textsuperscript{-1}, and the decrease in spectral resolution can be
seen at the sharp bands where the maxima are not reached (due to the fact that the interpolation
wavelength axis does not necessarily hit the maxima. The original spectrum had
\Sexpr{nwl (paracetamol)} data points with unequal data point spacing (between
\Sexpr{signif (min (diff (wl (paracetamol))), 2)} and
\Sexpr{signif (max (diff (wl (paracetamol))), 2)} cm\textsuperscript{-1}). The interpolated spectrum
has \Sexpr{nwl (p)} data points.
\subsection{Background Correction}
\mFun{\Rfunction{sweep}} To subtract a background spectrum of each of the spectra in an object,
use \Rcode{sweep (spectra, 2, background.spectrum, \textquotedbl{}-\textquotedbl{})}.
\subsection{Offset Correction}
\mFun{\Rfunction{apply} \Rfunction{sweep}} Calculate the offsets and sweep them off the spectra:
<>=
offsets <- apply (chondro, 1, min)
chondro.offset.corrected <- sweep (chondro, 1, offsets, "-")
@
If the offset is calculated by a function, as here with the \Rfunction{min}, \chy's
\Rfunction{sweep} method offers a shortcut: \Rfunction{sweep}'s \Rfunarg{STATS} argument may be the
function instead of a numeric vector:
<>=
chondro.offset.corrected <- sweep (chondro, 1, min, "-")
@
\subsection{Baseline Correction}
\phy comes with two functions to fit polynomial baselines.
\mFun{\Rfunction{spc.fit.poly} \Rfunction{spc.fit.poly.below}}\Rfunction{spc.fit.poly} fits a
polynomial baseline of the given order. A least-squares fit is done so that the function may be used
on rather noisy spectra. However, the user must supply an object that is cut
appropriately. Particularly, the supplied wavelength ranges are not weighted.
\Rfunction{spc.fit.poly.below} tries to find appropriate support
points for the baseline iteratively.
Both functions return a \Rclass{hyperSpec} object containing the
fitted baselines. They need to be subtracted afterwards:
<>=
bl <- spc.fit.poly.below (chondro)
chondro <- chondro - bl
@
For details, see \Rcode{vignette (baselinebelow)}.
Package \Rpackage{baseline} \Sexpr{cite.pkg("baseline")} offers many more functions for baseline
correction. The \Rfunction{baseline} function works on the spectra matrix, which is extracted by
\Rfunction{[[]]}. The result is a \Rclass{baseline} object, but can easily be re-imported into the
\chy object:
<>=
corrected <- hyperSpec::chondro [1] # start with the unchanged data set
require ("baseline")
bl <- baseline (corrected [[]], method = "modpolyfit", degree = 4)
corrected [[]] <- getCorrected (bl)
@
Fig.~\ref{fig:bl-baseline} shows the result for the first spectrum of \Robject{chondro}.
\begin{figure}[tb]
\subfloat[Raw data with baseline]{
<>=
baseline <- corrected
baseline [[]] <- getBaseline (bl)
plot (hyperSpec::chondro [1], plot.args = list (ylim = range (hyperSpec::chondro [1], 0)))
plot (baseline, add = TRUE, col = "red")
@
}\hfill%
\subfloat[Baseline corrected spectrum]{
<>=
plot (corrected, plot.args = list (ylim = range (hyperSpec::chondro [1], 0)))
@
}
\caption{Baseline correction using the \Rpackage{baseline} package: the first spectrum of
\Robject{chondro} with baseline (left) and after baseline correction (right) with method
``modpolyfit''.}
\label{fig:bl-baseline}
\end{figure}
<<>>=
rm (bl, chondro)
@
\subsection{Intensity Calibration}
\subsubsection{Correcting by a constant, e.\,g. Readout Bias}
CCD cameras often operate with a bias, causing a constant value for each pixel. Such a constant can
be immediately subtracted:\\
\Rcode{spectra - constant}
\subsubsection{Correcting Wavelength Dependence}
\mFun{\Rfunction{sweep}} For each of the wavelengths the same correction needs to be applied to all
spectra.
\begin{enumerate}
\item There might be wavelength dependent offsets (background or dark spectra).
They are subtracted:\\
\Rcode{sweep (spectra, 2, offset.spectrum, \textquotedbl{}-\textquotedbl{})}
\item A multiplicative dependency such as a CCD's photon efficiency: \\
\Rcode{sweep (spectra, 2, photon.efficiency, \textquotedbl{}/\textquotedbl{})}
\end{enumerate}
\subsubsection{Spectra Dependent Correction}
\mFun{\Rfunction{sweep}} If the correction depends on the spectra (e.\,g. due to inhomogeneous
illumination while collecting imaging data, differing optical path length, etc.), the
\Rfunarg{MARGIN}of the \Rfunction{sweep} function needs to be 1 or \Rcode{SPC}:
\begin{enumerate}
\item Pixel dependent offsets are subtracted:\\
\Rcode{sweep (spectra, SPC, pixel.offsets, \textquotedbl{}-\textquotedbl{})}
\item A multiplicative dependency: \\
\Rcode{sweep (spectra, SPC, illumination.factors, \textquotedbl{}{*}\textquotedbl{})}
\end{enumerate}
\subsection{Normalization}
\mFun{\Rfunction{apply} \Rfunction{sweep}} Again, \Rfunction{sweep} is the function of
choice. E.\,g. for area normalization, use:
<>=
chondro <- sweep (chondro, 1, mean, "/")
@
(using the mean instead of the sum results in conveniently scaled spectra with intensities around 1.)
If the calculation of the normalization factors is more elaborate, use a two step procedure:
\begin{enumerate}
\item Calculate appropriate normalization factors\\
You may calculate the factors using only a certain wavelength range, thereby normalizing on a
particular band or peak.
\item Again, sweep the factor off the spectra:\\
\Rcode{normalized <- sweep (spectra, 1, factors, \textquotedbl{}{*}\textquotedbl{})}
\end{enumerate}
<>=
factors <- 1 / apply (chondro [, , 1600 ~ 1700], 1, mean)
chondro <- sweep (chondro, 1, factors, "*")
@
For minimum-maximum-normalization, first do an offset- or baseline correction, then normalize using
\Rfunction{max}.
\subsection{Centering and Variance Scaling the Spectra}
\mFun{\Rfunction{scale}} Centering means that the mean spectrum is subtracted from each of the
spectra. Many data analysis techniques, like principal component analysis, partial least squares,
etc., work much better on centered data. From a spectroscopic point of view it depends on the
particular data set whether centering does make sense or not.
Variance scaling is often used in multivariate analysis to adjust the influence and scaling of the
variates (that are typically different physical values). However, spectra already do have the same
scale of the same physical value. Thus one has to trade off the the expected numeric benefit with the
fact that for wavelengths with low signal the noise level will ``explode'' by variance
scaling. Scaling usually makes sense only for centered data.
Both tasks are carried out by the same method in \R, \Rfunction{scale}, which will by default both
mean center and variance scale the spectra matrix.
To center the \Robject{flu} data set, use:
<>=
flu.centered <- scale (flu, scale = FALSE)
plot (flu.centered)
@
On the other hand, the \Robject{chondro} data set consists of Raman spectra, so the spectroscopic
interpretation of centering is getting rid of the the average chemical composition of the
sample. But: what is the meaning of the ``average spectrum'' of an inhomogeneous sample? In this case
it may be better to subtract the minimum spectrum (which will hopefully have almost the same benefit
on the data analysis) as it is the spectrum of that chemical composition that is underlying the whole
sample.
One more point to consider is that the actual minimum spectrum will pick up (negative) noise. In
order to avoid that, using \eg the 5\textsuperscript{th} percentile spectrum is more suitable:
<>=
chondro <- scale (chondro, center = quantile (chondro, 0.05), scale = FALSE)
plot (chondro, "spcprctl5")
@
See section~\ref{sec:speed-considerations} (p.~\ref{sec:speed-considerations}) for some tips to speed
up these calculations.
\subsection{Multiplicative Scatter Correction (MSC)}
\mFun{\Rfunction{pls::msc}}MSC can be done using \Rfunction{msc} from package
\Rpackage{pls}\citep{Wehrens2007}. It operates on the spectra matrix:
<>=
require (pls)
chondro.msc <- chondro
chondro.msc [[]] <- msc (chondro [[]])
@
\subsection{Spectral Arithmetic} \mFun{\Rfunction{+ - {*} / \textasciicircum{} log log10}}
Basic mathematical functions are defined for \Rclass{hyperSpec} objects.
You may convert spectra:\\
\Rcode{absorbance.spectra = - log10 (transmission.spectra)}
In this case, do not forget to adapt the label:\mFun{\Rfunction{labels}}
<