Soil Database Interface: the soilDB package

Soil Database Interface: the soilDB package


This package provides methods for extracting soils information from local PedonPC and AK Site databases (MS Access format), local NASIS databases (MS SQL Server), and the SDA web-service. Currently USDA-NCSS data sources are supported, however, there are plans to develop interfaces to outside systems such as the Global Soil Mapping project.


With a recent version of R, it should be possible to get all of the packages that soilDB depends on via:

# run these commands in the R console
install.packages("RODBC", dep=TRUE) # stable version from CRAN + dependencies
install.packages("soilDB", dep=TRUE) # stable version from CRAN + dependencies
install.packages("SSOAP", repos = "", type="source") # SSOAP and XMLSchema

ODBC Connection to the Local NASIS database

Setting up this connection is simple, and outlined in this guide. Queries submitted to the national database will load pedons or DMU data into your local database. Functions in soilDB can only access data in your local database, defined by the “selected set”. While it is possible to filter pedon or DMU records within R, it is sometimes simpler to modify the “selected set” in NASIS. In other words:

  1. load up your local database with pedons associated with your office, project, or MLRA
  2. build your selected set (pedon, site, DMU, etc.)
  3. load data into R for further filtering/processing

R Notes

A basic understanding of R syntax, object structure, package management system, and programming style will greatly help with integration of the soilDB and aqp packages into standard workflows. Lines that start with a # are treated as comments and ignored by R. Recall that online help documents can be accessed via:

# get the manual page for a known function
# get the manual page for a package
# package demo
# use fuzzy matching to search for a topic or keyword
# shortcut:

Mini R Tutorial

# 1. objects: creation, structure, etc.
x <- 1:10       # 'x' is now a vector of integers 1 to 10 
y <- rnorm(10, mean=0, sd=1)  # 'y' is now a vector of 10 random numbers {mean=0, sd=1}
d <- data.frame(x, y) # d is a rectangular table with columns 'x' and 'y'

str(d) # check structure
class(d) # inspect object class
head(d) # view first 6 rows
rm(d, x, y) # clean-up by deleting these objects

# 2. most function are vectorized: i.e. automatic iteration
x <- 1:10
x + 1 # the '+' function knows about vectors, and recycles scalars accordingly

# 3. missing data can cause problems unless accounted for
x <- c(x, NA) # 'c()' is the concatonate function
# print the value by typing it into the R console followed by enter

mean(x) # result is NA, as most functions return NA in the presence of missing data
mean(x, na.rm=TRUE) # mean computed after removal of missing data


The soilDB package provides two layers of functionality for extracting data from file-based databases (PedonPC), relational databases (local NASIS), and online data sources (Soil Data Access, OSD information, KSSL data). High-level functions fetchPedonPC() and fetchNASIS() query, combine, verify horizon logic, and return SoilProfileCollection class objects containing the most commonly used site/pedon/horizon data. Low-level functions such as get_site_data_from_pedon_db() and get_hz_data_from_pedon_db() extract specific data from a designated source, and return the results as a data.frame class object.

High-Level Functions

When loading pedons with the fetchNASIS() or fetchPedonPC() functions, checks are performed for:

Fetch Pedons or DMU data from Local NASIS

The fetchNASIS() function will load pedons from the “selected set” in your local NASIS database. An ODBC connection named 'nasis_local' must first be established.

# fetch pedons from the selected set, replace NULL rock fragment fractions with 0
f <- fetchNASIS(nullFragsAreZero=TRUE)
# fetch pedons from the selected set, do not replace NULL rock fragment fractions with 0
f <- fetchNASIS(nullFragsAreZero=FALSE)

# DMU data
fc <- fetchNASIS_component_data()

Fetch Pedons from PedonPC 5.x Database

The fetchPedonPC(dsn) function will load all pedon records into the current R session, based on the value of dsn– a path to a PedonPC version 5.x database. Note that it is not currently possible to use 64bit R to connect with 32bit Acccess. Therefore, if you plan on using this function, please remember to use the 32bit R executable.

# this will only work if you have a PedonPC Database and Windows :(
dsn <- "S:/Service_Center/NRCS/pedon/pedon.accdb"
f <- fetchPedonPC(dsn)

Fetch data from Soil Data Access (SDA) Web-Service

The SDA_query() function submits a query (ANSI SQL) to the SDA web-service using the SSOAP package, parses the result, and returns a data.frame (rectangular table) object.

# setup a query
q <- "select cokey, compname, comppct_r 
from component 
where compname = 'yolo' and majcompflag = 'Yes' "

# run the query
res <- SDA_query(q)

Get map unit keys for a geographic area

The MUKEYS_by_ll_bbox() function accepts a bounding box defined in GCS (longitude, latitude) NAD83 coordinates, and returns a vector of map unit keys. This vector can be used to generate a filtering criteria for queries submitted via SDA_query().

# define bounding box
b <- c(-120.9, 37.7, -120.8, 37.8)

# get map unit keys for this bbox
m <- MUKEYS_by_ll_bbox(b)

Get map unit geometry for a geographic area

The mapunit_geom_by_ll_bbox() function accepts a bounding box defined in GCS (longitude, latitude) NAD83 coordinates, and returns a SpatialPolygonsDataFrame class object of map unit polygons. Note that map unit polygons that overlap with the bounding box are returned, rather than the intersection of bounding box and polygons.

# define bounding box
b <- c(-120.54,38.61,-120.41,38.70)

# fetch the results, may take about 10 seconds
x <- mapunit_geom_by_ll_bbox(b)

Get basic horizon morphology and taxonomic data from Official Series Descriptions

The fetchOSD() function accepts a character vector of soil series names, and returns a SpatialPolygonsDataFrame class object of basic data from those series. Note that this service is provided by the CA Soil Resource Lab as an experimental prototype. The official series descriptions should still be used to verify series information.

# soils of interest
s.list <- c('musick', 'cecil', 'drummer', 'amador', 'pentz', 'reiff')

# fetch and convert data into an SPC
s <- fetchOSD(s.list)

# plot profiles
plot(s, name='hzname', cex.names=0.85, axis.line.offset=-4)

Get the most commonly used data from the Kellog Soil Science Lab database (formerly NSSL database)

The fetchKSSL() function accepts either a soil series name or bounding box in WGS84 geographic coordinates, and returns a SpatialPolygonsDataFrame class object containing the matching records. Note that this service is provided by the CA Soil Resource Lab as an experimental prototype, based on the June 2012 KSSL snapshot. If newer data are needed, it is best to consult the official KSSL data.

# fetch and convert data into an SPC
s <- fetchKSSL(series='Pardee')

# plot profiles
plot(s, name='hzn_desgn', cex.names=0.85, axis.line.offset=-4)

Low-Level Functions

Fetch Data from a PedonPC 5.x Database These functions can be used to extract only subsets of the data available in a PedonPC database, but are generally only useful for advanced users.

# fetch only site data
site_data <- get_site_data_from_pedon_db(dsn)
# fetch only horizon data
hz_data <- get_hz_data_from_pedon_db(dsn)
# fetch only color data
color_data <- get_colors_from_pedon_db(dsn)
# fetch extended data:
# diagnostic horizons, fragment summary, texture modifiers, geomorphology, taxonomic history
extended_data <- get_extended_data_from_pedon_db(dsn)

Fetch Data from a Local NASIS Database These functions can be used to extract only subsets of the data available in the local NASIS database, but are generally only useful for advanced users.

# fetch only site data
site_data <- get_site_data_from_NASIS_db()
# fetch only horizon data
hz_data <- get_hz_data_from_NASIS_db()
# fetch only color data
color_data <- get_colors_from_NASIS_db()
# fetch extended data:
# diagnostic horizons, fragment summary, texture modifiers, geomorphology, taxonomic history
extended_data <- get_extended_data_from_NASIS_db()


Getting SSURGO data from SDA

A quick example of how to use the USDA-NRCS soil data access query facility (SDA). The following code describes how to get component-level soils data for Yolo County (survey area CA113) and compute representative sub-order level classification for each map unit. This example requires an understanding of SQL, US Soil Taxonomy and the SSURGO database. While not the most efficient approach to the task described below, the code does illustrate several strategies for working with SSURGO data in R.

First, we need to setup some functions that will help summarize component data, within each map unit. These functions will be applied chunk-wise to blocks of data, using the split-apply-combine strategy.

# compute total component percentages for the current block of components
f.sum <- function(i) {
   n <- nrow(i) # number of rows
   s <- sum(i$comppct_r) # total component percent
   return(data.frame(pct=s, n=n)) # return the results

# pick the largest suborder from within each map unit
f.largest <- function(i) {
   i.sorted <- i[order(i$pct, decreasing=TRUE), ] # sort largest -> smallest
   top.suborder <- i.sorted$taxsuborder[1] # get the largest suborder
   top.suborder.pct <- i.sorted$pct[1] # get the the corresponding component percent
   return(data.frame(suborder=top.suborder, pct=top.suborder.pct)) # return results

Load required libraries, setup a query, submit to SDA, and process the results.

# load libraries

# get map unit-level data
q.1 <- "SELECT mukey, muacres
FROM legend
INNER JOIN mapunit ON mapunit.lkey = legend.lkey
WHERE legend.areasymbol = 'CA113'"

# get component-level data
q.2 <- "SELECT 
component.mukey, cokey, comppct_r, compname, taxclname, taxorder, taxsuborder, taxgrtgroup, taxsubgrp
FROM legend
INNER JOIN mapunit ON mapunit.lkey = legend.lkey
LEFT OUTER JOIN component ON component.mukey = mapunit.mukey
WHERE legend.areasymbol = 'CA113'"

# submit queries
mu <- SDA_query(q.1)
co <- SDA_query(q.2)

# tabulate percentage of suborder-level taxa within each map unit (mukey)
comp.suborder.sums <- ddply(co, .(mukey, taxsuborder), f.sum, .progress='text')

# keep the largest suborder, and its associated total percentage
comp.suborder <- ddply(comp.suborder.sums, .(mukey), f.largest, .progress='text')

# join the largest suborder (by map unit) to mu
x <- join(mu, comp.suborder, by='mukey')

# get an effective map unit area via: muacres * pct/100
x$effective.area <- with(x, muacres * pct / 100)

Compute the acreage of each suborder within Yolo county, round to the nearest integer, and sort. A spatially-explicit approach would be required for multi-survey area scenarios or an irregular chunk of a single survey.

sort(round(tapply(x$effective.area, x$suborder, sum)))
## Fluvents  Aquents Orthents  Aquepts 
##    10773    12028    41829    49480

Fetch SSURGO Polygons from SDA

Map unit polygons can also be queried from SDA, using a similar bounding box query, with the mapunit_geom_by_ll_bbox() function. This function returns a SpatialPolygonsDataFrame class object, containing all map unit delineations that overlap with the supplied bounding box. Post-processing with functions from the rgeos package can be used to extract the spatial intersection between map unit polygons and a region of interest.

# define bounding box
b <- c(-120.54,38.61,-120.41,38.70)
# query geometry
x <- mapunit_geom_by_ll_bbox(b) # about 20 seconds

# reset margins, and plot the results
# add our original bounding box in red
rect(b[1], b[2], b[3], b[4], border='red', lwd=2)
# add a title at the bottom of the figure
title(sub='SSURGO Map Unit Delineations', line=-2, font=2)

plot of chunk ssurgo-polygons-demo-1

# follow-up by retrieving corresponding map unit data from SDA
in.statement <- format_SQL_in_statement(unique(x$mukey))
q <- paste("SELECT mukey, muname FROM mapunit WHERE mukey IN ", in.statement, sep="")
res <- SDA_query(q)

The Gopheridge Sample Dataset

The gopheridge sample dataset is very similar to the type of data returned from fetchNASIS() or fetchPedonPC(). The following demonstration is geared towards intermediate users of R and who are familiar with the classes and methods defined by the aqp package. Before proceeding it may be helpful to review the aqp manual pages: library(aqp) ; help(aqp), or the SoilProfileCollection object introduction.

Open R, and setup the environment by loading packages and the sample dataset.

library(Hmisc) # lots of helper functions in here

# load example dataset

# what kind of object is this?
## [1] "SoilProfileCollection"
## attr(,"package")
## [1] "aqp"

Define a simple function for extracting the top depth of a diagnostic feature, and use to compute the top depth of Cr contact for each pedon. Note that the arguments to this function are x (a 1-item SoilProfileCollection, i.e. a single soil profile), and d.hz (the name of the diagnostic feature or horizon that we are querying). If the queried diagnostic feature/horizon is not present for any given pedon, NA is returned. This function is used in conjunction with profileApply(), such that it is evaluated for each pedon within a collection. The result is a vector of the same length as the number of pedons in the collection.

# conditional eval of top hz depth of diagnostic feature or horizon <- function(x, d.hz) {
    # extract diagnostic horizon data
    d <- diagnostic_hz(x)
    # subset to the requested diagnostic hz
    d <- subset(d, diag_kind == d.hz)

    # if missing return NA
    if(nrow(d) == 0)

    # return top depth

# compute depth to paralithic contact, save the results as site-level data
gopheridge$ <- profileApply(gopheridge,, d.hz='paralithic contact')

Re-order the gopheridge profiles by presence/absence of paralithic contact, plot, and then annotate depth of Cr contact.

# order by presence of paralithic contact and its depth
new.order <- order(gopheridge$, gopheridge$
# setup margins:
plot(gopheridge, name='hzname', plot.order=new.order)
# annotate paralithic contact with lines
x.pos <- 1:length(gopheridge)
lines(x.pos, gopheridge$[new.order], lty=2, lwd=2, col='black')

plot of chunk basic_plotting

This document is based on aqp version 1.8-6 and soilDB version 1.5-3. [2015-04-23]