[tm] View of /pkg/man/termFreq.Rd
Revision 1152
Thu Nov 17 14:30:42 2011 UTC by feinerer
File size: 2844 byte(s)
Fix typo
\title{Term Frequency Vector}
  Generate a term frequency vector from a text document.
termFreq(doc, control = list())
  \item{doc}{An object inheriting from \code{TextDocument}.}
  \item{control}{A list of control options which override default
    settings. Possible options are
      \item{\code{tolower}}{A function converting characters to lower
	case. Defaults to \code{\link{tolower}}.}
      \item{\code{removePunctuation}}{A logical value indicating whether
	punctuation characters should be removed from
	\code{doc}, or a function which performs punctuation
	removal. Defaults to \code{FALSE}.}
      \item{\code{tokenize}}{A function tokenizing documents into single
	tokens or a string matching one of the predefined tokenization
	functions:
	  \item{\code{scan}}{for \code{\link{scan_tokenizer}}, or}
	  \item{\code{MC}}{for \code{\link{MC_tokenizer}}.}
	Defaults to \code{\link{scan_tokenizer}}.}
      \item{\code{removeNumbers}}{A logical value indicating whether
	numbers should be removed from \code{doc}. Defaults to \code{FALSE}.}
      \item{\code{stemming}}{Either a logical value indicating whether tokens
	should be stemmed or a stemming function. Defaults to \code{FALSE}.}
      \item{\code{stopwords}}{Either a logical value indicating stopword
	removal using the default language-specific stopword lists shipped
	with this package or a character vector holding custom
	stopwords. Defaults to \code{FALSE}.}
      \item{\code{dictionary}}{A character vector to be tabulated
	against. No other terms will be listed in the result. Defaults
	to \code{NULL} which means that all terms in \code{doc} are
	listed.}
      \item{\code{bounds}}{A list with a tag \code{local} whose value
	must be an integer vector of length 2. Terms that appear less
	often in \code{doc} than the lower bound \code{bounds$local[1]}
	or more often than the upper bound \code{bounds$local[2]} are
	discarded. Defaults to \code{list(local = c(1, Inf))} (i.e., every
	token will be used).}
      \item{\code{wordLengths}}{An integer vector of length 2. Words
	shorter than the minimum word length \code{wordLengths[1]} or
	longer than the maximum word length \code{wordLengths[2]} are
	discarded. Defaults to \code{c(3, Inf)}, i.e., a minimum word
	length of 3 characters.}
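The way several of these options interact can be approximated in base R. The following is only a rough sketch under simplifying assumptions, not the package's implementation: `sketch_term_freq` is a hypothetical helper, punctuation is stripped with a plain regular expression, tokens are split on whitespace, and stemming, stopword removal and number removal are left out.

```r
# Hypothetical helper, NOT part of tm: mimics a few control options
# (tolower, removePunctuation, wordLengths, dictionary, bounds) in base R.
sketch_term_freq <- function(text,
                             wordLengths = c(3, Inf),
                             bounds = list(local = c(1, Inf)),
                             dictionary = NULL) {
  txt <- tolower(text)                             # cf. the tolower option
  txt <- gsub("[[:punct:]]+", "", txt)             # cf. removePunctuation = TRUE
  tokens <- unlist(strsplit(txt, "[[:space:]]+"))  # simple whitespace tokenizer
  tokens <- tokens[nzchar(tokens)]                 # drop empty tokens

  # wordLengths: keep tokens whose character count lies in [min, max]
  n <- nchar(tokens)
  tokens <- tokens[n >= wordLengths[1] & n <= wordLengths[2]]

  # dictionary: if given, no other terms are counted
  if (!is.null(dictionary)) tokens <- tokens[tokens %in% dictionary]

  # tabulate into a named integer vector
  tab <- table(tokens)
  freq <- setNames(as.integer(tab), names(tab))

  # bounds$local: keep terms whose count lies in [lower, upper]
  freq[freq >= bounds$local[1] & freq <= bounds$local[2]]
}

x <- "Oil prices rose. Oil output fell, and prices of oil products rose too!"
sketch_term_freq(x)                                    # "of" is too short
sketch_term_freq(x, bounds = list(local = c(2, Inf)))  # terms seen >= 2 times
```

With the default `wordLengths` the two-letter token "of" is dropped before counting, and raising the lower local bound to 2 leaves only the terms that occur at least twice in the input.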
  A named integer vector of class \code{term_frequency} with term
  frequencies as values and tokens as names.
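Because the result is a named integer vector, the usual base R idioms for named vectors should apply to it. The sketch below uses a hand-made stand-in with made-up counts rather than a real `termFreq` result, and it does not set the `term_frequency` class.

```r
# Hand-made stand-in for a termFreq() result; the counts are illustrative only.
tf <- c(and = 1L, oil = 3L, prices = 2L)

tf[["oil"]]                  # look up the frequency of a single term by name
names(tf)[tf >= 2L]          # terms that occur at least twice
sort(tf, decreasing = TRUE)  # most frequent terms first
```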
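data("crude")
# Custom tokenizer: split on runs of whitespace
strsplit_space_tokenizer <- function(x) unlist(strsplit(x, "[[:space:]]+"))
# wordLengths = c(4, Inf) keeps only tokens with at least 4 characters
ctrl <- list(removePunctuation = TRUE, tokenize = strsplit_space_tokenizer,
             stemming = TRUE, wordLengths = c(4, Inf))
termFreq(crude[[1]], control = ctrl)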
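The custom tokenizer from the example can also be tried on its own, without the package. The sketch below shows what it returns for a short string, and one behaviour of `strsplit` worth knowing: input that starts with whitespace yields an empty first token. `trimmed_space_tokenizer` is a hypothetical variant, not something the package provides, that trims the input first.

```r
# Tokenizer from the example above: split on runs of whitespace.
strsplit_space_tokenizer <- function(x) unlist(strsplit(x, "[[:space:]]+"))

strsplit_space_tokenizer("Crude oil  prices\nrose")
strsplit_space_tokenizer("  leading space")  # first token is the empty string

# Hypothetical variant: trim surrounding whitespace before splitting,
# which avoids the empty leading token.
trimmed_space_tokenizer <- function(x) {
  unlist(strsplit(trimws(x), "[[:space:]]+"))
}
trimmed_space_tokenizer("  leading space")
```

Whether an empty token matters in practice depends on the other control options; a minimum word length of 1 or more, for instance, would filter it out later anyway.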