Diff of /trunk/tm/inst/doc/tm.Rnw


revision 847 (Sun Apr 27 16:16:47 2008 UTC) to revision 848 (Tue Apr 29 16:51:43 2008 UTC)
# Line 1 (r847) | Line 1 (r848)
\documentclass[a4paper]{article}

\usepackage[utf8]{inputenc}
+\usepackage{url}
\DeclareUnicodeCharacter{201C}{"}
\DeclareUnicodeCharacter{201D}{"}

\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\class}[1]{\mbox{\textsf{#1}}}
-\newcommand{\func}[1]{\mbox{\texttt{#1()}}}
\newcommand{\code}[1]{\mbox{\texttt{#1}}}
\newcommand{\pkg}[1]{\strong{#1}}
-\newcommand{\samp}[1]{`\mbox{\texttt{#1}}'}
\newcommand{\proglang}[1]{\textsf{#1}}
-\newcommand{\set}[1]{\mathcal{#1}}
\newcommand{\acronym}[1]{\textsc{#1}}

%% \VignetteIndexEntry{Introduction to the tm Package}

\begin{document}
-<<echo=FALSE>>=
-options(width = 75)
-### for sampling
-set.seed <- 1234
+<<Init,echo=FALSE>>=
+options(width = 60)
+library("tm")
+data("crude")
@
\title{Introduction to the \pkg{tm} Package\\Text Mining in \proglang{R}}
\author{Ingo Feinerer}
\maketitle
\sloppy

-\begin{abstract}
-This vignette gives a short overview over available features in the
-\pkg{tm} package for text mining purposes in \proglang{R}.
-\end{abstract}
-
-\section*{Loading the Package}
-Before actually working we need to load the package:
-<<>>=
-library("tm")
-@
+\section*{Introduction}
+This vignette gives a short introduction to text mining in
+\proglang{R} utilizing the text mining framework provided by the
+\pkg{tm} package. We present methods for data import, corpus
+handling, preprocessing, meta data management, and creation of
+term-document matrices. Our focus is on the main
+aspects of getting started with text mining in \proglang{R}---an in-depth
+description of the text mining infrastructure offered by \pkg{tm} was
+published in the \emph{Journal of Statistical Software} and is available at
+\url{http://www.jstatsoft.org/v25/i05}.

\section*{Data Import}
-The main structure for managing documents is a so-called text document
-collection, denoted as corpus in linguistics (\class{Corpus}).
-Its constructor takes following arguments:
-\begin{itemize}
-\item \code{object}: a \class{Source} object which abstracts the input location.
-\item \code{readerControl}: a list with the named components
-  \code{reader}, \code{language}, and \code{load}.
-  A reader constructs a text document from a single element
-  delivered by a source. A reader must have the argument signature \code{(elem,
-    load, language, id)}. The first argument is the element provided
-  from the source, the second gives the text's language, the third
-  indicates whether the user wants to load the documents immediately
-  into memory, and the fourth is a unique identification string.
-  If the passed over \code{reader} object is of
-  class~\class{FunctionGenerator}, it is assumed to be a function
-  generating a reader. This way custom readers taking various
-  parameters (specified in \code{...}) can be built, which in fact
-  must produce a valid reader signature but can access additional
-  parameters via lexical scoping (i.e., by the including
-  environment).
-\item \code{dbControl}: a list with the named components \code{useDb}
-  indicating that database support should be activated, \code{dbName} giving the
-  filename holding the sourced out objects (i.e., the database), and
-  \code{dbType} holding a valid database type as supported by
-  \code{filehash}. Under activated database
-  support the \code{tm} packages tries to keep as few as possible
-  resources in memory under usage of the database.
-\item \code{...}: Further arguments to the reader.
-\end{itemize}
-
-Available sources are \class{DirSource}, \class{CSVSource},
-\class{GmaneSource} and
-\class{ReutersSource} which handle a directory, a mixed CSV, a Gmane
-mailing list archive \acronym{Rss} feed or a
-mixed Reuters file (mixed means several documents are in a single
-file). Except \class{DirSource}, which is designated
-solely for directories on a file system, all other implemented sources
-can take connections as input (a character string is interpreted as
-file path).
+The main structure for managing documents in \pkg{tm} is a so-called
+\class{Corpus}, representing a collection of text documents. A corpus
+can be created via its constructor
+\code{Corpus(object, readerControl, dbControl)}.
+
+\code{object} must be a \class{Source} object which abstracts the
+input location. Available sources provided by \pkg{tm} are
+\class{DirSource}, \class{VectorSource}, \class{CSVSource},
+\class{GmaneSource} and \class{ReutersSource}, which handle a
+directory, a vector interpreting each component as a document, \acronym{Csv}
+files, an \acronym{Rss} feed as delivered by the Gmane mailing list
+archive service, and a Reuters file containing several documents,
+respectively. Except \class{DirSource}, which is designed
+solely for directories on a file system, and \class{VectorSource},
+which only accepts (character) vectors, all other implemented sources
+can take connections as input (a character string is interpreted as a
+file path). \code{getSources()} lists the available sources, and
+users can create their own sources.
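
For illustration, the list of registered sources can be printed at the prompt (a sketch, not run as part of the vignette):
<<eval=FALSE>>=
## List the Source implementations currently provided by tm.
getSources()
@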

-This package ships with several readers (\code{readPlain()}
-(default), \code{readRCV1()}, \code{readReut21578XML()},
-\code{readGmane()}, \code{readNewsgroup()}, \code{readPDF()},
-\code{readDOC()} and \code{readHTML()}).
-Each source has a default reader which can be overridden. E.g., for \code{DirSource} the
-default just reads in the whole input files and interprets their
-content as text.
-
-Plain text files in a directory:
-<<keep.source=TRUE>>=
+\code{readerControl} has to be a list with the named components
+\code{reader}, \code{language}, and \code{load}. The first component,
+\code{reader}, constructs a text document from elements delivered by a
+source. The \pkg{tm} package ships with several readers (\code{readPlain()}
+(default), \code{readRCV1()}, \code{readReut21578XML()},
+\code{readGmane()}, \code{readNewsgroup()}, \code{readPDF()},
+\code{readDOC()} and \code{readHTML()}). See \code{getReaders()} for
+an up-to-date list of available readers. Each source has a default
+reader which can be overridden. E.g., for \code{DirSource} the default
+just reads in the input files and interprets their content as
+text. The second component, \code{language}, sets the texts' language,
+and the third component, \code{load}, can activate lazy document loading,
+i.e., it controls whether documents are loaded into memory immediately
+or only on demand.

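For illustration, a custom reader can be a thin wrapper around an existing one. The sketch below (not run; \code{myReader} is a made-up name) simply delegates to \code{readPlain()} while keeping the \code{(elem, load, language, id)} reader signature described in the revision 847 text above:
<<eval=FALSE>>=
## A hypothetical custom reader: delegate to readPlain(), but one could
## post-process its result here before returning it.
myReader <- function(elem, load, language, id)
    readPlain(elem, load, language, id)

Corpus(DirSource(system.file("texts", "txt", package = "tm")),
       readerControl = list(reader = myReader))
@
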
+Finally, \code{dbControl} has to be a list with the named components
+\code{useDb}, indicating whether database support should be activated,
+\code{dbName}, giving the filename holding the sourced out objects
+(i.e., the database), and \code{dbType}, holding a valid database type
+as supported by the package \pkg{filehash}. Activated database support
+reduces the memory demand; however, access gets slower since each
+operation is limited by the hard disk's read and write capabilities.

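For illustration, a corpus with activated database support might be created as follows (a sketch, not run; \code{"ovid.db"} is a made-up file name, and \code{"DB1"} is one of the database types supported by \pkg{filehash}):
<<eval=FALSE>>=
## Keep the documents in a filehash database instead of main memory.
Corpus(DirSource(system.file("texts", "txt", package = "tm")),
       readerControl = list(reader = readPlain, language = "la"),
       dbControl = list(useDb = TRUE, dbName = "ovid.db", dbType = "DB1"))
@
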
+So, e.g., plain text files in the directory \code{txt} containing Latin
+(\code{la}) texts by the Roman poet \emph{Ovid} can be read in with the
+following code:
+<<Ovid,keep.source=TRUE>>=
txt <- system.file("texts", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt),
                readerControl = list(reader = readPlain,
-                                    language = "la",
-                                    load = TRUE)))
+                                    language = "la")))
-@
-
-A single comma separated values file:
-<<>>=
-# Comma separated values
-cars <- system.file("texts", "cars.csv", package = "tm")
-Corpus(CSVSource(cars))
@

-Reuters21578 files either in directory (one document per file) or a single
-file (several documents per file). Note that connections can be used
-as input:
-<<keep.source=TRUE>>=
-# Reuters21578 XML
-reut21578 <- system.file("texts", "reut21578", package = "tm")
-reut21578XML <- system.file("texts", "reut21578.xml", package = "tm")
-reut21578XMLgz <- system.file("texts", "reut21578.xml.gz", package = "tm")
-
-(reut21578TDC <- Corpus(DirSource(reut21578),
-                        readerControl = list(reader = readReut21578XML,
-                                             language = "en_US",
-                                             load = FALSE)))
-
-Corpus(ReutersSource(reut21578XML),
-       readerControl = list(reader = readReut21578XML,
-                            language = "en_US", load = FALSE))
-Corpus(ReutersSource(gzfile(reut21578XMLgz)),
-       readerControl = list(reader = readReut21578XML,
-                            language = "en_US", load = FALSE))
-@
-Depending on your exact input format you might find
-\code{preprocessReut21578XML()} useful. For the original downloadable
-archive this function can correct invalid \acronym{Utf8} encodings and
-can copy each text document into a separate file to enable load on
-demand.
-
-Analogously we can construct collections for files in the Reuters
-Corpus Volume 1 format:
-<<>>=
-# Reuters Corpus Volume 1
-rcv1 <- system.file("texts", "rcv1", package = "tm")
-rcv1XML <- system.file("texts", "rcv1.xml", package = "tm")
-
-Corpus(DirSource(rcv1),
-       readerControl = list(reader = readRCV1, language = "en_US", load = TRUE))
-Corpus(ReutersSource(rcv1XML),
-       readerControl = list(reader = readRCV1, language = "en_US", load = FALSE))
-@
-
-Or mails from newsgroups (as found in the \acronym{Uci} \acronym{Kdd} newsgroup data set):
-<<>>=
-# UCI KDD Newsgroup Mails
+Another example could be mails from newsgroups (as found in the
+\acronym{Uci} \acronym{Kdd} newsgroup data set):
+<<Mails,keep.source=TRUE>>=
newsgroup <- system.file("texts", "newsgroup", package = "tm")

Corpus(DirSource(newsgroup),
-       readerControl = list(reader = readNewsgroup, language = "en_US", load = TRUE))
+       readerControl = list(reader = readNewsgroup,
+                            language = "en_US"))
-@
-
-\acronym{Rss} feed as delivered by Gmane for the \proglang{R} mailing list archive:
-<<eval=FALSE>>=
-rss <- system.file("texts", "gmane.comp.lang.r.gr.rdf", package = "tm")
-
-Corpus(GmaneSource(rss),
-       readerControl = list(reader = readGmane, language = "en_US", load = FALSE))
@
-
-For very simple \acronym{Html} documents:
-<<eval=FALSE>>=
-html <- system.file("texts", "html", package = "tm")
-
-Corpus(DirSource(html),
-       readerControl = list(reader = readHTML, load = TRUE))
-@
-
-And for \acronym{Pdf} documents:
-<<>>=
+or \acronym{Pdf} documents:
+<<PDF,keep.source=TRUE>>=
pdf <- system.file("texts", "pdf", package = "tm")

Corpus(DirSource(pdf),
-       readerControl = list(reader = readPDF, language = "en_US", load = TRUE))
+       readerControl = list(reader = readPDF))
@
Note that \code{readPDF()} needs \code{pdftotext} and \code{pdfinfo}
installed on your system to be able to extract the text and meta
information from your \acronym{Pdf}s.

-Finally, for \acronym{Ms} Word documents there is the reader function
-\code{readDOC()}. You need \code{antiword} installed on your system to
-be able to extract the text from your Word documents.
+For simple examples \code{VectorSource} is quite useful, as it can
+create a corpus from simple character vectors, e.g.:
+<<VectorSource,keep.source=TRUE>>=
+docs <- c("This is a text.", "This another one.")
+Corpus(VectorSource(docs))
+@

+Finally, we create a corpus for some Reuters documents as an example for
+later use:
+<<Reuters,keep.source=TRUE>>=
+reut21578 <- system.file("texts", "reut21578", package = "tm")
+reuters <- Corpus(DirSource(reut21578),
+                  readerControl = list(reader = readReut21578XML))
+@

\section*{Data Export}
-For the case you have created a text collection via manipulating other
-objects in \proglang{R}, thus do not have the texts already stored,
-and want to save the text documents to disk, you can simply use
-standard \proglang{R} routines for writing out plain text
-documents. E.g.,
-<<eval=false>>=
-lapply(ovid, function(x) writeLines(x, paste(ID(x), ".txt", sep = "")))
+In case you have created a text collection by manipulating other
+objects in \proglang{R}, and thus do not have the texts already stored
+on a hard disk, but want to save the text documents to disk, you can
+simply use standard \proglang{R} routines for writing out plain text
+documents. E.g.,
+<<eval=FALSE,keep.source=TRUE>>=
+lapply(ovid,
+       function(x) writeLines(x, paste(ID(x), ".txt", sep = "")))
@
Alternatively there is the function \code{writeCorpus()} which
encapsulates this functionality.
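
For illustration (a sketch, not run; we assume \code{writeCorpus()} takes the target directory as its \code{path} argument):
<<eval=FALSE>>=
## Write each document of the corpus to its own plain text file in the
## current working directory.
writeCorpus(ovid, path = ".")
@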

-\section*{Inspecting the Text Document Collection}
-Custom \code{show} and \code{summary} methods are available, which
-hide the raw amount of information (consider a collection could
-consists of several thousand documents, like a
-database). \code{summary} gives more details on metadata than
-\code{show}, whereas in order to actually see the content of text
-documents use the command \code{inspect} on a collection.
+\section*{Inspecting Corpora}
+Custom \code{show()} and \code{summary()} methods are available, which
+hide the raw amount of information (consider that a collection could
+consist of several thousand documents, like a
+database). \code{summary()} gives more details on meta data than
+\code{show()}, whereas the full content of text documents is displayed
+with \code{inspect()} on a collection.
<<>>=
-show(ovid)
-summary(ovid)
inspect(ovid[1:2])
@
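
For comparison, the two condensed views can be requested explicitly (a sketch, not run):
<<eval=FALSE>>=
## One-line overview versus a more detailed meta data listing.
show(ovid)
summary(ovid)
@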

\section*{Transformations}
-Once we have a text document collection one typically wants to modify
+Once we have a text document collection we typically want to modify
the documents in it, e.g., stemming, stopword removal, et cetera. In
\pkg{tm}, all this functionality is subsumed into the concept of
\emph{transformation}s. Transformations are done via the \code{tmMap}
# Line 218 (r847) | Line 157 (r848)
and \code{tmMap} just applies them to all documents in a document
collection.

-\subsection*{Loading Documents into Memory}
-If the source objects supports load on demand, but the user has not
-enforced the package to load the input content directly into memory,
-this can be done manually via \code{loadDoc}. Normally it is not
-necessary to call this explicitly, as other functions working on text
-corpora trigger this function for not-loaded documents (the corpus is
-automatically loaded if accessed via \code{[[}).
-<<>>=
-reut21578TDC <- tmMap(reut21578TDC, loadDoc)
-@
-
-\subsection*{Converting to Plaintext Documents}
-The text document collection \code{reut21578TDC} contains documents
-in XML format. We have no further use for the XML interna and just
-want to work with the text content. This can be done by converting the
-documents to plaintext documents. It is done by the generic
-\code{asPlain}.
+\subsection*{Converting to Plain Text Documents}
+The text document collection \code{reuters} contains documents
+in \acronym{Xml} format. We have no further use for the \acronym{Xml}
+internals and just want to work with the text content. This can be done
+by converting the documents to plain text documents via the
+generic \code{asPlain()}.
<<>>=
-reut21578TDC <- tmMap(reut21578TDC, asPlain)
+reuters <- tmMap(reuters, asPlain)
@

\subsection*{Eliminating Extra Whitespace}
Extra whitespace is eliminated by:
<<>>=
-reut21578TDC <- tmMap(reut21578TDC, stripWhitespace)
+reuters <- tmMap(reuters, stripWhitespace)
@

\subsection*{Convert to Lower Case}
Conversion to lower case is done by:
<<>>=
-reut21578TDC <- tmMap(reut21578TDC, tmTolower)
+reuters <- tmMap(reuters, tmTolower)
@

\subsection*{Remove Stopwords}
Removal of stopwords is done by:
<<>>=
-reut21578TDC <- tmMap(reut21578TDC, removeWords, stopwords("english"))
+reuters <- tmMap(reuters, removeWords, stopwords("english"))
@

\subsection*{Stemming}
Stemming is done by:
<<>>=
-tmMap(reut21578TDC, stemDoc)
+tmMap(reuters, stemDoc)
@

\section*{Filters}
Often it is of special interest to filter out documents satisfying given
properties. For this purpose the function \code{tmFilter} is
-designated. It is possible to write custom filter functions, but for
+designed. It is possible to write custom filter functions, but for
most cases the default filter does its job: it integrates a minimal
query language to filter metadata. Statements in this query language
are statements as used for subsetting data frames.

-E.g., the following statement filters out those documents having
-\code{COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE} as their
+E.g., the following statement filters out those documents having the string
+``\code{COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE}'' as their
heading and an \code{ID} equal to 10 (both are metadata slot
variables of the text document).
<<keep.source=TRUE>>=
query <- "identifier == '10' &
          heading == 'COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE'"
-tmFilter(reut21578TDC, query)
+tmFilter(reuters, query)
@

There is also a full text search filter available which accepts regular
expressions:
-<<>>=
-tmFilter(reut21578TDC, FUN = searchFullText, "partnership", doclevel = TRUE)
+<<keep.source=TRUE>>=
+tmFilter(reuters, FUN = searchFullText,
+         pattern = "partnership", doclevel = TRUE)
-@
-
-\section*{Adding Data or Metadata}
-
-Text documents or metadata can be added to text document collections
-with \code{appendElem} and \code{appendMeta}, respectively. The text
-document collection has two types of metadata: one is the metadata on
-the document collection level (\code{cmeta}), the other is the metadata
-related to the individual documents (e.g., clusterings) (\code{dmeta})
-in form of a dataframe. For the method \code{appendElem} it is possible
-to give a row of values in the dataframe for the added data element.
-<<>>=
-data(crude)
-reut21578TDC <- appendElem(reut21578TDC, crude[[1]], 0)
-reut21578TDC <- appendMeta(reut21578TDC,
-                           cmeta = list(test = c(1,2,3)), dmeta = list(cl1 = 1:11))
-summary(reut21578TDC)
-CMetaData(reut21578TDC)
-DMetaData(reut21578TDC)
@
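
A custom filter can be sketched along the following lines (not run; the filter function and the \code{"OIL"} pattern are made up, and we assume that, as with \code{searchFullText} above, the function is applied per document when \code{doclevel = TRUE} and that the \code{Heading()} accessor is available for text documents):
<<eval=FALSE>>=
## Keep only documents whose heading matches a regular expression.
tmFilter(reuters,
         FUN = function(doc) length(grep("OIL", Heading(doc))) > 0,
         doclevel = TRUE)
@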

-\section*{Removing Metadata}
-The metadata of text document collections can be easily modified or
-removed:
-<<>>=
-data(crude)
-reut21578TDC <- removeMeta(reut21578TDC, cname = "test", dname = "cl1")
-CMetaData(reut21578TDC)
-DMetaData(reut21578TDC)
+\section*{Meta Data Management}
+Meta data is used to annotate text documents or whole corpora with
+additional information. The easiest way to accomplish this with
+\pkg{tm} is to use the \code{meta()} function. A text document has a
+few predefined slots like \code{Author}, but can be extended with an
+arbitrary number of local meta data tags. As an alternative to
+\code{meta()}, the function \code{DublinCore()} provides a full mapping
+between Simple Dublin Core meta data and \pkg{tm} meta data structures
+and can similarly be used to get and set meta data information for
+text documents, e.g.:
+<<DublinCore>>=
+DublinCore(crude[[1]], "Creator") <- "Ano Nymous"
+meta(crude[[1]])
+@
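
The corresponding getter can be used to read the information back (a sketch, not run):
<<eval=FALSE>>=
## Query a single Dublin Core element of the first document.
DublinCore(crude[[1]], "Creator")
@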

+For corpora the story is a bit more difficult. Text document
+collections in \pkg{tm} have two types of meta data: one is the meta
+data on the document collection level (\code{corpus} level), the other
+is the meta data related to the individual documents (\code{indexed}
+level), in form of a data frame. The latter is often done for
+performance reasons (hence the name \code{indexed}, for indexing) or
+because the meta data forms an entity of its own but still relates
+directly to individual text documents, e.g., a classification result;
+the classifications directly relate to the documents, but the set of
+classification levels forms an entity of its own. Both cases can be
+handled with \code{meta()}:
+<<>>=
+meta(crude, tag = "test", type = "corpus") <- "test meta"
+meta(crude, type = "corpus")
+meta(crude, "foo") <- letters[1:20]
+meta(crude)
@
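
A single indexed tag could then be retrieved like this (a sketch, not run; we assume \code{meta()} supports tag lookup at the corpus level just as it does for individual documents):
<<eval=FALSE>>=
## Retrieve the document-level ("indexed") tag set above.
meta(crude, tag = "foo")
@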

-\section*{Operators}
+\section*{Standard Operators and Functions}
Many standard operators and functions (\code{[}, \code{[<-}, \code{[[}, \code{[[<-},
-  \code{c}, \code{length}, \code{lapply}, \code{sapply}) are available for text document
-collections with semantics similar to standard \proglang{R}
-routines.
-E.g., \code{c} concatenates two (or more) text document
+\code{c()}, \code{length()}, \code{lapply()}, \code{sapply()}) are
+available for text document collections with semantics similar to standard \proglang{R}
+routines. E.g., \code{c()} concatenates two (or more) text document
collections. Applied to several text documents it returns a text
document collection. The metadata is automatically updated if text
document collections are concatenated (i.e., merged).
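
For illustration (a sketch, not run):
<<eval=FALSE>>=
## Concatenate two corpora; the result is again a corpus with merged
## meta data.
combined <- c(ovid, reuters)
length(combined)
summary(combined)
@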

-Note also the custom element-of operator---it checks whether a text
+There is also a custom element-of operator---it checks whether a text
document is already in a text document collection (metadata is not
checked, only the corpus):
<<>>=
-crude[[1]] %IN% reut21578TDC
-crude[[2]] %IN% reut21578TDC
+reuters[[1]] %IN% reuters
+crude[[1]] %IN% reuters
-@
-
-\section*{Keeping Track of Text Document Collections}
-There is a mechanism available for managing text document
-collections. It is called \class{TextRepository}. A typical use would
-be to save different states of a text document collection. A
-repository has metadata in list format which can be either set with
-\code{appendElem} as additional argument (e.g., a date when a new
-element is added), or directly with \code{appendMeta}.
-<<>>=
-data(acq)
-repo <- TextRepository(reut21578TDC)
-repo <- appendElem(repo, acq, list(modified = date()))
-repo <- appendMeta(repo, list(moremeta = 5:10))
-summary(repo)
-RepoMetaData(repo)
-summary(repo[[1]])
-summary(repo[[2]])
@

\section*{Creating Term-Document Matrices}
A common approach in text mining is to create a term-document matrix
-for given texts. In this package the class \class{TermDocMatrix}
+from a corpus. In the \pkg{tm} package the class \class{TermDocMatrix}
handles sparse matrices for text document collections.
<<>>=
-tdm <- TermDocMatrix(reut21578TDC)
-Data(tdm)[1:8,150:155]
+tdm <- TermDocMatrix(reuters)
+Data(tdm)[1:5,150:155]
@
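
The overall size of the matrix can be checked in the usual way (a sketch, not run; as the subsetting example suggests, documents form the rows and terms the columns):
<<eval=FALSE>>=
## Number of documents and number of distinct terms.
dim(Data(tdm))
@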

\section*{Operations on Term-Document Matrices}
-Besides the fact that on the \code{Data} part of this matrix a huge amount of \proglang{R}
-functions (like clustering, classifications, etc.) is possible, this
-package brings some shortcuts. Consider we
-want to find those terms that occur at least 5 times:
+Besides the fact that a huge number of \proglang{R} functions (like
+clustering or classification methods) can be applied to the \code{Data}
+part of this matrix, this package brings some shortcuts. Imagine we
+want to find those terms that occur at least five times; then we can use
+the \code{findFreqTerms()} function:
<<>>=
-findFreqTerms(tdm, 5, Inf)
+findFreqTerms(tdm, 5)
@
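
As mentioned above, standard \proglang{R} functions work on the \code{Data} part; e.g., a hierarchical clustering of the documents could be sketched as follows (not run; we assume the sparse \code{Data} slot can be coerced to a dense matrix):
<<eval=FALSE>>=
## Cluster documents by the Euclidean distances of their term vectors.
plot(hclust(dist(as.matrix(Data(tdm)))))
@
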
Or we want to find associations (i.e., terms which correlate) with at
-least $0.97$ correlation for the term \code{crop}:
+least $0.97$ correlation for the term \code{crop}; then we use
+\code{findAssocs()} (we only display ten arbitrary associations
+found):
<<>>=
-findAssocs(tdm, "crop", 0.97)
+findAssocs(tdm, "crop", 0.97)[31:40]
@
The function also accepts a matrix as first argument (which does not
inherit from a term-document matrix). This matrix is then interpreted
# Line 394 (r847) | Line 311 (r848)
A dictionary is a (multi-)set of strings. It is often used to represent
relevant terms in text mining. We provide a class \class{Dictionary}
implementing such a dictionary concept. It can be created via the
-\code{Dictionary} constructor, e.g.,
+\code{Dictionary()} constructor, e.g.,
<<>>=
(d <- Dictionary(c("dlrs", "crude", "oil")))
@
-and may be passed over to the \code{TermDocMatrix} constructor. Then
+and may be passed over to the \code{TermDocMatrix()} constructor. Then
the created matrix is tabulated against the dictionary, i.e., only
terms from the dictionary appear in the matrix. This makes it possible
to restrict the dimension of the matrix a priori and to focus on
specific terms for distinct text mining contexts, e.g.,
<<>>=
-tdmD <- TermDocMatrix(reut21578TDC, list(dictionary = d))
+tdmD <- TermDocMatrix(reuters, list(dictionary = d))
Data(tdmD)
@
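
Such a dictionary-restricted matrix can then be used, e.g., for simple term counts (a sketch, not run; again assuming coercion to a dense matrix with terms as columns):
<<eval=FALSE>>=
## Total occurrences of each dictionary term over all documents.
colSums(as.matrix(Data(tdmD)))
@
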
-You can also create a dictionary from a term-document matrix via
-\code{createDictionary} holding all terms from the matrix e.g.,
-<<>>=
-createDictionary(tdm)[100:110]
-@
\end{document}

Legend:
-  line removed in revision 848 (present only in revision 847)
+  line added in revision 848
Unprefixed lines are unchanged between the two revisions.
