[tm] Diff of /trunk/tm/man/TextDocCol.Rd

revision 61, Mon Oct 23 20:07:05 2006 UTC revision 62, Tue Oct 24 10:08:58 2006 UTC
# Line 4  Line 4 
4  \alias{TextDocCol,character-method}  \alias{TextDocCol,character-method}
5  \title{Text document collection}  \title{Text document collection}
6  \description{  \description{
7    Constructs a text document collection from various sources.    Constructs a text document collection.
8  }  }
9  \usage{  \usage{
10  \S4method{TextDocCol}{character}(object, inputType = "CSV",  \S4method{TextDocCol}{character}(object, parser = plaintext.parser, lod = FALSE)
 stripWhiteSpace = FALSE, toLower = FALSE)  
11  }  }
12  \arguments{  \arguments{
13    \item{object}{a directory containing the documents.}    \item{object}{a directory containing the documents.}
14    \item{inputType}{determines the input format. Possible settings are    \item{parser}{a parsing function capable of handling the file format
15      \itemize{      found in \code{object}.}
16        \item \code{PLAIN} Plain text format    \item{lod}{a logical value indicating whether the text corpus should
17        be loaded immediately into memory (\code{lod = TRUE}) or loaded when
18        Each entry in the \code{object} directory is imported without      necessary (\code{lod = FALSE}). This minimizes memory
19        further knowledge about the file format. That means the file is      demands for large document collections.}
       interpreted as plain text without any markup or meta  
       information. The document \code{ID}s are generated automatically  
       and can be found together with the filename in the plain text  
       document's metadata.  
       \item \code{CSV} Comma separated values format  
       This input format requires no special markup (like XML). Instead a  
       plain file with comma separated values per line is used. The first  
       column has to be the identification number for the text document,  
       the rest is used as the text corpus. All lines must have the same  
       number of arguments.  
       This format is especially useful for the fast import from other  
       sources (like export from Excel files, \ldots).  
       \item \code{RCV1} Reuters Corpus Volume 1 XML format  
       The Reuters Corpus Volume 1 comes in a special XML format which  
       can be imported directly with this switch.  
       \item \code{REUT21578} Reuters 21578 XML format  
       An XML file format as used by the Reuters21578 XML dataset. This  
       switch returns directly plain text documents with the corpus in  
       \item \code{REUT21578_XML} Reuters 21578 XML format  
       An XML file format as used by the Reuters21578 XML dataset. The  
       pure \code{XMLTextDocument}s are returned with loading on demand  
       activated. Thus the text corpus resides on disk until used.  
       \item \code{NEWSGROUP} UCI KDD Newsgroup Dataset format.  
       A file format as found in the UCI KDD Newsgroup dataset. Thus each  
       newsgroup mail must be named after a unique \code{ID} and contain  
       distinct meta information in form of headers.  
       \item \code{RIS} Austrian Rechtsinformationssystem des Bundes HTML  
       Each HTML file has to follow this name schema:  
       \code{\emph{Geschäftszahl}.html}. This ensures that we can  
       identify each document, as the body of the HTML file is not  
       further investigated.  
       Austrian RIS: \url{}  
20      }      }
21    }    }
   \item{stripWhiteSpace}{if set, replaces any white space in the texts with a  
     single space.}  
   \item{toLower}{if set, replaces all upper characters in the text with  
     lower ones.}  
22  \value{  \value{
23    An S4 object of class \code{TextDocCol} which extends the class    An S4 object of class \code{TextDocCol} which extends the class
24    \code{list} containing a collection of text documents.    \code{list} containing a collection of text documents.
25  }  }
26  \author{Ingo Feinerer \email{}}  \author{Ingo Feinerer}
27  \keyword{methods}  \keyword{methods}