View of /pkg/ChangeLog

Revision 1445
Sun Oct 9 09:30:58 2016 UTC (6 years, 1 month ago) by feinerer
File size: 38916 byte(s)
Speed up termFreq(), general cleanup

- Avoid parallel::mclapply()
- Use custom .table()
- Use, rep_len() and lengths()
- Fix typos
- Shorten overlong lines
- Consistent formatting

2014-04-20  Ingo Feinerer <>

	* ChangeLog: Not maintained as a separate file anymore. Please consult
	the tm Subversion log messages instead.

2014-02-25  Ingo Feinerer <>

	* NAMESPACE: Export pGetElem.URISource.

2014-02-23  Ingo Feinerer <>

	* R/complete.R (stemCompletion.PlainTextDocument): Avoid spurious
	duplicate results. Reported by Seong-Hyeon Kim.

2014-01-28  Ingo Feinerer <>

	* R/utils.R (map_IETF_Snowball): Process three letter codes.

2014-01-07  Ingo Feinerer <>

	* DESCRIPTION (Version): Prepare for CRAN New Year release.

2014-01-05  Ingo Feinerer <>

	* R/matrix.R (findAssocs): Allow multiple and non-existing terms.
	Suggested by Christian Buchta.

	* R/source.R (is.Source): New check for valid source.

2013-12-28  Ingo Feinerer <>

	* R/matrix.R (findAssocs): Make corlimit inclusive.
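
	The effect of an inclusive correlation limit can be sketched as
	follows. This is a language-neutral illustration in Python, not
	tm's R code: find_assocs() and its dictionary input are
	hypothetical stand-ins, and the correlations are assumed to be
	already computed.

```python
def find_assocs(correlations, corlimit):
    """Return terms whose correlation is at least corlimit.

    correlations: dict mapping term -> correlation with the query term
    (hypothetical input format). The bound is inclusive, so a term
    whose correlation equals corlimit exactly is kept.
    """
    kept = {t: r for t, r in correlations.items() if r >= corlimit}
    # Strongest associations first.
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))


assocs = find_assocs({"oil": 1.0, "price": 0.5, "opec": 0.8}, corlimit=0.8)
print(assocs)  # "opec" is kept because 0.8 >= 0.8
```

	With a strict comparison (r > corlimit), a limit of 1.0 could
	never match even perfectly correlated terms, which is one
	plausible reason for preferring the inclusive bound.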

2013-09-27  Ingo Feinerer <>

	* R/source.R: Allow multiple URIs for URISource.

2013-09-19  Ingo Feinerer <>

	* R/source.R (Source): New Source constructor.

2013-08-26  Ingo Feinerer <>

	* R/source.R (DirSource): Report non-existent or non-readable files.
	Suggested by Ajinkya Kale and Milan Bouchet-Valat.

2013-08-19  Ingo Feinerer <>

	* R/corpus.R (setOldClass): Do not register VCorpus as S4 class

	* R/doc.R (setOldClass): Do not register PlainTextDocument as S4 class

2013-08-09  Ingo Feinerer <>

	* DESCRIPTION (License): Changed to GPL-3.

2013-07-25  Ingo Feinerer <>

	* R/complete.R (stemCompletion): Report NA instead of error when no
	completion can be found by the prevalent heuristic. Suggested by Hugh

2013-07-10  Ingo Feinerer <>

	* R/reader.R (readPDF): Use tm:::pdfinfo() (which needs the pdfinfo
	command line tool) instead of tools:::pdf_info().

2013-04-11  Ingo Feinerer <>

	* R/transform.R (removeWords): Use PCRE UCP to use Unicode properties
	to determine character types.

2012-12-14  Ingo Feinerer <>

	* R/matrix.R (TermDocumentMatrix): Ensure dimnames of type character
	when generating a simple_triplet_matrix. Reported by Arho Suominen. 

2012-12-10  Ingo Feinerer <>

	* man/tm_reduce.Rd: Document right to left folding order. Adapt
	example as well. Suggested by Mark Rosenstein.
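
	Right-to-left folding means that the last function in the list
	is applied first. A minimal sketch of that order, in Python
	rather than R: fold_right(), to_upper() and add_suffix() are
	hypothetical names used only for illustration and are not part
	of tm.

```python
from functools import reduce


def fold_right(funs, x):
    """Apply the functions in funs to x from right to left."""
    return reduce(lambda acc, f: f(acc), reversed(funs), x)


def to_upper(s):
    return s.upper()


def add_suffix(s):
    return s + "_x"


# add_suffix runs first, then to_upper.
print(fold_right([to_upper, add_suffix], "a"))  # A_X
```

	Applying the same list from left to right would give "A_x"
	instead, so the order is worth documenting whenever the
	functions do not commute.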

2012-12-04  Ingo Feinerer <>

	* R/filter.R (sFilter): Avoid attach() and simplify.

2012-11-02  Ingo Feinerer <>

	* R/doc.R (.TextDocument): Use casts to ensure data types and to avoid
	removal of attributes.

2012-10-03  Ingo Feinerer  <>

	* R/weight.R (weightTfIdf, weightSMART): Gracefully handle empty
	columns and rows (avoids blow-up due to NaN values). Suggested by Jaap
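
	The NaN problem arises because an empty document leads to 0/0
	in a normalized term frequency, and a term that occurs nowhere
	leads to a division by zero in the inverse document frequency.
	The sketch below shows one way to guard both cases. It is a
	generic tf-idf in Python under stated assumptions (dense
	terms x documents lists, log base 2), not the code of
	weightTfIdf() or weightSMART().

```python
import math


def tf_idf(matrix, normalize=True):
    """matrix[i][j] is the count of term i in document j."""
    n_terms = len(matrix)
    n_docs = len(matrix[0]) if matrix else 0
    col_sums = [sum(matrix[i][j] for i in range(n_terms)) for j in range(n_docs)]
    doc_freq = [sum(1 for v in row if v > 0) for row in matrix]

    weighted = []
    for i, row in enumerate(matrix):
        # Empty row: the term occurs in no document, so use idf 0
        # instead of evaluating log2(n_docs / 0).
        idf = math.log2(n_docs / doc_freq[i]) if doc_freq[i] > 0 else 0.0
        new_row = []
        for j, count in enumerate(row):
            if normalize:
                # Empty column: the document has no terms, so use tf 0
                # instead of evaluating 0 / 0.
                tf = count / col_sums[j] if col_sums[j] > 0 else 0.0
            else:
                tf = float(count)
            new_row.append(tf * idf)
        weighted.append(new_row)
    return weighted


# Third document (column) and third term (row) are both empty.
counts = [[1, 0, 0],
          [2, 2, 0],
          [0, 0, 0]]
weights = tf_idf(counts)
print(all(math.isfinite(v) for row in weights for v in row))  # True
```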

2012-07-27  Ingo Feinerer  <>

	* R/transform.R (removeWords): Allow longer stopword lists.

2012-01-31  Ingo Feinerer  <>

	* R/reader.R (readXML): Readers can now set the document language

2012-01-14  Ingo Feinerer  <>

	* R/source.R (XMLSource, getElem.XMLSource): Simplifications as
	proposed by Milan Bouchet-Valat.

2012-01-11  Ingo Feinerer  <>

	* R/matrix.R (termFreq): Fix processing of user provided
	stopwords. Reported by Bettina Grün.

2011-12-23  Ingo Feinerer  <>

	* R/matrix.R (termFreq): Fix invalid handling of
	control$wordLengths[1]. Reported by Steven C. Bagley.

2011-12-17  Ingo Feinerer  <>

	* DESCRIPTION (Version): Prepare for CRAN Christmas release.

2011-12-12  Ingo Feinerer  <>

	* R/utils.R (map_IETF_Snowball): Map empty input to "porter".

2011-12-07  Ingo Feinerer  <>

	* R/transform.R (removePunctuation): Add option to preserve
	intra-word dashes.
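
	One way to preserve intra-word dashes is to protect them with a
	placeholder, strip all punctuation, and then restore them. The
	following Python sketch illustrates the idea; the function and
	parameter names are hypothetical, ASCII punctuation is assumed,
	and it is not the R implementation of removePunctuation().

```python
import re
import string

# Character class matching any ASCII punctuation character.
_PUNCT = "[" + re.escape(string.punctuation) + "]+"


def remove_punctuation(text, preserve_intra_word_dashes=False):
    """Strip ASCII punctuation, optionally keeping dashes inside words."""
    if not preserve_intra_word_dashes:
        return re.sub(_PUNCT, "", text)
    placeholder = "\x01"  # assumed not to occur in the input
    # Protect only dashes that have a word character on both sides.
    protected = re.sub(r"(?<=\w)-(?=\w)", placeholder, text)
    stripped = re.sub(_PUNCT, "", protected)
    return stripped.replace(placeholder, "-")


print(remove_punctuation("state-of-the-art, really!", True))
# state-of-the-art really
```

	Dashes that stand alone or sit at a word boundary are still
	removed; only those between two word characters survive.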

2011-12-06  Ingo Feinerer  <>

	* R/matrix.R (termFreq): Allow reordering of control option

2011-11-17  Ingo Feinerer  <>

	* R/reader.R (readPDF): Use tools:::pdf_info() instead of external
	pdfinfo tool.

	* inst/stopwords/SMART.dat: Add SMART information retrieval system
	stopwords (which are also used by the MC toolkit).

	* R/matrix.R (termFreq): Allow local option \code{bounds$local} to
	restrict how often a term may appear in each document (generalizes
	\code{minDocFreq}). Similarly the local option \code{wordLengths}
	for word length bounds (generalizes \code{minWordLength}).

	* R/matrix.R (TermDocumentMatrix.VCorpus): New global option
	\code{bounds$global} for restricting how often a term is allowed
	to appear in different documents.

	* R/matrix.R (TermDocumentMatrix.VCorpus): Distinguish between
	local options delegated internally to termFreq() and global
	options which are processed by the term-document matrix
	constructor itself.
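
	The split between local and global options described in the
	entries above can be sketched like this. The Python names
	term_freq(), term_document_matrix(), word_lengths, local_bounds
	and global_bounds are hypothetical analogues chosen for
	illustration, and the default values are assumptions; the
	sketch is not tm's implementation.

```python
import math
from collections import Counter


def term_freq(tokens, word_lengths=(3, math.inf), local_bounds=(1, math.inf)):
    """Count terms in one document, applying the per-document options."""
    min_len, max_len = word_lengths
    lo, hi = local_bounds
    counts = Counter(t for t in tokens if min_len <= len(t) <= max_len)
    # Local bound: how often a term may appear within this document.
    return {t: c for t, c in counts.items() if lo <= c <= hi}


def term_document_matrix(docs, global_bounds=(1, math.inf), **local_options):
    """Build {term: [count per document]}.

    Local options are delegated to term_freq(); the global bound on
    the number of documents containing a term is applied here.
    """
    freqs = [term_freq(tokens, **local_options) for tokens in docs]
    lo, hi = global_bounds
    doc_freq = Counter(t for f in freqs for t in f)
    terms = sorted(t for t, df in doc_freq.items() if lo <= df <= hi)
    return {t: [f.get(t, 0) for f in freqs] for t in terms}


docs = [["oil", "oil", "is", "crude"], ["oil", "price"], ["price", "up"]]
print(term_document_matrix(docs, global_bounds=(2, math.inf)))
# {'oil': [2, 1, 0], 'price': [0, 1, 1]}
```

	Keeping the global filter in the matrix constructor is natural
	because document frequency is only known once every document
	has been counted.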

2011-11-15  Ingo Feinerer  <>

	* man/getTokenizers.Rd: Document getTokenizers().

	* man/tokenizer.Rd: Document MC_tokenizer() and scan_tokenizer().

2011-11-04  Ingo Feinerer  <>

	* man/matrix.Rd: Document as.TermDocumentMatrix.term_frequency.

	* man/combine.Rd: Document c.term_frequency().

2011-10-11  Ingo Feinerer  <>

	* R/meta.R (`meta<-.Corpus`): Assume that the replacement value
	can be accessed via '[' and not '[['.

2011-08-24  Ingo Feinerer  <>

	* R/stopwords.R (stopwords): Raise an error if no stopwords are
	available for requested language. Suggested by Derek M Jones.

2011-05-27  Ingo Feinerer  <>

	* R/weight.R (weightSMART): Implement Cosine and pivoted unique

2011-02-17  Ingo Feinerer  <>

	* R/transform.R (stemDocument.PlainTextDocument): Use language

2011-02-04  Ingo Feinerer  <>

	* R/source.R: Store strings and connections instead of unevaluated

2010-11-26  Ingo Feinerer  <>

	* R/corpus.R (Corpus): Allow init and exit hooks for readers.

2010-10-22  Ingo Feinerer  <>

	* R/matrix.R (.TermDocumentMatrix): Make Weighting an attribute
	(instead of a list element).

2010-10-16  Ingo Feinerer  <>

	* R/corpus.R (`[[.VCorpus`, `[[.PCorpus`): Access individual
	documents by names (fallback to IDs if names are not set).

2010-08-25  Ingo Feinerer  <>

	* R/corpus.R (c.Corpus): When concatenating corpora, the argument
	\code{recursive} now determines whether existing corpus metadata
	is used.

2010-08-06  Ingo Feinerer  <>

	* R/transform.R: Removed convert_UTF_8(). Use enc2utf8() instead.

2010-06-17  Ingo Feinerer  <>

	* R/matrix.R (TermDocumentMatrix): If a dictionary is given do not
	remove terms not occurring in the corpus anymore.

2010-06-02  Ingo Feinerer  <>

	* R/plot.R (Zipf_plot, Heaps_plot): Plotting functions for Zipf's
	and Heaps' law.

2010-05-18  Ingo Feinerer  <>

	* R/corpus.R (Corpus, PCorpus): Use element names as IDs if
	provided by a source.

2010-04-09  Ingo Feinerer  <>

	* R/source.R (.Source): Provide document names.

2010-04-07  Ingo Feinerer  <>

	* R/meta.R (`content_or_meta`): Utility function.

2010-03-19  Ingo Feinerer  <>

	* R/reader.R (readReut21578XML, readReut21578XMLasPlain): Extract

2010-03-03  Ingo Feinerer  <>

	* R/weight.R (weightTfIdf): Added normalization option.

	* man/tm_tag_score.Rd: Add General Inquirer example for sentiment

2010-02-25  Ingo Feinerer  <>

	* R/score.R (tm_tag_score): Compute a score from the number of
	tags matching in a document.

2010-02-18  Ingo Feinerer  <>

	* R/complete.R (stemCompletion): New completion heuristics.

2010-02-17  Ingo Feinerer  <>

	* R/plot.R (plot.TermDocumentMatrix): Memory improvements.

2010-02-06  Ingo Feinerer  <>

	* DESCRIPTION (Depends): Depend on R (>= 2.10.0) to ensure that
	setOldClass(c(..., "list")) works.

2010-01-22  Ingo Feinerer  <>

	* R/transform.R (stemDocument.character): In case input is a
	simple character just delegate to the default Snowball stemmer.

2010-01-15  Ingo Feinerer  <>

	* R/reader.R (readReut21578XML, readRCV1): Extract more meta

2010-01-12  Ingo Feinerer  <>

	* R/doc.R (`Content<-`): Be careful with names attribute.

2010-01-07  Stefan Theussl  <>

	* R/source.R (DirSource): Improved implementation especially when
	handling many (> 1M) files.

2009-12-22  Ingo Feinerer  <>

	* R/source.R (getElem.URISource): Use encoding argument.

2009-12-11  Ingo Feinerer  <>

	* R/doc.R (setOldClass): Register S3 document classes to be
	recognized by S4 methods.

2009-11-25  Ingo Feinerer  <>

	* R/matrix.R (termFreq): Add option to remove punctuation

2009-11-19  Ingo Feinerer  <>

	* R/matrix.R (c.TermDocumentMatrix): Added combine method for
	merging multiple term-document matrices.

2009-11-17  Ingo Feinerer  <>

	* R/corpus.R (setOldClass): Register S3 corpus classes to be
	recognized by S4 methods.

	* man/plot.Rd: Use \dontrun{} in \examples{} section in the hope
	that CRAN Mac OS X builds do not fail any longer.

2009-11-15  Ingo Feinerer  <>

	* R/matrix.R (tokenize): Use scan(..., what = "character") instead
	of RWeka::AlphabeticTokenizer() as default.

2009-11-14  Ingo Feinerer  <>

	* R/transform.R (removeWords.PlainTextDocument): Fix bug which
	caused words at the beginning or the end of a line not to be removed. Do
	not delete whitespace anymore.

2009-11-12  Ingo Feinerer  <>

	* R/source.R (DirSource): Default to working directory if no path
	is specified.

2009-11-11  Ingo Feinerer  <>

	* R/source.R (DirSource): Stop on empty directories.

2009-11-07  Ingo Feinerer  <>

	* R/matrix.R (TermDocumentMatrix): Avoid prefixes originating from
	named documents.

2009-10-21  Ingo Feinerer  <>

	* R/transform.R (removeWords): Improve regular expressions.

2009-10-19  Ingo Feinerer  <>

	* R/meta.R (DublinCore): Allow lower case tags.

2009-10-09  Ingo Feinerer  <>

	* R/source.R (GmaneSource, ReutersSource): Use xmlChildren(x)
	instead of x$children.

2009-09-15  Ingo Feinerer  <>

	* R/preprocess.R (preprocessReut21578XML): Fix generated file names.

2009-09-06  Ingo Feinerer  <>

	* R/: Use S3 instead of S4 class system.

2009-08-11  Ingo Feinerer  <>

	* R/reader.R (readMail): Moved to tm.plugin.mail package.

2009-07-04  Ingo Feinerer  <>

	* R/reader.R (readNewsgroup): Rename to readMail as newsgroup
	postings are basically e-mails with some extra headers.

2009-07-03  Ingo Feinerer  <>

	* R/transform.R: Move convertMboxEml, removeCitation,
	removeMultipart, and removeSignature to the tm.plugin.mail package
	since they are mainly utility functions (for handling e-mails) and
	not very framework specific.

2009-06-28  Ingo Feinerer  <>

	* man/: Fix documentation.

2009-06-26  Ingo Feinerer  <>

	* R/reader.R (readReut21578XMLasPlain): New reader which returns a
	plain text document instead of an XML document for texts of the
	Reuters-21578 dataset.

	* R/sparse.R: Removed since the slam package is now available on

	* DESCRIPTION (Depends): Add slam package.

2009-06-17  Ingo Feinerer  <>

	* R/transform.R (stemDoc): Fix character(0) handling.

2009-06-12  Ingo Feinerer  <>

	* R/doc.R (show): Pretty print.

2009-05-27  Ingo Feinerer  <>

	* R/matrix.R (print.TermDocumentMatrix): Handle empty matrices

2009-05-13  Ingo Feinerer  <>

	* R/corpus.R: Make corpus virtual. Implement corpus with standard
	and permanent storage semantics.

	* DESCRIPTION: New major release. A *lot* of improvements.

2009-05-04  Ingo Feinerer  <>

	* NAMESPACE: Export some simple_triplet_matrix functions.

2009-04-28  Ingo Feinerer  <>

	* R/weight.R: Adapt tf-idf to new matrix format.

2009-04-27  Ingo Feinerer  <>

	* R/matrix.R: Create two distinct classes for term-document and
	document-term matrices.

2009-04-26  Ingo Feinerer  <>

	* R/termdocmatrix.R: No longer use Matrix package. This reduces
	package start-up time significantly.

2009-04-11  Ingo Feinerer  <>

	* inst/doc/tm.Rnw: Fix code/documentation mismatch.

2009-04-04  Ingo Feinerer  <>

	* R/transform.R (tmReduce): Combine multiple maps into one

2009-04-03  Ingo Feinerer  <>

	* R/weight.R: Remove weightLogical since it does not return a

	* R/termdocmatrix.R: Removed TermDocMatrix. Use DocumentTermMatrix
	or TermDocumentMatrix instead.

2009-03-28  Ingo Feinerer  <>

	* inst/doc/extensions.Rnw: Finished vignette.

2009-03-27  Ingo Feinerer  <>

	* R/termdocmatrix.R: Start to work on new TermDocumentMatrix and
	DocumentTermMatrix representations.

2009-03-23  Ingo Feinerer  <>

	* R/reader.R (readXML): New reader for arbitrary XML files.

2009-03-22  Ingo Feinerer  <>

	* R/source.R (CSVSource): Defunct (use DataframeSource instead).
	(XMLSource): New XMLSource class for arbitrary XML files.
	(Source): New slot Vectorized.

2009-03-21  Ingo Feinerer  <>

	* R/reader.R (readTabular): Experimental reader for tabular data
	structures which can be customized via user-defined mappings.

	* R/reader.R: Always use UTC time zone.

	* R/AAA.R (.onLoad): No longer try to start a MPI cluster.

2009-03-20  Ingo Feinerer  <>

	* R/reader.R (readDOC): Options can be passed over to antiword.

	* R/reader.R (readPDF): Options can be passed over to pdfinfo and

2009-03-10  Ingo Feinerer  <>

	* R/source.R (DirSource): Add pattern and arguments
	which are internally passed over to list.files().

2009-03-02  Ingo Feinerer  <>

	* inst/doc/tm.Rnw: Suppress pointless loading message.

2009-01-29  Ingo Feinerer  <>

	* DESCRIPTION: Speed up package loading (via moving packages not
	strictly necessary for normal operation to Suggests instead of

2009-01-08  Ingo Feinerer  <>

	* R/reader.R (readNewsgroup): The date format is now configurable.

2008-12-20  Ingo Feinerer  <>

	* R/preprocess.R (convertMboxEml): Fix off-by-one error.

2008-12-16  Ingo Feinerer  <>

	* R/termdocmatrix.R (TermDocMatrix): Sort row indices.

2008-12-06  Ingo Feinerer  <>

	* R/source.R (DataframeSource): New source class for data frames.

	* R/source.R: Fixed non-standard call evaluation.

2008-11-29  Ingo Feinerer  <>

	* R/source.R (URISource): New source class for a single document.

2008-11-27  Ingo Feinerer  <>

	* R/source.R: Refactoring.

2008-11-25  Ingo Feinerer  <>

	* R/AAA.R (.onLoad, .Last): Use tryCatch() to handle misconfigured
	Rmpi installations more gracefully.

2008-11-08  Ingo Feinerer  <>

	* R/source.R (Source): Add Length slot.

2008-11-06  Ingo Feinerer  <>

	* R/AAA.R: Unify duplicated .onLoad function.

2008-11-03  Ingo Feinerer  <>

	* DESCRIPTION (Suggests): Added Rmpi.

2008-11-02  Ingo Feinerer  <>

	* R/source.R (getElem): Fix 'no visible binding' warning.

	* man/WeightFunction.Rd: Fix signature.

2008-08-03  Ingo Feinerer  <>

	* R/weight.R: Introduce name abbreviations for weighting functions.

2008-07-24  Ingo Feinerer  <>

	* R/AAA.R (.onLoad, .Last): Start and stop MPI cluster.

	* R/cluster.R: Provide convenience functions for using a MPI

	* R/termdocmatrix.R (TermDocMatrix): Use MPI cluster if

	* R/textdoccol.R (tmIndex, tmFilter, tmMap): Use MPI cluster if

2008-07-17  Ingo Feinerer  <>

	* R/textdoccol.R (lapply): Removed debug print out.

2008-06-06  Ingo Feinerer  <>

	* R/reader.R (readRCV1): Improved metadata extraction from
	Reuters Corpus Volume 1 documents.

2008-05-25  Ingo Feinerer  <>

	* R/transform.R: Ensure that all mappings preserve multiline

2008-05-24  Ingo Feinerer  <>

	* R/filter.R: Every filter now has an attribute indicating whether
	it should be applied at document level (doclevel).

	* R/textdoccol.R (tmFilter): Set searchFullText as new default

2008-04-23  Ingo Feinerer  <>

	* R/transform.R (replacePatterns): Replaced removeWords by
	replacePatterns. Suggested by Christian Buchta.

	* R/textdoccol.R (inspect): Improved formatting.

2008-04-19  Ingo Feinerer  <>

	* inst/CITATION: Updated JSS article information.

	* R/textdoccol.R (setAs): Added coerce method from list to

	* R/meta.R (meta): Improved metadata handling.

2008-03-21  Ingo Feinerer  <>

	* R/textdoccol.R (materialize, tmMap): Improvements suggested by
	Christian Buchta.

	* inst/CITATION: Added template to include JSS article reference.

2008-03-12  Ingo Feinerer  <>

	* R/textdoccol.R (tmMap): Introduced lazy mapping.

	* R/source.R: Added VectorSource.

2008-02-23  Ingo Feinerer  <>

	* man/: Language codes should be in ISO 639-1 format.

	* R/textdoccol.R (asPlain): Preserve local metadata.

2008-01-31  Ingo Feinerer  <>

	* R/textdoccol.R (writeCorpus): Function for writing a corpus
	containing plain text documents to disk.

2008-01-30  Ingo Feinerer  <>

	* R/termdocmatrix.R (TermDocMatrix): Ensure that dimnames are
	always set correctly.

	* R/textdoccol.R: Set load = TRUE as default for load on demand
	since in most cases this is the wanted behaviour.

2008-01-24  Ingo Feinerer  <>

	* R/: Renamed TextDocCol to Corpus, and Corpus to Content.

	* DESCRIPTION: Updated Version to 0.3 due to core name changes.

2008-01-22  Ingo Feinerer  <>

	* R/meta.R (meta): New function for consistent access to metadata
	of document collections, repositories, and texts.

2008-01-21  Ingo Feinerer  <>

	* R/: Better support for encodings.

2008-01-13  Ingo Feinerer  <>

	* R/textdoccol.R (TextDocCol): Fixed bug regarding default reader
	selection when no reader argument is given.

2008-01-05  Ingo Feinerer  <>

	* R/source.R (CSVSource): Now uses read.csv instead of scan

2008-01-02  Ingo Feinerer  <>

	* R/reader.R (getReaders): Returns available reader functions.

	* R/termdocmatrix.R (TermDocMatrix): Set new modular constructor
	as default.

2007-12-02  Ingo Feinerer  <>

	* R/stopwords.R (stopwords): Shortened code, removed codetools
	variable warnings.

	* man/: Documentation for showMeta, added an example for tmMap.

	* inst/doc/tm.Rnw: Updated vignette, comments on MS word reader,
	some minor typos fixed.

2007-12-01  Ingo Feinerer  <>

	* R/aobjects.R (showMeta): Added method for pretty printing a
	text document's metadata.

2007-11-29  Ingo Feinerer  <>

	* R/textdoccol.R (TextDocCol): Better handling of empty

	* NAMESPACE: Exported readDOC.

	* man/completeStems.Rd: Added an example.

2007-11-18  Ingo Feinerer  <>

	* R/stopwords.R (stopwords): Look up .dat files at every
	call. Allows users to modify stopword .dat files interactively.

2007-11-06  Ingo Feinerer  <>

	* R/termdocmatrix.R (termFreq): Correct processing of empty

2007-10-27  Ingo Feinerer  <>

	* man/: Updated documentation.

2007-10-21  Ingo Feinerer  <>

	* R/complete.R (completeStems): Completes (heuristically) word

	* R/termdocmatrix.R (TermDocMatrix2): New modular

	* NAMESPACE: Exported termFreq.

2007-10-16  Ingo Feinerer  <>

	* R/reader.R (readDOC): Added MS Word reader (using antiword).

2007-10-14  Ingo Feinerer  <>

	* R/weight.R: Weighting functions for TermDocMatrix.

2007-10-13  Ingo Feinerer  <>

	* R/termdocmatrix.R (dimnames, colnames, rownames): Wrapper
	functions for accessing dimension, column, and row names.

	* R/plot.R (plot.TermDocMatrix): Plot correlations between terms.

2007-09-11  Ingo Feinerer  <>

	* man/removePunctuation.Rd: Added documentation. Function also exported to NAMESPACE.

2007-08-28  Ingo Feinerer  <>

	* R/fungen.R: Use S4 class for function generators instead of S3 attributes.

2007-07-29  Ingo Feinerer  <>

	* R/reader.R (readPDF): Removed manual checks for pdftotext and
	pdfinfo. The system call gives a warning anyway.

2007-07-28  Ingo Feinerer  <>

	* R/textdoccol.R (asPlain): Conversion from
	StructuredTextDocuments to PlainTextDocuments.

2007-07-21  Ingo Feinerer  <>

	* R/termdocmatrix.R: Added convenience methods ("[", nrow, ncol)
	for accessing term-document matrices.

	* inst/doc/tm.Rnw: readPDF is only called if pdftotext and pdfinfo
	are installed.

2007-07-17  Ingo Feinerer  <>

	* R/termdocmatrix.R (TermDocMatrix): Improved efficiency. Kudos to
	Christian Buchta.

2007-07-15  Ingo Feinerer  <>

	* inst/doc/tm.Rnw: Update vignette (readPDF, readHTML, preprocessReut21578XML).

2007-07-14  Ingo Feinerer  <>

	* R/reader.R (readHTML): Added very simple HTML reader to obtain StructuredTextDocuments.

	* R/reader.R (readPDF): Added PDF reader.

2007-07-13  Ingo Feinerer  <>

	* DESCRIPTION: Moved proxy from Depends to Imports to avoid name clashes.

	* inst/stopwords/english.dat: Added the term "yes" to stopwords.

	* R/termdocmatrix.R (dim): dim function for TermDocMatrix.

	* R/preprocess.R (convertMboxEml): Accepts gzipped mboxes.

2007-07-11  Ingo Feinerer  <>

	* R/distmeasure.R (dissimilarity): Replaced dists call from
	package cba by new dist call from package proxy.

2007-07-10  Ingo Feinerer  <>

	* inst/doc/tm.Rnw: Described removeSparseTerms and Dictionary.

2007-06-21  Ingo Feinerer  <>

	* R/termdocmatrix.R: require() uses the quietly option to suppress
	loading messages.

2007-06-12  Ingo Feinerer  <>

	* R/dictionary.R: Added dictionary support.

2007-06-07  Ingo Feinerer  <>

	* R/aobjects.R: Added classes for Reuters21578 XML and RCV1
	documents. This simplifies some functions, e.g., asPlain.

2007-06-06  Ingo Feinerer  <>

	* inst/doc/tm.Rnw: Fixed some typos in vignette.

2007-06-03  Ingo Feinerer  <>

	* R/textdoccol.R (replaceWords): Added method to replace a set of
	words by a single word. Useful for synonyms.

2007-05-22  Ingo Feinerer  <>

	* man/TermDocMatrix.Rd: Fixed documentation on Data slot.

2007-05-19  Ingo Feinerer  <>

	* R/termdocmatrix.R (textvector): Small fix for dealing with empty
	vectors. Thanks to Ariel Maguyon for his error report.
	(removeSparseTerms): New function to remove columns from a
	term-document matrix exceeding a sparse factor.
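
	Removing sparse terms can be sketched as follows. The sketch
	assumes that a term's sparsity is the share of documents in
	which it does not occur and that terms above a maximum sparsity
	are dropped; the exact threshold semantics of
	removeSparseTerms() may differ. It is written in Python with
	hypothetical names and a list-of-dicts representation.

```python
def remove_sparse_terms(docs, max_sparsity):
    """Drop terms that are absent from too many documents.

    docs: list of documents, each a dict mapping term -> count.
    max_sparsity: largest allowed share of documents lacking the term.
    """
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term, count in doc.items():
            if count > 0:
                doc_freq[term] = doc_freq.get(term, 0) + 1
    # doc_freq is empty when there are no documents, so n_docs is
    # never zero inside this comprehension.
    keep = {t for t, df in doc_freq.items() if 1 - df / n_docs <= max_sparsity}
    return [{t: c for t, c in doc.items() if t in keep} for doc in docs]


docs = [{"oil": 2, "price": 1}, {"oil": 1}, {"oil": 3, "opec": 1}, {"price": 1}]
print(remove_sparse_terms(docs, max_sparsity=0.5))
# [{'oil': 2, 'price': 1}, {'oil': 1}, {'oil': 3}, {'price': 1}]
```

	Here "opec" occurs in one of four documents (sparsity 0.75) and
	is removed, while "oil" (0.25) and "price" (0.5) are kept.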

2007-05-15  Ingo Feinerer  <>

	* man/tmUpdate.Rd: Corrected documentation on readerControl parameter.

2007-05-11  Ingo Feinerer  <>

	* man/sFilter.Rd: Corrected documentation on statement format (use
	'==' instead of '=').

2007-05-08  Ingo Feinerer  <>

	* R/aobjects.R (StructuredTextDocument): Inherits from

2007-05-04  Ingo Feinerer  <>

	* R/termdocmatrix.R (findFreqTerms): Perform efficient computation
	on sparse matrices as proposed by Martin Maechler.

2007-04-27  Ingo Feinerer  <>

	* R/textdoccol.R: Removed \code{dbDisconnect} calls since the latest
	\pkg{filehash} version deprecates them.

2007-04-22  Ingo Feinerer  <>

	* R/termdocmatrix.R (textvector): Stemming is now performed before
	erasing stopwords.
	(weightMatrix): Adapted to handle sparse matrices.
	(TermDocMatrix): Sparse matrix is now efficiently built by
	direct stepwise insertion of row values into it.

2007-04-21  Ingo Feinerer  <>

	* DESCRIPTION: Replaced \pkg{filehashSQLite} with \pkg{filehash}
	due to ongoing problems. For our purposes the latter is as useful
	as the replaced package.

2007-04-20  Ingo Feinerer  <>

	* man/TextDocCol.Rd: Replaced \code{readPlain} with \code{object@DefaultReader}.

	* man/TermDocMatrix.Rd: Remove deprecated \code{language} argument.

2007-04-15  Ingo Feinerer  <>

	* R/resolve.R (resolveISOCode): Added ISO 639-1 codes for
	languages with available stopwords.

2007-04-14  Ingo Feinerer  <>

	* inst/doc/tm.Rnw: Minor corrections in the vignette.

2007-04-11  Ingo Feinerer  <>

	* DESCRIPTION: Update to version 0.2, since a lot of new features
	have been integrated.

	* inst/stopwords: Updated existing stopwords and added stopwords
	for various other languages.

2007-04-10  Ingo Feinerer  <>

	* man/: Updated documentation.

	* Work/testDb.R: Script to test database stuff.

	* R/: Fixed various database related bugs. Seems to be rather
	useable now, i.e., consider as alpha status for now.

2007-04-08  Ingo Feinerer  <>

	* R/: Fixed some bugs related to database support.

2007-04-07  Ingo Feinerer  <>

	* man/: Added a lot of examples to the manuals.

2007-04-05  Ingo Feinerer  <>

	* man/: Updated parts of the documentation.

	* R/textdoccol.R (asPlain): Added conversion from newsgroup
	documents to plain text documents.

2007-04-01  Ingo Feinerer  <>

	* R/textdoccol.R: Finished experimental database support. Not yet
	intensively tested.

	* R/source.R: Now each source has a default reader.

	* R/reader.R: \code{FunctionGenerator} is now an attribute, not a
	class anymore.

	* R/plaintextdoc.R: Custom show method for plain text documents.

	* R/aobjects.R: Added a class for structured text documents.

	* R/reader.R: Replaced remaining \code{parser} occurrences with

	* R/textdoccol.R (summary): Indent tags. 

	* R/textdoccol.R (removePunctuation): Transform method to remove
	punctuation marks.

2007-03-21  Ingo Feinerer  <>

	* R/textdoccol.R (sFilter): Simplified sFilter significantly by
	using prescindMeta().

2007-03-18  Ingo Feinerer  <>

	* R/textdoccol.R: Improved database support.

2007-03-16  Ingo Feinerer  <>

	* R/termdocmatrix.R (TermDocMatrix): Uses sparse matrices.

	* R/resolve.R (resolveISOcode): Extracts the language from an ISO
	language code.

	* R/textdoccol.R (TextDocCol): Refactored several parser arguments
	into parserControl argument.

	* R/aobjects.R (TextDocument): Introduced the "Language" slot.

2007-03-14  Ingo Feinerer  <>

	* Work/tmDataSetup.R: The datasets acq and crude can now be
	created on the fly.

	* R/stopwords.R: Introduced a function returning the stopwords for
	a given language (English, German and French at the moment)

	* R/textdoccol.R (stemDoc): Stemming uses Rstem if available,
	otherwise falls back to Snowball package.

2007-01-30  Ingo Feinerer  <>

	* man/dissimilarity-methods.Rd: Make clear that any method offered
	by "dists" from package "cba" can be used.

2007-01-22  Ingo Feinerer  <>

	* inst/doc/tm.Rnw: Fixed quotes-appearing-as-boxes-bug according
	to Kurt's latex suggestion. Removed points and underscores in
	variable names for consistent naming.

	* DESCRIPTION: Update to version 0.1-2.

	* man/TextRepository.Rd: Fixed bug in documentation.

2007-01-12  Ingo Feinerer  <>

	* DESCRIPTION: Update to version 0.1-1.

2007-01-09  Ingo Feinerer  <>

	* R/textdoccol.R (stemDoc): Use Rstem::wordStem instead of

2007-01-06  Ingo Feinerer  <>

	* R/: Changes due to Kurt's review.

2006-12-31  Ingo Feinerer  <>

	* R/: Implemented improvements based upon comments by David

2006-12-17  Ingo Feinerer  <>

	* inst/doc/: Rewrote vignette.

	* man/: Improved documentation.

2006-12-16  Ingo Feinerer  <>

	* man/: Updated documentation.

	* DESCRIPTION: Changed package name to "tm". Updated version to
	0.1 for first CRAN release.

	* inst/texts/gmane.comp.lang.r.general.mbox: mbox Gmane R mailing
	list archive example.

	* inst/texts/ RSS Gmane R mailing list
	archive example.

	* R/preprocess.R (convert_mbox_eml): A simple e-mail converter
	from (several mails per box) mbox format to (single mail per file)
	eml format.

2006-12-08  Ingo Feinerer  <>

	* data/crude.rda: Rebuilt.

	* data/acq.rda: Rebuilt.

	* R/reader.R: Factored out reader and parser methods from

	* R/source.R: Factored out Source methods from aobjects.R and
	(GmaneRSource): Encapsulates Gmane R mailing list archive RSS

	* R/textdoccol.R (DirSource): Added support for recursive
	traversal of directories.

2006-12-07  Ingo Feinerer  <>

	* R/textdoccol.R ([[): Loads the document corpus automatically
	into memory upon access.
	(tm_transform, tm_filter): Removed several checks whether the
	document is already loaded ([[ ensures this now).
	(gmane_r_reader): Reader for RSS feeds as provided by the Gmane R
	mailing list archive.

2006-12-06  Ingo Feinerer  <>

	* R/aobjects.R (TextDocument): Is now a virtual class.
	(Source): Is now a virtual class.

2006-12-05  Ingo Feinerer  <>

	* R/textdoccol.R (c): Support for an arbitrary number of document

2006-11-26  Ingo Feinerer  <>

	* R/textrepo.R: Updated TextRepository (constructor), append_elem,
	append_meta and remove_meta.

	* R/textdoccol.R: Removed modify_metadata method.

	* R/textrepo.R: Removed modify_metadata method.

	* R/textdoccol.R (remove_meta): Supports removal of document
	collection metadata and document (= in data frame) metadata.

2006-11-23  Ingo Feinerer  <>

	* R/textdoccol.R (append_doc): Bug fix for handling empty metadata.

	* data/crude.rda: Rebuilt.

	* data/acq.rda: Rebuilt.

	* inst/doc/textmin.Rnw: Updated vignette to reflect code changes.

	* R/textdoccol.R ([): Bug fix for subsetting a document
	collection's data frame.

2006-11-22  Ingo Feinerer  <>

	* R/textdoccol.R: Bug fixes in s_filter. Added full query support
	to s_filter.

	* R/textdoccol.R: Local text documents' metadata can now be copied
	to a document collection's data frame with prescind_meta.

2006-11-21  Ingo Feinerer  <>

	* R/: Text documents' slot metadata is now accessible in s_filter.

	* R/: Rewrote s_filter function (has still some restrictions).

2006-11-20  Ingo Feinerer  <>

	* R/: Various fixes in handling metadata.

	* R/: Added update mechanism for text document collections.

2006-11-19  Ingo Feinerer  <>

	* R/: Merging of document collections now creates a binary tree
	for reconstructing merged document collections.

	* R/: Redesign of metadata for document collections.

2006-11-07  Ingo Feinerer  <>

	* R/: Messages now use \code{ngettext}.

2006-11-03  Ingo Feinerer  <>

	* R/: Added functions for modifying and removing metadata.

2006-11-01  Ingo Feinerer  <>

	* man/: Updated some documentation.

	* R/: Corrected some connection issues.

	* inst/doc: Worked on the vignette.

2006-10-31  Ingo Feinerer  <>

	* inst/: Added texts and started vignette.

	* R/: Final changes based upon David's comments.

2006-10-29  Ingo Feinerer  <>

	* NAMESPACE: Corrected exports (generic methods need exportMethods

2006-10-26  Ingo Feinerer  <>

	* R/: Modified the TextDocCol constructor and various parsers. It
	is now modular and supports various file formats via plugins (see
	the new "Source" class).

2006-10-24  Ingo Feinerer  <>

	* man/: Revised documentation after previous code changes.

2006-10-23  Ingo Feinerer  <>

	* R/: Remaining changes as discussed with David.

2006-10-22  Ingo Feinerer  <>

	* R/: Some changes as suggested by David. The rest will follow
	within the next days.

2006-09-26  Ingo Feinerer  <>

	* man/: Finished documentation.

2006-09-25  Ingo Feinerer  <>

	* man/: Wrote some documentation.

2006-09-24  Ingo Feinerer  <>

	* R/: Further syntactic sugar in form of additional assignment and
	accessor methods.

2006-09-13  Ingo Feinerer  <>

	* R/: Syntactic sugar in form of "length", "show" and "summary"

2006-08-24  Ingo Feinerer  <>

	* R/: Diverse updates. Mainly on default operators ("[" or "c")
	and dissimilarities.

2006-08-12  Ingo Feinerer  <>

	* R/: Added similarity functions.

	* data/: Added english stopwords.

2006-08-07  Ingo Feinerer  <>

	* data/: Examples compiled for new features

	* R/: Changes due to new structure.

	* NAMESPACE: Corrected namespace to reflect new structure.

	* R/termdocmatrix.R: Adapted for new naming scheme.

2006-08-06  Ingo Feinerer  <>

	* R/textdoccol.R: Adapted code for new class structure. Wrote
	several transform and filter functions operating on text document
	collections (alias text document databases).

	* R/aobjects.R: Adapted class structure with inheritance,
	repositories and additional metadata. Loading files on demand is
	now possible.

2006-07-13  Ingo Feinerer  <>

	* R/: Some cosmetic cleanups.

	* inst/: Removed vignette on clustering. That and much more is now
	described in the JSS paper on text mining. Based upon that
	article an elaborated vignette will be incorporated in the future.

2006-07-01  Ingo Feinerer  <>

	* R/: Updated generic S4 methods to comply with signature changes
	in newer versions of R (> 2.3)

2006-03-12  Ingo Feinerer  <>

	* ext/R/importRIS.R: Automatic RIS import is now possible.

2006-02-14  Ingo Feinerer  <>

	* R/textdoccol.R: Added RIS HTML input format.

2006-01-21  Ingo Feinerer  <>

	* R/textdoccol.R: Removed bug that caused invalid text document
	collections when handling many input files.

2006-01-11  Ingo Feinerer  <>

	* R/textdoccol.R: Restructured and extended file import

	* inst/doc/clustering.Rnw: Adapted vignette for use with

	* man/ReutNews.Rd: Documentation for ReutNews.rda

	* data/ReutNews.rda: A tiny Reuters21578 example data set.

2005-12-22  Ingo Feinerer  <>

	* inst/doc/clustering.Rnw: Wrote a small vignette to present the
	clustering facilities of this package.

2005-12-15  Ingo Feinerer  <>

	* R/aobjects.R: Changed package document structure to avoid class
	dependency problems.

2005-12-06  Ingo Feinerer  <>

	*  Wrote a script for the ModLewis Split for the Reuters-21578 XML
	data set.

	*  Finished documentation and reordered directory structure. Now "R
	CMD check textmin" works without errors.

2005-12-04  Ingo Feinerer  <>

	* src/: Various splits can now be easily created for the
	Reuters21578 data set.

2005-12-03  Ingo Feinerer  <>

	*  Updated documentation

2005-11-30  Ingo Feinerer  <>

	*  Wrote R documentation for some classes and methods.

2005-11-19  Ingo Feinerer  <>

	* R/textdoccol.R: Constructor of textdoccol allows import of CSV
	files. See the questionnaire data/Umfrage.csv for such an example.
	We are now able to import files in Reuters-21578 XML format.

	*  Changed class interfaces in various files. Weighting of the text
	matrix is now possible.

2005-11-08  Ingo Feinerer  <>

	* R/textdoccol.R: One can build term-document matrices if
	necessary (with buildTDM(...)) and fill the field tdm from a text
	document collection with it.

	* R/textmatrix.R: Wrote S4 class for term-document matrices.

2005-11-06  Ingo Feinerer  <>

	* R/textdoccol.R: We now can read in a whole XML file with several
	news items.

2005-11-05  Ingo Feinerer  <>

	* R/textdoccol.R: Set up an S4 class for a collection of text
	documents. A first attempt to read in XML input (like the RCV1
	set) was made.

	* R/textdocument.R: Set up an S4 class for text documents. Wrote
	some accessor functions.

	* data/newsitem.xml: Added this XML file for testing purposes. It
	contains a single news item from the Reuters Corpus Volume 1
	(RCV1) XML set.

2005-10-07  Ingo Feinerer  <>

	* R/textmatrix.R (textmatrix): Removed the transpose of the original
	textmatrix as k-means clustering provided by R (kmeans) now works on
	this textmatrix. The result is a k-means text clustering with a
	similarity measure based upon word frequencies.

2005-10-05  Ingo Feinerer  <>

	* R/textmatrix.R: Adapted the preprocessing code from the R
	package "lsa" written by Fridolin Wild to build a document text matrix.

2005-10-02  Ingo Feinerer  <>

	*  Set up the R Text Mining Package infrastructure.