[tm] Log of /pkg/R/transform.R

Revision 1445
Modified Sun Oct 9 09:30:58 2016 UTC (2 years, 4 months ago) by feinerer
File length: 3709 byte(s)
Diff to previous 1443
Speed up termFreq(), general cleanup

- Avoid parallel::mclapply()
- Use custom .table()
- Use rep.int(), rep_len() and lengths()
- Fix typos
- Shorten overlong lines
- Consistent formatting
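
The base R helpers named in this entry can be illustrated with a small sketch. It is not the tm implementation; the `tokens` list and the variable names are hypothetical and exist only to show what rep.int(), rep_len() and lengths() do.

```r
# Hypothetical tokenized documents; not data from the tm package.
tokens <- list(doc1 = c("oil", "price", "oil"), doc2 = c("price"))

# lengths() returns the length of each list element without an sapply() call.
n <- lengths(tokens)

# rep.int() repeats each document index once per token in that document.
doc_index <- rep.int(seq_along(tokens), n)

# rep_len() recycles a vector to an exact target length.
ones <- rep_len(1L, length(doc_index))

# Recover the per-document token counts from the flat index.
counts <- tabulate(doc_index, nbins = length(tokens))
```

Flattening to one index vector like this is a common way to avoid per-document loops when counting terms.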

Revision 1443
Modified Mon Aug 22 11:26:41 2016 UTC (2 years, 5 months ago) by feinerer
File length: 3713 byte(s)
Diff to previous 1438
Process all arguments in tm_map.SimpleCorpus()

Revision 1438
Modified Sat Jul 16 18:32:59 2016 UTC (2 years, 7 months ago) by feinerer
File length: 3708 byte(s)
Diff to previous 1437
Use Rcpp for efficient term-document matrix construction from a SimpleCorpus

Revision 1437
Modified Wed Jul 13 19:23:49 2016 UTC (2 years, 7 months ago) by feinerer
File length: 3697 byte(s)
Diff to previous 1413
Add SimpleCorpus

SimpleCorpus provides a corpus which is optimized for the most common usage
scenario: importing plain texts from files in a directory or directly from a
vector in R, preprocessing and transforming the texts, and finally exporting
them to a term-document matrix. The aim is to boost performance and minimize
memory pressure. It loads all documents into memory, and is designed for
medium-sized to large data sets.
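
The usage scenario described above might look roughly like the following. This is a minimal sketch that assumes the tm package is installed; the two example texts are made up, and the choice of transformations is only illustrative.

```r
library(tm)  # assumption: the tm package is installed

# Hypothetical in-memory texts; a directory of plain-text files could be
# read with DirSource() instead of VectorSource().
texts <- c("Oil prices rose sharply.", "Prices of crude oil fell.")

# Build the corpus directly from a character vector.
corp <- SimpleCorpus(VectorSource(texts))

# Preprocess: these functions operate on character vectors, which is
# what a SimpleCorpus stores.
corp <- tm_map(corp, tolower)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)

# Export to a term-document matrix.
tdm <- TermDocumentMatrix(corp)
```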

Revision 1413
Modified Sat Apr 4 08:21:38 2015 UTC (3 years, 10 months ago) by feinerer
File length: 3609 byte(s)
Diff to previous 1369
Correctly process words being truncations of others

Revision 1369
Modified Tue Apr 29 07:42:53 2014 UTC (4 years, 9 months ago) by feinerer
File length: 3567 byte(s)
Diff to previous 1358
Fallback to English if meta(doc, "language") is invalid

Revision 1358
Modified Thu Apr 24 07:43:38 2014 UTC (4 years, 9 months ago) by feinerer
File length: 3378 byte(s)
Diff to previous 1354
Document content_transformer()

Revision 1354
Modified Wed Apr 23 16:25:03 2014 UTC (4 years, 9 months ago) by khornik
File length: 3337 byte(s)
Diff to previous 1332
Nullify rather than remove the 'lazy' component if everything is materialized.

Revision 1332
Modified Fri Apr 18 09:00:55 2014 UTC (4 years, 10 months ago) by feinerer
File length: 3328 byte(s)
Diff to previous 1315
Update TextDocument documentation

Revision 1315
Modified Mon Mar 31 08:38:05 2014 UTC (4 years, 10 months ago) by feinerer
File length: 3352 byte(s)
Diff to previous 1314
Simplify tm_map, tm_filter, and tm_index; remove makeChunks; rework lazy maps

Revision 1314
Modified Sun Mar 30 09:33:30 2014 UTC (4 years, 10 months ago) by feinerer
File length: 4457 byte(s)
Diff to previous 1313
Fixes

Revision 1313
Modified Sun Mar 30 09:28:00 2014 UTC (4 years, 10 months ago) by feinerer
File length: 4457 byte(s)
Diff to previous 1311
content() and as.list() now give the full documents

Revision 1311
Modified Thu Mar 27 14:15:08 2014 UTC (4 years, 10 months ago) by feinerer
File length: 4673 byte(s)
Diff to previous 1310
Some bug fixes

Revision 1310
Modified Wed Mar 26 19:23:13 2014 UTC (4 years, 10 months ago) by feinerer
File length: 4664 byte(s)
Diff to previous 1307
Remove text repository, various improvements and bug fixes

Revision 1307
Modified Tue Mar 25 12:15:51 2014 UTC (4 years, 10 months ago) by feinerer
File length: 4636 byte(s)
Diff to previous 1301
Redesign corpora

Revision 1301
Modified Sat Mar 22 09:52:06 2014 UTC (4 years, 11 months ago) by feinerer
File length: 4605 byte(s)
Diff to previous 1300
Bug fixes to get rid of R CMD check errors

Revision 1300
Modified Fri Mar 21 14:30:05 2014 UTC (4 years, 11 months ago) by feinerer
File length: 4596 byte(s)
Diff to previous 1227
Redesign text documents

This is a major change and causes fallout. Soon to be fixed ...

Revision 1227
Modified Sun Jun 16 08:37:10 2013 UTC (5 years, 8 months ago) by feinerer
File length: 4784 byte(s)
Diff to previous 1220
Use package parallel instead of Rmpi and snow

Revision 1220
Modified Tue Jun 11 08:37:43 2013 UTC (5 years, 8 months ago) by feinerer
File length: 5029 byte(s)
Diff to previous 1215
Use SnowballC instead of Snowball and RWeka

Revision 1215
Modified Thu Apr 11 07:42:03 2013 UTC (5 years, 10 months ago) by feinerer
File length: 5079 byte(s)
Diff to previous 1188
Use PCRE UCP for expressions containing \b
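
For context, a small base R sketch of why UCP matters for \b. It is not taken from the tm sources; the sample string and pattern are made up. With perl = TRUE, \w and \b consider only ASCII letters by default, so a non-ASCII letter can look like a word boundary. Prefixing the pattern with (*UCP) switches PCRE to Unicode properties.

```r
# Hypothetical sample text with non-ASCII letters (written as escapes so
# the string is marked as UTF-8 regardless of the source file encoding).
x <- "\u00fcber der Br\u00fccke"

# Default PCRE behaviour: the u-umlaut is not treated as a word character,
# so a boundary may be seen between it and the following "b".
ascii_only <- gsub("\\bber\\b", "", x, perl = TRUE)

# With (*UCP) the u-umlaut counts as a word character, so "ber" inside
# the longer word is not matched as a separate word.
with_ucp <- gsub("(*UCP)\\bber\\b", "", x, perl = TRUE)
```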

Revision 1188
Modified Fri Jul 27 08:47:50 2012 UTC (6 years, 6 months ago) by feinerer
File length: 5073 byte(s)
Diff to previous 1163
Allow more simultaneous (stop)words in removeWords()

Revision 1163
Modified Wed Dec 7 08:27:47 2011 UTC (7 years, 2 months ago) by feinerer
File length: 5060 byte(s)
Diff to previous 1161
Export stripWhitespace.character() function

Revision 1161
Modified Wed Dec 7 06:10:32 2011 UTC (7 years, 2 months ago) by feinerer
File length: 5031 byte(s)
Diff to previous 1159
Add option to removePunctuation() to preserve intra-word dashes
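
One way such an option can work is sketched below in base R. This is an illustration under assumptions, not the code committed in this revision: the helper name `strip_punct_keep_dashes` is hypothetical, and the placeholder approach is just one possible technique. In released versions of tm the corresponding argument is `preserve_intra_word_dashes`.

```r
# Hypothetical helper: remove punctuation but keep dashes between word
# characters, using a control character as a temporary placeholder.
strip_punct_keep_dashes <- function(x) {
  # Protect dashes that sit between two word characters.
  x <- gsub("(\\w)-(\\w)", "\\1\001\\2", x)
  # Remove all remaining punctuation, including free-standing dashes.
  x <- gsub("[[:punct:]]+", "", x)
  # Restore the protected dashes.
  gsub("\001", "-", x, fixed = TRUE)
}

cleaned <- strip_punct_keep_dashes("A well-known e-mail -- really!")
```

Note that this simple pattern handles only one dash per pair of adjacent word characters, so chains such as "a-b-c" are not fully preserved.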

Revision 1159
Modified Tue Dec 6 15:11:45 2011 UTC (7 years, 2 months ago) by feinerer
File length: 5118 byte(s)
Diff to previous 1153
Make termFreq() sensitive to the order of control options

Revision 1153
Modified Thu Nov 17 15:45:31 2011 UTC (7 years, 3 months ago) by feinerer
File length: 5035 byte(s)
Diff to previous 1122
Add SMART stopword list

Revision 1122
Modified Sun Feb 20 07:38:31 2011 UTC (8 years ago) by feinerer
File length: 5026 byte(s)
Diff to previous 1121
Use document language for stemDocument().

Revision 1121
Modified Thu Feb 17 17:13:45 2011 UTC (8 years ago) by feinerer
File length: 5014 byte(s)
Diff to previous 1096
Bug fix. Use language argument for stemDocument().

Revision 1096
Modified Mon Aug 30 12:04:51 2010 UTC (8 years, 5 months ago) by khornik
File length: 5004 byte(s)
Diff to previous 1084
Add commented possible removePunctuation() enhancement.

Revision 1084
Modified Fri Aug 6 21:47:23 2010 UTC (8 years, 6 months ago) by feinerer
File length: 4599 byte(s)
Diff to previous 1039
Remove convert_UTF_8() (use enc2utf8() instead)
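
enc2utf8() is a base R function, so a package-specific converter is not needed for this task. A minimal sketch with a made-up Latin-1 string:

```r
# Hypothetical input: "cafe" with an accented e, stored as a single
# Latin-1 byte and declared as such.
x <- "caf\xe9"
Encoding(x) <- "latin1"

# Convert to UTF-8; the accented character now takes two bytes.
y <- enc2utf8(x)

enc <- Encoding(y)                 # declared encoding of the result
n_bytes <- nchar(y, type = "bytes")
n_chars <- nchar(y, type = "chars")
```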

Revision 1039
Modified Fri Jan 22 13:01:33 2010 UTC (9 years, 1 month ago) by feinerer
File length: 4718 byte(s)
Diff to previous 1023
Add stemDocument.character().

Revision 1023
Modified Wed Nov 25 06:08:20 2009 UTC (9 years, 2 months ago) by feinerer
File length: 4668 byte(s)
Diff to previous 1018
Add option to termFreq() to remove punctuation characters.

Revision 1018
Modified Sun Nov 15 15:53:49 2009 UTC (9 years, 3 months ago) by feinerer
File length: 4669 byte(s)
Diff to previous 1013
Fix bug in removeWords(). Refactoring of term-document matrix constructor. Clean up of defunct functions.

Revision 1013
Modified Wed Oct 21 12:34:39 2009 UTC (9 years, 4 months ago) by feinerer
File length: 4691 byte(s)
Diff to previous 1010
Improve regular expressions in removeWords().

Revision 1010
Modified Fri Oct 9 12:48:37 2009 UTC (9 years, 4 months ago) by feinerer
File length: 5002 byte(s)
Diff to previous 1008
Use xmlChildren().

Revision 1008
Modified Tue Sep 15 18:33:02 2009 UTC (9 years, 5 months ago) by feinerer
File length: 4940 byte(s)
Diff to previous 991
Remove unnecessary arguments.

Revision 991
Modified Sat Sep 5 08:59:23 2009 UTC (9 years, 5 months ago) by feinerer
File length: 4945 byte(s)
Diff to previous 988
Update tm vignette. Minor documentation fixes.

Revision 988
Modified Fri Sep 4 12:27:12 2009 UTC (9 years, 5 months ago) by feinerer
File length: 4986 byte(s)
Diff to previous 987
Update documentation.

Revision 987
Modified Wed Sep 2 17:54:45 2009 UTC (9 years, 5 months ago) by feinerer
File length: 4970 byte(s)
Diff to previous 986
Update documentation.

Revision 986
Modified Tue Sep 1 15:33:30 2009 UTC (9 years, 5 months ago) by feinerer
File length: 5015 byte(s)
Diff to previous 985
Further changes due to S3 class system.

Revision 985
Modified Thu Aug 27 18:09:05 2009 UTC (9 years, 5 months ago) by feinerer
File length: 4674 byte(s)
Diff to previous 977
Use S3 instead of S4 class system.

Revision 977
Modified Thu Jul 9 09:29:41 2009 UTC (9 years, 7 months ago) by feinerer
File length: 3870 byte(s)
Diff to previous 976
Fix.

Revision 976
Modified Thu Jul 9 09:23:39 2009 UTC (9 years, 7 months ago) by feinerer
File length: 3868 byte(s)
Diff to previous 972
Conversion to UTF-8 encoding.

Revision 972
Modified Fri Jul 3 16:16:59 2009 UTC (9 years, 7 months ago) by feinerer
File length: 3718 byte(s)
Diff to previous 962
Move removeCitation, removeMultipart, and removeSignature to the tau package.

Revision 962
Modified Sun Jun 28 15:52:33 2009 UTC (9 years, 7 months ago) by feinerer
File length: 6781 byte(s)
Diff to previous 959
Fix documentation.

Revision 959
Modified Wed Jun 17 18:22:35 2009 UTC (9 years, 8 months ago) by feinerer
File length: 6774 byte(s)
Diff to previous 952
Fix character(0) handling in stemDoc().

Revision 952
Modified Mon May 18 13:43:01 2009 UTC (9 years, 9 months ago) by feinerer
File length: 6762 byte(s)
Diff to previous 946
Further work on FCorpus integration.

Revision 946
Modified Wed May 13 18:07:35 2009 UTC (9 years, 9 months ago) by feinerer
File length: 6416 byte(s)
Diff to previous 937
A lot of major improvements (see NEWS).

Revision 937
Modified Thu Apr 16 21:09:49 2009 UTC (9 years, 10 months ago) by feinerer
File length: 6427 byte(s)
Diff to previous 929
Documentation update. Remove some require() calls.

Revision 929
Modified Thu Apr 9 06:22:21 2009 UTC (9 years, 10 months ago) by feinerer
File length: 6461 byte(s)
Diff to previous 926
Always use Snowball for stemming.

Revision 926
Modified Sat Apr 4 06:50:02 2009 UTC (9 years, 10 months ago) by feinerer
File length: 6618 byte(s)
Diff to previous 886
tmReduce() allows combining multiple maps into one transformation.

Revision 886
Modified Thu Jan 29 22:47:34 2009 UTC (10 years ago) by feinerer
File length: 6519 byte(s)
Diff to previous 885
Speed up package loading (Depends -> Suggests).

Revision 885
Modified Thu Jan 29 09:34:44 2009 UTC (10 years ago) by stefan7th
File length: 6463 byte(s)
Diff to previous 884
moved package to /pkg

Revision 884
Modified Wed Jan 28 10:24:27 2009 UTC (10 years ago) by stefan7th
Original Path: pkg/tm/R/transform.R
File length: 6463 byte(s)
Diff to previous 869
R-Forge transition completed

Revision 869
Modified Sat Nov 8 09:16:37 2008 UTC (10 years, 3 months ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 6463 byte(s)
Diff to previous 859
Sources now have a Length slot. Knowing the length in advance makes corpus construction a lot faster (~ 8 times faster).

Revision 859
Modified Wed Jul 9 13:39:52 2008 UTC (10 years, 7 months ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 6445 byte(s)
Diff to previous 858
Improved documentation.

Revision 858
Modified Tue Jul 8 21:08:41 2008 UTC (10 years, 7 months ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 6429 byte(s)
Diff to previous 855
Improved documentation based on comments by David Meyer. More to come.

Revision 855
Modified Sun May 25 15:35:00 2008 UTC (10 years, 8 months ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 6449 byte(s)
Diff to previous 838
Ensure that multiline structures are preserved when using removeWords and stemDoc.

Revision 838
Modified Wed Apr 23 09:45:06 2008 UTC (10 years, 10 months ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 6022 byte(s)
Diff to previous 823
Changed replaceWords to replacePatterns. Suggested by Christian Buchta.

Revision 823
Modified Wed Feb 6 13:47:59 2008 UTC (11 years ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 6057 byte(s)
Diff to previous 816
Added removeNumbers transformation.

Revision 816
Modified Thu Jan 24 14:36:41 2008 UTC (11 years ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 5752 byte(s)
Diff to previous 791
Renamed TextDocCol to Corpus, and Corpus to Content.

Revision 791
Modified Sun Oct 21 11:51:42 2007 UTC (11 years, 4 months ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 5737 byte(s)
Diff to previous 780
New tmIntersect filter.

Revision 780
Added Sat Sep 29 13:24:17 2007 UTC (11 years, 4 months ago) by feinerer
Original Path: trunk/tm/R/transform.R
File length: 5508 byte(s)
Added three transformations often used for e-mail analyses.
