Forum: help

RE: Issue with as.party
By: Achim Zeileis on 2018-05-21 15:57
[forum:45983]
I wouldn't re-encode the entire column, just its levels, e.g.,

levels(datos.rrhh$area) <- stri_enc_toascii(levels(datos.rrhh$area))

or you just transliterate the non-ASCII levels manually (which is what I did).

For a warning in partykit I would need to know what the source of the problem is exactly. It's not the non-ASCII labels per se! As I wrote before: If I properly declare the "latin1" encoding everything works for me out of the box. But I'm running a standard UTF-8 locale and maybe you use some less standard locale or your Weka uses a mismatching locale or ... It's hard for me to say where this goes wrong on your end.
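Until the exact source is known, a check on the user side may help to narrow it down. The following is only a sketch: non_ascii_levels() is a hypothetical helper (not part of partykit), the toy data and its labels are made up, and the test for non-ASCII characters mirrors what tools::showNonASCII() does.

```r
# Hypothetical helper (not part of partykit): for every factor column,
# return the levels that contain non-ASCII characters.
non_ascii_levels <- function(data) {
  is_fac <- vapply(data, is.factor, logical(1))
  out <- lapply(data[is_fac], function(f) {
    lev <- levels(f)
    # Any byte above 127 cannot be converted from latin1 to ASCII
    asc <- iconv(lev, "latin1", "ASCII")
    lev[is.na(asc) | asc != lev]
  })
  # Keep only the columns that actually have offending levels
  out[lengths(out) > 0]
}

# Toy data with made-up labels; "\u00ed" is an accented i
d <- data.frame(
  area = factor(c("Ventas", "Log\u00edstica", "Ventas")),
  n = 1:3
)
bad <- non_ascii_levels(d)
names(bad)
```

Running this on the data before calling J48() should at least show which factor columns are candidates for the mismatch.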

Re: fairly common issue. In which other forum did this issue occur?

RE: Issue with as.party
By: J M on 2018-05-21 15:50
[forum:45982]
Thanks Achim! I was able to solve the issue by forcing the area column to ASCII with this additional line of code:

library(stringi)
datos.rrhh[,c(9)] <- as.factor(stri_enc_toascii(datos.rrhh[,c(9)]))

Is there any way to add an enhancement to partykit that at least warns about the factor variable causing trouble due to the system locale encoding? That would help others on non-ASCII systems to debug.

I will post this response on other forums, because I noticed this is a fairly common issue.

RE: Issue with as.party
By: Achim Zeileis on 2018-05-21 15:19
[forum:45981]
Certainly some mismatch between locale/encoding or something like that. Maybe Weka is not set up using the right locale...

I would recommend using ASCII factor levels for area instead. If you need the latin1-encoded labels later on for plotting you can still change levels(x$data$area) after the conversion with as.party().
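A minimal sketch of that idea, using base R only. The labels and the lookup vectors (original, ascii) are made up for illustration and are not taken from Datos_RRHH.csv; the model fitting and the as.party() call are left out.

```r
# Toy factor with made-up non-ASCII labels
area <- factor(c("Ventas", "Log\u00edstica", "Ventas", "Direcci\u00f3n"))

# Manual lookup table: original label -> ASCII replacement (assumed values)
original <- c("Log\u00edstica", "Direcci\u00f3n")
ascii    <- c("Logistica", "Direccion")

# Recode only the levels, not the whole column
area_ascii <- area
idx <- match(levels(area_ascii), original)
levels(area_ascii)[!is.na(idx)] <- ascii[idx[!is.na(idx)]]

# ... fit the tree and convert it using the ASCII labels ...

# Later, e.g. for plotting, map the ASCII levels back to the originals
area_back <- area_ascii
back <- match(levels(area_back), ascii)
levels(area_back)[!is.na(back)] <- original[back[!is.na(back)]]
```

Keeping the lookup table around means the original labels can be restored at any point without re-reading the CSV.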

RE: Issue with as.party
By: J M on 2018-05-21 15:10
[forum:45980]
Thanks Achim! I don't know why, but even with the code you just sent, declaring the encoding within the read.csv function, I am still getting the same error.

RE: Issue with as.party
By: Achim Zeileis on 2018-05-21 13:04
[forum:45978]
Thanks for the replication code. It turns out that this has nothing to do with the tree depth -- other than that on this specific data set the problem does not occur when "area" is not selected for splitting and this only occurs rather late in the tree.

The problem occurs due to the following:

- You read your CSV, which is in ISO-8859-1 encoding (also known as latin1), without declaring the encoding. Thus, you end up with a version of the data where the non-ASCII levels of the "area" variable are not properly encoded.

- When passing the data to Weka in Java and reading the result back, you receive area labels that have been reencoded in a different way. Hence, matching the labels in the data and in the Weka tree does not work.

- To avoid the problem you can either avoid non-ASCII labels, which is usually the most robust solution because you have to pay less attention to what exactly you are doing.

- Or you properly declare the encoding of your CSV. The following works for me:

datos.rrhh <- read.csv("Datos_RRHH.csv", header = TRUE, sep = ",", encoding = "latin1")
resultJ48 <- J48(se_fue~., datos.rrhh, control = Weka_control(M = 2, C = 0.5))
x <- as.party(resultJ48)
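To illustrate the first point in isolation, here is a small base R sketch. It assumes a UTF-8 session, and the label is made up rather than taken from the CSV. The same latin1 bytes are held once without and once with a declared encoding.

```r
# The bytes of "Logística" as they would appear in a latin1 file
# (0xed is the accented i in ISO-8859-1)
bytes <- as.raw(c(0x4c, 0x6f, 0x67, 0xed, 0x73, 0x74, 0x69, 0x63, 0x61))

# Without a declared encoding, R only knows the raw bytes
undeclared <- rawToChar(bytes)
Encoding(undeclared)   # marked as "unknown"
validUTF8(undeclared)  # the bytes are not valid UTF-8

# Declaring latin1 lets R convert the label correctly
declared <- undeclared
Encoding(declared) <- "latin1"
identical(enc2utf8(declared), "Log\u00edstica")
```

With the declared version, conversions to other encodings are well defined, which is what the encoding argument of read.csv() achieves for the character columns.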

RE: Issue with as.party
By: J M on 2018-05-21 03:09
[forum:45971]

Attachment: Datos_RRHH.csv
Thanks Achim for following up!

##Start of script

library(partykit)
library(RWeka)
datos.rrhh <- read.csv("Datos_RRHH.csv", header = TRUE, sep = ",")

resultJ48 <- J48(se_fue~., datos.rrhh, control = Weka_control(M = 2, C = 0.5))

x <- as.party(resultJ48)

length(x)

##End of script

RE: Issue with as.party
By: J M on 2018-05-21 03:05
[forum:45970]

Attachment: Datos_RRHH.csv
Thanks Achim for following up!

It's important to note that this issue only happens for overfitted trees, while for trees beneath 100 (I guess) it does not happen.

Below a minimal example + dataset attached:

## Beginning of script

library(partykit)
library(RWeka)
datos.rrhh <- read.csv("Datos_RRHH.csv", header = TRUE, sep = ",")

resultJ48 <- J48(se_fue ~ ., datos.rrhh, control = Weka_control(M = 2, C = 0.5))

x <- as.party(resultJ48)

length(x)

## End of script

Thanks!

RE: Issue with as.party
By: Achim Zeileis on 2018-05-18 22:14
[forum:45966]
Could you please provide a _minimal_ and _reproducible_ example?

My first guess would be that this is related to a categorical variable not being appropriately encoded as a factor, but I couldn't verify that.

Issue with as.party
By: J M on 2018-05-18 15:20
[forum:45965]
Hi all, I am receiving the following error while trying to convert deep Weka trees to "party" objects with as.party.

Error in weka_tree_split(i) : all(sapply(split, tail, 1) %in% mf_levels[[var_id]]) is not TRUE

When I increase minNumObj on the leaves, the as.party function works, as a result (I believe) of the tree reducing its depth/size.

Any clues how to fix that?
