
Forum: help

RE: Gini index
By: Achim Zeileis on 2016-03-11 15:02
[forum:43037]
The source of the difference is not the Gini index but the exhaustive maximally selected search (used by rpart) vs. association tests (used by ctree). The latter have more power to pick up small monotonic changes (increases or decreases in the response proportion) while the former is better at picking up abrupt changes that might also jump back and forth.

If you look at

plot(l ~ x, data = data1)
plot(l ~ y, data = data1)

you see that on average there is no association with either x or y (similar to the XOR problem), and hence the default ctree() does not start splitting.
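
For reference, here is a minimal sketch of how such a dataset could be simulated; this is my own guess at the setup, not the actual code from the attached UnimodalBump.R:

set.seed(1)
n <- 2000
## hypothetical simulation: uniform background of class 0 with a tightly
## concentrated "bump" of class 1 near the origin
data1 <- data.frame(x = runif(n, -1, 1), y = runif(n, -1, 1))
bump <- abs(data1$x) < 0.15 & abs(data1$y) < 0.15
data1$l <- factor(ifelse(bump, rbinom(n, 1, 0.9), rbinom(n, 1, 0.05)))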

To overcome this problem in the unbiased setup, one can also use maximally selected conditional inference procedures or other maximally selected test statistics. The former is possible in principle, but we haven't wired it into partykit yet. The latter can be done with binomial GLM trees (MOB), for example:

mb <- glmtree(l ~ x + y, data = data1, family = binomial, minsize = 100, prune = "BIC")
plot(mb)

I'm restricting the minimal node size and pruning based on the BIC to avoid spurious splits due to separation issues. This tree splits somewhat differently from rpart() but is also able to find the bump. I wouldn't overinterpret the differences between the methods because the setup is somewhat artificial. But, of course, the different strategies can lead to different results...
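
For a side-by-side comparison (assuming a data1 like the sketch above), the corresponding rpart() and ctree() fits from the discussion would be:

library("partykit")
library("rpart")
rp <- rpart(l ~ x + y, data = data1)  # exhaustive Gini-based search
ct <- ctree(l ~ x + y, data = data1)  # association tests, typically no split here
rp
plot(ct)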

RE: Gini index
By: Markus Loecher on 2016-03-11 13:35
[forum:43035]

Attachment: UnimodalBump.R
Thanks a lot, this is very helpful; in hindsight, I should have known this from the vignette already!
I wonder how I can easily understand why, e.g., Gini minimization would lead to very different splits than conditional inference does. In the attached R file I simulate a tightly concentrated "bump" of one class embedded within a uniform background of a second class. rpart() partitions the space as expected, but ctree() finds no worthwhile split.
Just thinking out loud, really.

Thanks,
Markus


RE: Gini index
By: Achim Zeileis on 2016-03-10 11:34
[forum:43023]
For learning trees (i.e., selecting split variables and split points), CTree employs conditional inference techniques that do not include Gini or entropy test statistics. Instead, these are t-test- or chi-squared-type statistics (conducted in a permutation test framework). Additionally, MOB offers selection strategies based on log-likelihoods (or similar objective functions).
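
To illustrate the type of statistic involved (not ctree()'s exact internal computation), the association between the response and one candidate split variable can be assessed with a quadratic, i.e., chi-squared-type, permutation test via the coin package:

library("coin")
## chi-squared-type permutation test of independence between the class
## response and a single candidate split variable
independence_test(Species ~ Petal.Length, data = iris, teststat = "quadratic")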

If you just want to compute summary measures for the tree fitted by CTree, you can of course compute anything you want. For example, you can easily obtain the class distribution in each terminal node with:

R> library("partykit")
R> ct <- ctree(Species ~ ., data = iris)
R> tapply(iris$Species, predict(ct, type = "node"), function(y) prop.table(table(y)))
$`2`
y
setosa versicolor virginica
     1          0         0

$`5`
y
    setosa versicolor  virginica
0.00000000 0.97826087 0.02173913

$`6`
y
setosa versicolor virginica
   0.0        0.5       0.5

$`7`
y
    setosa versicolor  virginica
0.00000000 0.02173913 0.97826087

Based on this, you can compute misclassification rates, Gini indices, entropies, etc.
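
For example, a short sketch (the helper names are my own) computing the per-node Gini index and entropy from these proportions:

p <- tapply(iris$Species, predict(ct, type = "node"),
  function(y) prop.table(table(y)))
sapply(p, function(pr) 1 - sum(pr^2))                        # Gini index per node
sapply(p, function(pr) -sum(pr[pr > 0] * log2(pr[pr > 0])))  # entropy per node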

Gini index
By: Markus Loecher on 2016-03-10 10:49
[forum:43022]
I am afraid this question reveals some deeper lack of understanding of ctrees, but is there a way to specify traditional leaf impurity functions (Gini, cross-entropy) for simple classification problems in ctree()?

Thanks!
Markus
