Comments (13)
In the current implementation you have to encode the characters as factors for them to be considered unordered (internally we check for is.factor & !is.ordered). In foo2 the characters are converted to ordered factors, so there should be no computational difference from foo3 and foo4.
You are right to be confused: characters should be considered unordered if respect.unordered.factors = TRUE (as in foo2).
Any objections?
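The internal check mentioned above can be illustrated in a few lines (a minimal sketch; `treated_as_unordered` is a hypothetical helper name, not a ranger function):

```r
# Hypothetical helper mirroring the check described above: a variable
# counts as unordered only if it is a factor but not an ordered factor.
treated_as_unordered <- function(v) is.factor(v) && !is.ordered(v)

chars <- c("b", "a", "c")
treated_as_unordered(chars)                          # FALSE: still a character vector
treated_as_unordered(factor(chars))                  # TRUE: plain (unordered) factor
treated_as_unordered(factor(chars, ordered = TRUE))  # FALSE: ordered factor
```

This is why a bare character column misses the unordered path entirely: it has to be coerced to a plain factor first.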
from ranger.
Absolutely not, and thanks for the quick reply. I would say that if I have a character feature and have set respect.unordered.factors = TRUE, I would expect the coercion to factor to produce an unordered factor as well. Actually, I am just going through your excellent vignette, and as long as the behaviour is documented, it shouldn't matter either way.
@mnwright Adding some further information to this discussion -- let me know if this requires an issue of its own. To summarize the discussion so far, resolving this issue will involve changing how character variables are coerced to factor (currently coerced to ordered factor) when respect.unordered.factors = TRUE is set.
However, I have run into a deeper issue. When I have some unordered factor variables in the data (declared as such) and pass them to ranger, ranger hangs and does not appear to be doing anything.
Here is a reproducible example:
library(wakefield)
library(ranger)
# simulate data using wakefield
sample_size = 1e4
df_foo = r_data_frame(
  n = sample_size,
  ID = id,
  y = dummy,
  r_series(wakefield::r_sample_factor, j = 10, n = sample_size, name = "Factor")
)
# create the formula object
formula_ranger = as.formula(paste0("as.factor(y) ~", paste0("Factor_", 1:10, collapse = "+")))
# run ranger
# HANGS!
ranger_foo = ranger(formula = formula_ranger,
                    data = df_foo,
                    num.trees = 100,
                    mtry = 4,
                    write.forest = TRUE,
                    respect.unordered.factors = TRUE,
                    verbose = TRUE)
What is more surprising is that ranger works fine if I explicitly convert the factor variables to dummies before passing the data to ranger:
# create the dummies by hand
mm_foo = cbind.data.frame(y = df_foo$y, model.matrix(object = formula_ranger, data = df_foo))
# run ranger
ranger_foo = ranger(formula = NULL,
                    data = mm_foo,
                    dependent.variable.name = "y",
                    num.trees = 100,
                    mtry = 4,
                    write.forest = TRUE,
                    respect.unordered.factors = TRUE,
                    verbose = TRUE,
                    seed = 1234,
                    classification = TRUE)
It doesn't hang; it's just computing for ages (the unordered-factor mode is not optimised and is very slow for many factor levels). However, if you reduce the sample size, another error occurs because of the as.factor() in the formula. This bug is fixed now.
As suggested, characters are now considered unordered if respect.unordered.factors = TRUE.
Martin,
So just to confirm: is it always better to pre-compute the dummy-variable encoding for factor variables and pass the model.matrix as above? Would the two models always be equivalent?
I will try to set up a more reasonably sized example to test this, but I am not sure how I would handle the randomness in the two cases.
T
Sorry for the delay! The models are not equivalent. I guess the relative performance of the models depends on the data. It would be interesting to compare the performance of the three options (with a tuned mtry value).
@mnwright Any suggestions on the simulation setups to use to test this difference?
I would start with some multinomially distributed features with, say, 4, 8, 12 and 16 categories. The effects could be simulated by a tree. It's important to tune the mtry value because the dummy approach increases the number of features. For evaluation I would use separate training and testing data instead of the OOB error.
Maybe there is even a real dataset you could use?
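A minimal sketch of such a setup (all variable names and the tree-structured effect here are illustrative assumptions, not details from the thread):

```r
set.seed(1)
n <- 1000
n_cats <- c(4, 8, 12, 16)

# multinomially distributed features with 4, 8, 12 and 16 categories
X <- as.data.frame(lapply(n_cats, function(k)
  factor(sample(seq_len(k), n, replace = TRUE))))
names(X) <- paste0("F", n_cats)

# effect simulated by a simple tree: one split on a subset of F4's levels
y <- ifelse(X$F4 %in% c("1", "2"), rnorm(n, mean = 1), rnorm(n, mean = -1))
dat <- cbind(X, y = y)

# evaluate on held-out data instead of the OOB error
train_idx <- sample(n, size = 0.7 * n)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```

The same `dat` can then be fed to ranger once with the factors as-is and once after dummy expansion via model.matrix, tuning mtry separately for each.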
Two more questions regarding the order:
How do you determine the order of the character variable?
In "The Elements of Statistical Learning", chapter 9.2.4 (http://statweb.stanford.edu/~tibs/ElemStatLearn/), they suggest ordering unordered variables, in the case of a binary outcome, by their proportion of appearances in outcome class 1, and similarly in the regression case (possibly this could also be applied in the multiclass case via some similarity approach).
Is this implemented in ranger? Maybe this would be a possibility to speed up the computation.
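For the binary case the trick can be sketched like this (illustrative code for the idea from that chapter, not ranger's actual implementation):

```r
set.seed(42)
x <- factor(sample(letters[1:5], 200, replace = TRUE))
y <- rbinom(200, 1, prob = ifelse(x %in% c("b", "d"), 0.8, 0.2))

# order the levels by the proportion of observations in outcome class 1 ...
prop_class1 <- tapply(y, x, mean)
x_ord <- factor(x, levels = names(sort(prop_class1)), ordered = TRUE)

# ... so that only nlevels - 1 ordered splits need to be searched instead of
# 2^(nlevels - 1) - 1 unordered partitions
nlevels(x_ord) - 1        # 4 candidate splits
2^(nlevels(x) - 1) - 1    # 15 partitions in the naive search
```

The saving grows quickly with the number of levels, which is exactly where the exhaustive partition search becomes infeasible.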
For unordered factors the internal coding of R is used. I think the levels are ordered alphabetically and numbered starting from 1.
Thanks for the hint on the book. Very interesting that we can get the best split without trying all unordered splits! I will also look into the approximations for multicategory outcomes.
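A quick check of R's default coding (a minimal illustration): levels are sorted alphabetically and the integer codes follow that order, starting at 1.

```r
f <- factor(c("pear", "apple", "banana"))
levels(f)      # "apple" "banana" "pear"  (alphabetical)
as.integer(f)  # 3 1 2  (codes follow the level order, starting at 1)
```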
The approach described by Hastie et al. is now added and used by default. I'll close here.
I think you should reopen the issue here. If you want, I would offer to help with the discussion. I do think you made some progress before.
Here are my two cents on this:
- ranger should IMHO really support clever and faster splitting for unordered factors. This is important, as the current approach is really too slow, and the limitation to < 20 or 40 or 60 levels hurts, too.
- IMHO you already made a lot of progress in the last attempt.
- For MSE regression and binary classification you have to re-sort the levels for each newly tried split. @PhilippPro has linked to the details. IIRC it is provable that the true best split is then contained in the linear search over this order.
- This faster implementation should always be used for respect.unordered.factors = TRUE, in the cases I have outlined.
- For all other cases (multiclass or survival or whatever) I currently don't know a better way for unordered factors. But who cares? The faster implementation for the two cases discussed here already improves ranger A LOT.
- It should be properly documented what happens. Not hard.
- respect.unordered.factors = TRUE should IMHO always be the default.