
Comments (13)

mnwright commented on July 24, 2024

In the current implementation you have to code the characters as factors to have them considered unordered (internally we check for is.factor & !is.ordered). In foo2 the characters are converted to ordered factors, so there should be no computational difference from foo3 and foo4.

You are right to be confused; characters should be considered unordered if respect.unordered.factors = TRUE (as in foo2).

Any objections?
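
A minimal illustration of the distinction being checked here (this is just base R behaviour, not ranger's internal code):

x_chr <- c("red", "green", "blue")
x_unordered <- factor(x_chr)                  # plain (unordered) factor
x_ordered   <- factor(x_chr, ordered = TRUE)  # ordered factor

is.factor(x_unordered) & !is.ordered(x_unordered)  # TRUE:  treated as unordered
is.factor(x_ordered)   & !is.ordered(x_ordered)    # FALSE: treated as ordered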

tchakravarty commented on July 24, 2024

Absolutely not, and thanks for the quick reply. I would say that if I have a character feature and have set respect.unordered.factors = TRUE, I would expect the coercion to factor to be unordered as well. Actually, I am just going through your excellent vignette, and as long as the behaviour is documented, it shouldn't matter either way.

tchakravarty commented on July 24, 2024

@mnwright Adding some further information to this discussion -- let me know if this requires an issue of its own. To summarize the discussion so far, resolving this issue will involve changing how character variables are coerced to factor (currently coerced to ordered factor) when respect.unordered.factors = TRUE is set.

However, I have run into a deeper issue. When I have some unordered factor variables in the data (declared as such) and pass them to ranger, ranger hangs and does not appear to be doing anything.

Here is a reproducible example:

library(wakefield)
library(ranger)

# using wakefield
sample_size = 1e4
df_foo = r_data_frame(
  n = sample_size,
  ID = id,
  y = dummy,
  r_series(wakefield::r_sample_factor, j = 10, n = sample_size, name = "Factor")
)

# create the formula object
formula_ranger = as.formula(paste0("as.factor(y) ~", paste0("Factor_", 1:10, collapse = "+")))

# run ranger
# HANGS!
ranger_foo = ranger(formula = formula_ranger, 
                    data = df_foo, 
                    num.trees = 100, 
                    mtry = 4,
                    write.forest = TRUE, 
                    respect.unordered.factors = TRUE, 
                    verbose = TRUE)

What is more surprising is that ranger works fine if I explicitly convert the factor variables to dummies before passing the data to ranger:

# create the dummies by hand
mm_foo = cbind.data.frame(y = df_foo$y, model.matrix(object = formula_ranger, data = df_foo))

# run ranger
ranger_foo = ranger(formula = NULL, 
                    data = mm_foo,
                    dependent.variable.name = "y",
                    num.trees = 100, 
                    mtry = 4,
                    write.forest = TRUE, 
                    respect.unordered.factors = TRUE,
                    verbose = TRUE, 
                    seed = 1234,
                    classification = TRUE)

mnwright commented on July 24, 2024

It doesn't hang; it's just computing for ages (the unordered factor mode is not optimised and very slow for many factor levels). However, if you reduce the sample size, another error occurs because of the as.factor() in the formula. This bug is fixed now.

As suggested, characters are now considered as unordered if respect.unordered.factors = TRUE.

tchakravarty commented on July 24, 2024

Martin,

So just to confirm -- is it always better to pre-compute the dummy-variable encoding for the factor variables and pass the model.matrix as above? Would the two models always be equivalent?

I will try to set up a more reasonably sized example to test this, but I am not sure how I would handle the randomness in the two cases.

T

mnwright commented on July 24, 2024

Sorry for the delay! The models are not equivalent. I guess the performance of the models depends on the data. It would be interesting to compare the performance of the three options (with a tuned mtry value).
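
For concreteness, here is a minimal sketch of the three options as I read them (ordered coding, unordered handling, and manual dummy encoding); df_example and its outcome y are placeholders, not data from this thread:

library(ranger)

# Option 1: treat factors via their internal (ordered) coding
fit_ordered <- ranger(y ~ ., data = df_example,
                      respect.unordered.factors = FALSE, num.trees = 500)

# Option 2: treat factors as unordered
fit_unordered <- ranger(y ~ ., data = df_example,
                        respect.unordered.factors = TRUE, num.trees = 500)

# Option 3: dummy-encode the factors by hand and pass the model matrix
df_dummy <- data.frame(y = df_example$y,
                       model.matrix(y ~ . - 1, data = df_example))
fit_dummy <- ranger(y ~ ., data = df_dummy, num.trees = 500)

In each case mtry would need to be tuned separately, since the dummy encoding changes the number of features.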

tchakravarty commented on July 24, 2024

@mnwright Any suggestions on the simulation setups to use to test this difference?

mnwright commented on July 24, 2024

I would start with some multinomially distributed features with, say, 4, 8, 12 and 16 categories. The effects could be simulated by a tree. It's important to tune the mtry value, because the dummy approach will increase the number of features. For evaluation I would go for separate training and test data instead of the OOB error.

Maybe there is even a real dataset you could use?
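
A rough sketch of such a setup, where all concrete choices (sample size, level counts, the tree-like effect) are my own assumptions rather than anything prescribed in the thread:

set.seed(42)
n <- 5000

# multinomially distributed features with 4, 8, 12 and 16 categories
sim_factor <- function(n, k) factor(sample(paste0("lvl", seq_len(k)), n, replace = TRUE))
df_sim <- data.frame(
  x4  = sim_factor(n, 4),
  x8  = sim_factor(n, 8),
  x12 = sim_factor(n, 12),
  x16 = sim_factor(n, 16)
)

# tree-like effect: class-1 probability depends on level subsets of x8 and x16
p <- ifelse(df_sim$x8 %in% paste0("lvl", 1:4),
            ifelse(df_sim$x16 %in% paste0("lvl", 1:8), 0.8, 0.4),
            0.2)
df_sim$y <- factor(rbinom(n, size = 1, prob = p))

# separate training and test data instead of relying on the OOB error
train_idx <- sample(n, n / 2)
df_train <- df_sim[train_idx, ]
df_test  <- df_sim[-train_idx, ]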

PhilippPro commented on July 24, 2024

Two more questions regarding the order:
How do you determine the order of the character variable?

In "The Elements of Statistical Learning", chapter 9.2.4 (http://statweb.stanford.edu/~tibs/ElemStatLearn/), they suggest, for binary outcomes, ordering the levels of an unordered variable by the proportion of observations falling in outcome class 1, and similarly (by the mean of the outcome) in the regression case. Possibly this could also be applied in the multiclass case via some similarity approach.
Is this implemented in ranger? Maybe it would be a way to speed up the computation.
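
As an illustration of that idea (a sketch of the ESL 9.2.4 trick, not ranger's implementation; the helper name is made up): re-order the levels by the class-1 proportion and then treat the variable as ordered.

# order the levels of an unordered factor by the class-1 proportion within each level
order_by_class1 <- function(x, y01) {
  prop1 <- tapply(y01, x, mean)              # class-1 proportion per level
  ordered(x, levels = names(sort(prop1)))    # re-order levels accordingly
}

x <- factor(c("a", "b", "c", "a", "b", "c"))
y <- c(0, 1, 1, 0, 0, 1)
order_by_class1(x, y)
# splitting this as an ordered variable now covers the best unordered split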

mnwright commented on July 24, 2024

For unordered factors the internal coding of R is used. I think the levels are ordered alphabetically and numbered starting from 1.

Thanks for the hint on the book. Very interesting that we can get the best split without trying all unordered splits! I will also look into the approximations for multicategory outcomes.

mnwright commented on July 24, 2024

The approach described by Hastie et al. is now added and used by default. I'll close here.

berndbischl commented on July 24, 2024

@mnwright

I think you should reopen the issue here. If you want, I would offer to help in the discussion. I do think you made some progress before.

berndbischl commented on July 24, 2024

Here are my two cents on this:

  • ranger should IMHO really support clever and faster splitting for unordered factors. This is important, as not doing it makes things far too slow, and the limit of 20, 40 or 60 levels is a pain, too.
  • IMHO you already made a lot of progress in the last attempt.
  • For MSE regression and binary classification you have to re-sort the levels for each new split that is tried; @PhilippPro has linked to the details. IIRC it is provable that the true best split is then contained in the linear search over this order (see the sketch after this list).
  • This faster implementation should always be used for respect.unordered.factors = TRUE in the cases I have outlined.
  • For all other cases (multiclass, survival or whatever) I currently don't know a better way for unordered factors. But who cares? The faster implementation for the two cases discussed here already improves ranger a lot.
  • It should be properly documented what happens. Not hard.
  • respect.unordered.factors = TRUE should IMHO always be the default.
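
A corresponding sketch for the MSE regression case (again only an illustration with a made-up helper name, not ranger code): within a node, the levels are re-sorted by the mean of the numeric outcome before the linear search over splits.

order_by_mean_outcome <- function(x, y) {
  level_means <- tapply(y, x, mean)              # mean outcome per level
  ordered(x, levels = names(sort(level_means)))  # re-order levels by that mean
}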
