Comments (13)
In the current implementation you have to encode the characters as factors for them to be considered unordered (internally we check for is.factor & !is.ordered). In foo2 the characters are converted to ordered factors, so there should be no computational difference from foo3 and foo4.
You are right to be confused: characters should be considered unordered if respect.unordered.factors = TRUE (as in foo2).
Any objections?
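The internal check mentioned above can be illustrated in a few lines (a minimal sketch; `treated_as_unordered` is a hypothetical helper name, not a ranger function):

```r
# Hypothetical helper mirroring the check described above: a variable
# counts as unordered only if it is a factor but not an ordered factor.
treated_as_unordered <- function(v) is.factor(v) && !is.ordered(v)

chars <- c("b", "a", "c")
treated_as_unordered(chars)                          # FALSE: still a character vector
treated_as_unordered(factor(chars))                  # TRUE: plain (unordered) factor
treated_as_unordered(factor(chars, ordered = TRUE))  # FALSE: ordered factor
```

This is why a bare character column misses the unordered path entirely: it has to be coerced to a plain factor first.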
from ranger.
Absolutely not, and thanks for the quick reply. I would say that if I have a character feature and have set respect.unordered.factors = TRUE, I would expect the coercion to factor to produce an unordered factor as well. Actually, I am just going through your excellent vignette, and as long as the behaviour is documented, it shouldn't matter either way.
@mnwright Adding some further information to this discussion -- let me know if this requires an issue of its own. To summarize the discussion so far, resolving this issue will involve changing how character variables are coerced to factor (currently coerced to ordered factor) when respect.unordered.factors = TRUE is set.
However, I have run into a deeper issue. When I have some unordered factor variables in the data (declared as such) and pass them to ranger, ranger hangs and does not appear to be doing anything.
Here is a reproducible example:
library(wakefield)
library(ranger)
# simulate data using wakefield
sample_size = 1e4
df_foo = r_data_frame(
  n = sample_size,
  ID = id,
  y = dummy,
  r_series(wakefield::r_sample_factor, j = 10, n = sample_size, name = "Factor")
)
# create the formula object
formula_ranger = as.formula(paste0("as.factor(y) ~", paste0("Factor_", 1:10, collapse = "+")))
# run ranger
# HANGS!
ranger_foo = ranger(formula = formula_ranger,
                    data = df_foo,
                    num.trees = 100,
                    mtry = 4,
                    write.forest = TRUE,
                    respect.unordered.factors = TRUE,
                    verbose = TRUE)
What is more surprising is that ranger works fine if I explicitly convert the factor variables to dummies before passing the data to ranger:
# create the dummies by hand
mm_foo = cbind.data.frame(y = df_foo$y, model.matrix(object = formula_ranger, data = df_foo))
# run ranger
ranger_foo = ranger(formula = NULL,
                    data = mm_foo,
                    dependent.variable.name = "y",
                    num.trees = 100,
                    mtry = 4,
                    write.forest = TRUE,
                    respect.unordered.factors = TRUE,
                    verbose = TRUE,
                    seed = 1234,
                    classification = TRUE)
It doesn't hang; it's just computing for ages (the unordered-factor mode is not optimised and is very slow for many factor levels). However, if you reduce the sample size, another error occurs because of the as.factor() in the formula. This bug is fixed now.
As suggested, characters are now considered unordered if respect.unordered.factors = TRUE.
Martin,
So just to confirm: is it always better to pre-compute the dummy-variable encoding for factor variables and pass the model.matrix as above? Would the two models always be equivalent?
I will try to set up a more reasonably sized example to test this, but I am not sure how I would handle the randomness in the two cases.
T
Sorry for the delay! The models are not equivalent. I guess the relative performance of the models depends on the data. It would be interesting to compare the performance of the three options (with a tuned mtry value).
@mnwright Any suggestions on the simulation setups to use to test this difference?
I would start with some multinomially distributed features with, say, 4, 8, 12 and 16 categories. The effects could be simulated by a tree. It's important to tune the mtry value because the dummy approach increases the number of features. For evaluation I would use separate training and testing data instead of the OOB error.
Maybe there is even a real dataset you could use?
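A minimal sketch of such a setup (all variable names and the tree-structured effect here are illustrative assumptions, not details from the thread):

```r
set.seed(1)
n <- 1000
n_cats <- c(4, 8, 12, 16)

# multinomially distributed features with 4, 8, 12 and 16 categories
X <- as.data.frame(lapply(n_cats, function(k)
  factor(sample(seq_len(k), n, replace = TRUE))))
names(X) <- paste0("F", n_cats)

# effect simulated by a simple tree: one split on a subset of F4's levels
y <- ifelse(X$F4 %in% c("1", "2"), rnorm(n, mean = 1), rnorm(n, mean = -1))
dat <- cbind(X, y = y)

# evaluate on held-out data instead of the OOB error
train_idx <- sample(n, size = 0.7 * n)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```

The same `dat` can then be fed to ranger once with the factors as-is and once after dummy expansion via model.matrix, tuning mtry separately for each.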
Two more questions regarding the order:
How do you determine the order of the character variable?
In "The Elements of Statistical Learning", chapter 9.2.4 (http://statweb.stanford.edu/~tibs/ElemStatLearn/), they suggest ordering unordered variables, in the case of a binary outcome, by their proportion of appearances in outcome class 1, and similarly in the regression case (possibly this could also be applied in the multiclass case via some similarity approach).
Is this implemented in ranger? Maybe this would be a possibility to speed up the computation.
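For the binary case the trick can be sketched like this (illustrative code for the idea from that chapter, not ranger's actual implementation):

```r
set.seed(42)
x <- factor(sample(letters[1:5], 200, replace = TRUE))
y <- rbinom(200, 1, prob = ifelse(x %in% c("b", "d"), 0.8, 0.2))

# order the levels by the proportion of observations in outcome class 1 ...
prop_class1 <- tapply(y, x, mean)
x_ord <- factor(x, levels = names(sort(prop_class1)), ordered = TRUE)

# ... so that only nlevels - 1 ordered splits need to be searched instead of
# 2^(nlevels - 1) - 1 unordered partitions
nlevels(x_ord) - 1        # 4 candidate splits
2^(nlevels(x) - 1) - 1    # 15 partitions in the naive search
```

The saving grows quickly with the number of levels, which is exactly where the exhaustive partition search becomes infeasible.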
For unordered factors the internal coding of R is used. I think the levels are ordered alphabetically and numbered starting from 1.
Thanks for the hint on the book. Very interesting that we can get the best split without trying all unordered splits! I will also look into the approximations for multicategory outcomes.
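A quick check of R's default coding (a minimal illustration): levels are sorted alphabetically and the integer codes follow that order, starting at 1.

```r
f <- factor(c("pear", "apple", "banana"))
levels(f)      # "apple" "banana" "pear"  (alphabetical)
as.integer(f)  # 3 1 2  (codes follow the level order, starting at 1)
```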
The approach described by Hastie et al. is now added and used by default. I'll close here.
I think you should reopen the issue here. If you want, I would offer to help with the discussion. I do think you made some progress before.
Here are my two cents on this:
- ranger should IMHO really support clever and faster splitting for unordered factors. This is important, as the current approach is really too slow, and the limitation to < 20 or 40 or 60 levels hurts, too.
- IMHO you already made a lot of progress in the last attempt.
- For MSE regression and binary classification you have to re-sort the levels for each newly tried split. @PhilippPro has linked to the details. IIRC it is provable that the true best split is then contained in the linear search over this order.
- This faster implementation should always be used for respect.unordered.factors = TRUE, in the cases I have outlined.
- For all other cases (multiclass or survival or whatever) I currently don't know a better way for unordered factors. But who cares? The faster implementation for the two cases discussed here already improves ranger A LOT.
- It should be properly documented what happens. Not hard.
- respect.unordered.factors = TRUE should IMHO always be the default.