evovest / EvoTrees.jl
Boosted trees in Julia
Home Page: https://evovest.github.io/EvoTrees.jl/dev/
License: Apache License 2.0
Happy to contribute if there is a desire to support Tables.jl
I was wondering if we can have a feature where the learning rate is reduced by some percent (user-defined parameter) once the eval metric increases by some amount.
So instead of early-stopping after 20 rounds, the learning rate might be reduced by 90%.
This should allow the model to start learning again.
The idea is to generate more trees in the low loss space of models.
Consistently reducing the learning rate should allow us to move more slowly in this space and harvest a lot more models to average over.
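A sketch of the proposed schedule (hypothetical: train_one_round! and eval_metric stand in for whatever incremental API would back this; they are not EvoTrees functions):
nrounds = 500
eta = 0.1        # initial learning rate
decay = 0.1      # multiply eta by this, i.e. reduce it by 90%, on stagnation
patience = 20    # rounds without improvement before decaying
best, stall = Inf, 0
for round in 1:nrounds
    train_one_round!(model, eta)                   # hypothetical incremental trainer
    metric = eval_metric(model, x_eval, y_eval)    # hypothetical eval-metric helper
    if metric < best
        best, stall = metric, 0
    else
        stall += 1
        if stall >= patience
            eta *= decay   # instead of stopping early, keep learning more slowly
            stall = 0
        end
    end
end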
@jeremiedb You may want to look at this Julia discourse thread. Sorry for opening here - I do not know your Discourse handle.
Add softmax loss function
I have been able to train an EvoTreeRegressor with the default parameters successfully. When I try to increase the max_depth parameter beyond 10, however, my memory usage suddenly spikes and Julia dies.
Here's a snippet from the REPL
julia> evo = EvoTreeRegressor(max_depth=15, rng=42)
EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 10,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 15,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(42),
    device = "cpu")
julia> mach = machine(evo, Xtrain, CDOM_train)
Machine{EvoTreeRegressor{Float64,…},…} trained 0 times; caches data
  args:
    1: Source @710 ⏎ `Table{AbstractVector{Continuous}}`
    2: Source @134 ⏎ `AbstractVector{Continuous}`
julia> fit!(mach, verbosity=2)
[ Info: Training Machine{EvoTreeRegressor{Float64,…},…}.
Process julia killed
I see that the new MLJ models only expose a seed rather than the RNG. Is there a reason for this restriction?
To generate multiple learning curves for an MLJ model, one needs access to the RNG.
For convenience, setting the rng field to an integer i could instantiate a MersenneTwister(i). (In MLJ, most rng fields or keywords can be integers or AbstractRNG.)
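For reference, the MLJ convention amounts to something like this (a sketch; make_rng is just an illustrative name):
using Random
make_rng(rng::Random.AbstractRNG) = rng            # pass an RNG through unchanged
make_rng(i::Integer) = Random.MersenneTwister(i)   # promote an integer seed to an RNG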
Hi, I've noticed that with the recent PR the feature importance function no longer allows you to pass feature names as an argument. Consider the example
using EvoTrees
using Statistics
using StatsBase: sample
# prepare a dataset
features = rand(Int(1.25e6), 100)
# features = rand(100, 10)
X = features
Y = rand(size(X, 1))
𝑖 = collect(1:size(X, 1))
# train-eval split
𝑖_sample = sample(𝑖, size(𝑖, 1), replace=false)
train_size = 0.8
𝑖_train = 𝑖_sample[1:floor(Int, train_size * size(𝑖, 1))]
𝑖_eval = 𝑖_sample[floor(Int, train_size * size(𝑖, 1))+1:end]
x_train, x_eval = X[𝑖_train, :], X[𝑖_eval, :]
y_train, y_eval = Y[𝑖_train], Y[𝑖_eval]
config = EvoTreeClassifier(
    loss=:linear,
    nrounds=100,
    nbins=100,
    lambda=0.5,
    gamma=0.1,
    eta=0.1,
    max_depth=6,
    min_weight=1.0,
    rowsample=0.5,
    colsample=1.0)
model = fit_evotree(config; x_train=x_train, y_train=y_train, x_eval=x_eval, y_eval=y_eval, print_every_n=1)
display(importance(model))
which gives an output
4-element Vector{Pair{String, Float64}}:
"feat_3" => 0.26565039212451985
"feat_4" => 0.2589711676696925
"feat_1" => 0.24700503862705744
"feat_2" => 0.22837340157873026
Could you please add another method that allows feature names to be passed to importance, as was previously possible? Thanks
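In the meantime, a workaround sketch (assuming importance(model) returns "feat_i" => gain pairs, as in the output above; importance_named is a hypothetical helper):
function importance_named(model, fnames::AbstractVector{<:AbstractString})
    pairs = importance(model)   # "feat_i" => gain, sorted by gain
    [fnames[parse(Int, split(first(p), "_")[end])] => last(p) for p in pairs]
end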
Let us consider a real ML problem (I cannot show the feature names, but each row is a separate feature) with 600k observations:
As we can see, we have many categorical features, with approximately 1500 categories per feature on average.
Let us approximate the real features with random one-hot encoded features represented by sparse matrices, and try to apply both XGBoost and EvoTrees:
As we can see, XGBoost handles very sparse data very well, while EvoTrees allocates too much memory and crashes with an OOM error.
The notebook is attached
julia_evotrees.zip
The code to reproduce as the plain script:
# -*- coding: utf-8 -*-
# ---
# jupyter:
# jupytext:
# formats: ipynb,jl:light
# text_representation:
# extension: .jl
# format_name: light
# format_version: '1.5'
# jupytext_version: 1.11.1
# kernelspec:
# display_name: Julia 1.6.0-rc3
# language: julia
# name: julia-1.6
# ---
# + tags=[]
using Pkg; Pkg.activate(".");
# + tags=[]
Pkg.add(["Statistics", "StatsBase", "Revise", "EvoTrees", "XGBoost", "BenchmarkTools", "CUDA", "SparseArrays"]);
# + tags=[]
using Statistics
using StatsBase
using XGBoost
using Revise
using EvoTrees
using BenchmarkTools
using CUDA
using SparseArrays
# + tags=[]
nrounds = 200;
nthread = Threads.nthreads();
# + tags=[]
# xgboost params
params_xgb = ["max_depth" => 5,
              "eta" => 0.05,
              "objective" => "reg:squarederror",
              "print_every_n" => 5,
              "subsample" => 0.5,
              "colsample_bytree" => 0.5,
              "tree_method" => "hist",
              "max_bin" => 64]
metrics = ["rmse"]
# + tags=[]
# EvoTrees params
params_evo = EvoTreeRegressor(T=Float32,
    loss=:linear, metric=:mse,
    nrounds=nrounds, α=0.5,
    λ=0.0, γ=0.0, η=0.05,
    max_depth=6, min_weight=1.0,
    rowsample=0.5, colsample=0.5, nbins=64)
# + tags=[]
function random_select(n::Int64, K::Int64)
    # draw n distinct indices from 0:K-1, in increasing order (selection sampling:
    # each candidate t is kept with probability (n-m)/(K-t))
    @assert 0 <= n <= K
    sample = Vector{Int64}(undef, n)
    t = Int64(0)   # candidates visited so far
    m = Int64(0)   # candidates selected so far
    while m < n
        if (K - t) * rand() >= n - m
            t += 1
        else
            m += 1
            sample[m] = t
            t += 1
        end
    end
    sample
end
# + tags=[]
function create_sparseMatrix(n::Int64, N::Int64, M::Int64)
    # N×M sparse matrix with n ones placed uniformly at random
    @assert (0 <= N) && (0 <= M)
    @assert 0 <= n <= N * M
    nonZero = random_select(n, N * M)
    # column major: k = i + j*N
    I = map(k -> mod(k, N), nonZero)
    J = map(k -> div(k, N), nonZero)
    sparse(I .+ 1, J .+ 1, ones(n), N, M)
end
# -
# same idea as a one-liner: N×M sparse matrix with K ones at random positions
sparseones(N, M, K) = sparse(
    (x -> (first.(x) .+ 1, last.(x) .+ 1))(divrem.(sample(0:N*M-1, K, replace=false), M))...,
    ones(K), N, M
)
# + tags=[]
nobs = Int(600000)
num_feat = Int(50);
n_cats_per_feature = Int(1500);
@info "testing with: $nobs observations | $num_feat features."
# + tags=[]
X = sparseones(nobs, num_feat*n_cats_per_feature, Int64(nobs*num_feat));
# + tags=[]
Y = rand(size(X, 1));
# + tags=[]
@info "xgboost train:"
@time m_xgb = xgboost(X, nrounds, label=Y, param=params_xgb, metrics=metrics, nthread=nthread, silent=1);
# @btime xgboost($X, $nrounds, label=$Y, param=$params_xgb, metrics=$metrics, silent=1);
# + tags=[]
@info "xgboost predict:"
@time pred_xgb = XGBoost.predict(m_xgb, X);
# @btime XGBoost.predict($m_xgb, $X);
# + tags=[]
@info "evotrees train CPU:"
params_evo.device = "cpu"
@time m_evo = fit_evotree(params_evo, X, Y);
# @btime fit_evotree($params_evo, $X, $Y);
# -
@info "evotrees predict CPU:"
@time pred_evo = EvoTrees.predict(m_evo, X);
#@btime EvoTrees.predict($m_evo, $X);
# + tags=[]
CUDA.allowscalar(false)
@info "evotrees train GPU:"
params_evo.device = "gpu"
@time m_evo_gpu = fit_evotree(params_evo, X, Y);
#@btime fit_evotree($params_evo, $X, $Y);
# + tags=[]
@info "evotrees predict GPU:"
@time pred_evo = EvoTrees.predict(m_evo_gpu, X);
#@btime EvoTrees.predict($m_evo_gpu, $X);
# -
Hi and thanks for this package!
I noticed a bug when running the following code: I get a black screen for 5 minutes, then the attached screen. We tried the same code under a Linux distribution (also using VS Code) and everything went fine, so I guess it's a Windows 10-linked issue! Please feel free to ask if you need more info.
using EvoTrees
println(stdout, "Tuning EvoTrees...")
MLJ.@load EvoTreeClassifier
evotree_model_base = EvoTreeClassifier(nrounds=100)
# Step 1 of our tuning
# rnd_search, StrCV, probabilistic_accuracy, X, y and train_indexes are defined elsewhere
evotree_step1 = [range(evotree_model_base, :max_depth; lower=3, upper=10, unit=1),
                 range(evotree_model_base, :min_weight; lower=1, upper=6, unit=1)]
tuned_evotree_model = TunedModel(model=evotree_model_base,
                                 tuning=rnd_search,
                                 resampling=StrCV,
                                 repeats=3,
                                 # n=5,
                                 range=evotree_step1,
                                 measure=probabilistic_accuracy,
                                 acceleration=CPUThreads())
tuned_evotree = machine(tuned_evotree_model, X, y)
println(stdout, "EvoTrees: Step 1")
@time fit!(tuned_evotree, verbosity=1, rows=train_indexes)
savefig(plot(tuned_evotree), "HeatMap_EvoTrees_Step1.png")
evotree_model = fitted_params(tuned_evotree).best_model
Is there a built-in API for adding data online? Or a plan to support the MLJ supports_online trait?
Support non-one-hot-encoded categorical features: features carrying item info as an Int (1 to N levels); see the sketch after this list.
Consider changing from Matrix to a DataFrames input structure to handle mixed features.
Consider supporting a mix of input structures: DataFrames + SparseMatrix, for efficient handling of a mixture of dense (continuous and categorical) and sparse features.
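For illustration, the first item would let a 1500-level categorical feature enter as a single integer column instead of 1500 one-hot columns (a sketch of the proposed input shape, not a current API):
x_cont = rand(1000)             # continuous feature
x_cat  = rand(1:1500, 1000)     # categorical feature coded 1..1500, no one-hot expansion
X = hcat(x_cont, x_cat)         # one column per feature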
First, thanks for the package - super user-friendly and blazing fast.
I'm trying to plot the tree as in the tutorial, and it doesn't seem to work. If I do
plot(model)
It gives me a single box with the bias (which is cool). But if I try something like
plot(model, 2)
what I get is a stack trace (which is less cool):
julia> plot(model, 3)
ERROR: type UnionAll has no field layout
Stacktrace:
[1] getproperty(x::Type, f::Symbol)
@ Base ./Base.jl:28
[2] macro expansion
@ ~/.julia/packages/EvoTrees/pYJaO/src/plot.jl:108 [inlined]
[3] apply_recipe(plotattributes::AbstractDict{Symbol, Any}, model::EvoTrees.GBTree, n::Any, var_names::Any)
@ EvoTrees ~/.julia/packages/RecipesBase/3fzVq/src/RecipesBase.jl:283
[4] apply_recipe(plotattributes::AbstractDict{Symbol, Any}, model::EvoTrees.GBTree, n::Any)
@ EvoTrees ~/.julia/packages/RecipesBase/3fzVq/src/RecipesBase.jl:277
[5] _process_userrecipes!(plt::Any, plotattributes::Any, args::Any)
@ RecipesPipeline ~/.julia/packages/RecipesPipeline/Bxu2O/src/user_recipe.jl:36
[6] recipe_pipeline!(plt::Any, plotattributes::Any, args::Any)
@ RecipesPipeline ~/.julia/packages/RecipesPipeline/Bxu2O/src/RecipesPipeline.jl:70
[7] _plot!(plt::Plots.Plot, plotattributes::Any, args::Any)
@ Plots ~/.julia/packages/Plots/5kcBO/src/plot.jl:208
[8] plot(::Any, ::Vararg{Any, N} where N; kw::Any)
@ Plots ~/.julia/packages/Plots/5kcBO/src/plot.jl:91
[9] plot(::Any, ::Any)
@ Plots ~/.julia/packages/Plots/5kcBO/src/plot.jl:85
[10] top-level scope
@ REPL[134]:1
The issue is in the call to Buchheim, and if I replace it with
tree_layout = length(adj) == 1 ? [[0.0,0.0]] : NetworkLayout.Buchheim()(adj)
it works - I can submit a PR if you want.
I just ran the two examples in the README; the only difference is the device parameter. The MLJ interface doesn't use the GPU when calling fit!, while the internal API version uses the GPU well.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
I found that EvoTrees is the only pure-Julia gradient boosting package. I appreciate it.
How do I do hyperparameter optimization?
You have compared the run time of the code; what about the quality of the results?
Please add an example comparing results on a real dataset, with hyperparameter optimization.
Thank you
Is the way to go to call "fit" for one round at a time, manually?...
I took the code from the README. If you use the GPU, it doesn't work after scalar indexing is disallowed.
using EvoTrees
using EvoTrees: sigmoid, logit
using StatsBase: sample
# prepare a dataset
features = rand(10000*20) .* 20 .- 10
X = reshape(features, (10000, 20))
Y = sin.(features) .* 0.5 .+ 0.5
Y = logit(Y) + randn(size(Y))
Y = sigmoid(Y)
i = collect(1:size(X, 1))
# train-eval split
i_sample = sample(i, size(i, 1), replace = false)
train_size = 0.8
i_train = i_sample[1:floor(Int, train_size * size(i, 1))]
i_eval = i_sample[floor(Int, train_size * size(i, 1))+1:end]
X_train, X_eval = X[i_train, :], X[i_eval, :]
Y_train, Y_eval = Y[i_train], Y[i_eval]
params1 = EvoTreeRegressor(
    loss=:linear, metric=:mse,
    nrounds=100, nbins=100,
    λ=0.5, γ=0.1, η=0.1,
    max_depth=6, min_weight=1.0,
    rowsample=0.5, colsample=1.0, device="gpu")
using CUDA
CUDA.allowscalar(false)
@time model = fit_evotree(params1, X_train, Y_train, X_eval = X_eval, Y_eval = Y_eval, print_every_n = 25)
Hi and thanks for this package :)
I was trying to plot my decision tree using the plot(model, n) method outlined at the end of the Readme tutorial, but got the following error:
ERROR: Cannot convert EvoTrees.GBTree{1,Float32,Int64} to series data for plotting.
I came back to this repo and copy-pasted all the code from the Readme file, and still got the same error, so it should be fairly easy to replicate.
I am using EvoTrees v0.5.3, Plots v1.6.12 and Julia v1.4.2 on VSCode v1.52.1.
Hello,
not sure if I missed something, but I get an error opening a model file built with the GPU on a non-CUDA machine.
So, I am using: julia 1.8.2, EvoTrees 0.12.3, WIN10, on both machines.
On one machine I built the model with the GPU (an NVIDIA card, with CUDA.jl and the CUDA toolkit).
The model is saved with the JLSO package.
Then later I want to open this model on another machine, without the CUDA toolkit, and I get this error:
[warn | JLSO]: Could not find the CUDA driver library. Please make sure you have installed the NVIDIA driver for your GPU.
If you're sure it's installed, look for nvcuda.dll
in your system and make sure it's discoverable by the linker.
Typically, that involves adding an entry to PATH.
There is of course a workaround: not using the GPU.
But it would be nice to have what e.g. Flux.jl has: cpu(model) <--> gpu(model), to "translate" a model from the "GPU world" to the "CPU-only world".
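Something like the following would cover it (hypothetical API mirroring Flux.jl; EvoTrees.cpu does not exist today):
m_cpu = EvoTrees.cpu(m_gpu)                # hypothetical: convert CuArray fields to Array
JLSO.save("model.jlso", :model => m_cpu)   # now loadable on a CUDA-free machine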
The typical use case is to build a model on a powerful GPU machine, then later use it for the prediction phase on a regular, weaker machine.
Is there something I missed? Can I use a model built with the GPU on a non-CUDA machine?
Cheers, Lubo
I am making JLBoost.jl, which is similar in many ways.
Add a seed parameter to provide reproducibility when sampling rows and features.
In preparing PR #158 I discovered that the new document strings for each model include this section:
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X, y)
where
X: any table of input features (eg, a DataFrame) whose columns are Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)
However, this is at odds with the input_scitype declarations, which allow tables and vectors but require all columns to be Continuous.
My guess is that the requirement in the doc-string is what actually works, and it is just the input_scitype declarations that need updating. @jeremiedb Can you confirm?
Wondering if there is a good reason this was dropped?
This is causing a minor issue, perhaps related to #64, in that the MLJ Model registry is still generated using Julia 1.3. This means that the model metadata that goes into the registry for EvoTrees.jl is from the last version supporting Julia 1.3, which excludes, for example, the fact that iteration_parameter is now :rounds instead of nothing.
There may be another solution, which involves generating the registry using a later Julia version, but I was trying to delay that until the next LTS is announced.
I wonder if there's a way to iteratively train over chunks of input data (or even row by row), manually. We deal with data much larger than RAM that also doesn't fit the table interface -- in short, each "row" can contain many variables, some of which are vectors of unfixed length, so we need to compute the input to EvoTrees on the fly.
How can an EvoTrees model handle data with missing values? Is there any solution to make EvoTrees compatible with missing values (or NaN), as the XGBoost model is?
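In the meantime, a common workaround is to impute before training, since EvoTrees expects a dense numeric matrix (a sketch; the sentinel value is arbitrary):
X = [1.0 missing; 2.0 3.0]          # toy feature matrix with a missing entry
X_imp  = coalesce.(X, -999.0)       # replace missings with a sentinel value
X_flag = Float64.(ismissing.(X))    # optional missingness-indicator columns
X_full = hcat(X_imp, X_flag)        # dense matrix EvoTrees can consume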
Maybe we can merge projects?
EvoTrees: https://github.com/Evovest/EvoTrees.jl
MemoryConstrainedTreeBoosting: https://github.com/brianhempel/MemoryConstrainedTreeBoosting.jl
MemoryConstrainedTreeBoosting.jl is a library I've been working on for a couple years that would allow me to control the loading and binning of data, so I could (a) do feature engineering in Julia and (b) use all the memory on my machine for the binned data. I've also spent a lot of time on speed because 10% faster ≈ training done 1 day sooner for my data sets. I think it is quite fast. I didn't bother documenting the library until today, however.
With our powers combined...!
A benchmark below. 4 threads on my 2013 quad-core i7-4960HQ.
pkg> add https://github.com/brianhempel/MemoryConstrainedTreeBoosting.jl
using Statistics
using StatsBase:sample
using Revise
using EvoTrees
nrounds = 200
# EvoTrees params
params_evo = EvoTreeRegressor(T=Float32,
    loss=:logistic, metric=:logloss,
    nrounds=nrounds,
    λ=0.5, γ=0.0, η=0.05,
    max_depth=6, min_weight=1.0,
    rowsample=1.0, colsample=0.5, nbins=64)
# MemoryConstrainedTreeBoosting params
params_mctb = (
    weights = nothing,
    bin_count = 64,
    iteration_count = nrounds,
    min_data_weight_in_leaf = 1.0,
    l2_regularization = 0.5,
    max_leaves = 32,
    max_depth = 6,
    max_delta_score = 1.0e10, # Before shrinkage.
    learning_rate = 0.05,
    feature_fraction = 0.5, # Per tree.
    bagging_temperature = 0.0,
)
nobs = Int(1e6)
num_feat = Int(100)
@info "testing with: $nobs observations | $num_feat features."
X = rand(Float32, nobs, num_feat)
Y = Float32.(rand(Bool, size(X, 1)))
@info "evotrees train CPU:"
params_evo.device = "cpu"
@time m_evo = fit_evotree(params_evo, X, Y);
@time fit_evotree(params_evo, X, Y);
@info "evotrees predict CPU:"
@time pred_evo = EvoTrees.predict(m_evo, X);
@time EvoTrees.predict(m_evo, X);
import MemoryConstrainedTreeBoosting
@info "MemoryConstrainedTreeBoosting train CPU:"
@time bin_splits, trees = MemoryConstrainedTreeBoosting.train(X, Y; params_mctb...);
@time MemoryConstrainedTreeBoosting.train(X, Y; params_mctb...);
@info "MemoryConstrainedTreeBoosting predict CPU, JITed:"
save_path = tempname()
MemoryConstrainedTreeBoosting.save(save_path, bin_splits, trees)
unbinned_predict = MemoryConstrainedTreeBoosting.load_unbinned_predictor(save_path)
@time pred_mctb = unbinned_predict(X)
@time unbinned_predict(X)
$ JULIA_NUM_THREADS=4 julia --project=. experiments/benchmarks_v2.jl
[ Info: testing with: 1000000 observations | 100 features.
[ Info: evotrees train CPU:
98.929771 seconds (64.89 M allocations: 21.928 GiB, 2.12% gc time)
83.160324 seconds (187.35 k allocations: 18.400 GiB, 1.69% gc time)
[ Info: evotrees predict CPU:
2.458015 seconds (4.50 M allocations: 246.320 MiB, 38.75% compilation time)
1.598223 seconds (4.59 k allocations: 4.142 MiB)
[ Info: MemoryConstrainedTreeBoosting train CPU:
20.320708 seconds (16.04 M allocations: 2.480 GiB, 1.48% gc time, 0.01% compilation time)
15.954224 seconds (3.10 M allocations: 1.714 GiB, 2.66% gc time)
[ Info: MemoryConstrainedTreeBoosting predict CPU, JITed:
14.364365 seconds (11.80 M allocations: 692.582 MiB, 25.95% compilation time)
0.778851 seconds (40 allocations: 30.520 MiB)
Hello,
Thank you for the work here!
Apologies if this is not the right place for the following question. As I understand it, the MLJModelInterface.fit method for EvoTypes does not allow for general tables (the machine interface works well because it calls the reformat function beforehand):
using EvoTrees
using MLJBase
n = 100
X = MLJBase.table(rand(n, 3))
y = rand(n)
evo = EvoTreeRegressor()
MLJBase.fit(evo, 1, X, y)
From the MLJ doc I thought that this should be the case, or am I understanding it wrong?
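If so, a workaround sketch is to apply the model's own reformat step first, as the machine interface does internally:
using MLJModelInterface
const MMI = MLJModelInterface
data = MMI.reformat(evo, X, y)   # the step `machine` performs before calling fit
MLJBase.fit(evo, 1, data...)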
Ran into a bug when trying to tune an EvoTreeRegressor in MLJ. Isolated it to this:
using MLJ
using EvoTrees
using MLJModelInterface
const MMI = MLJModelInterface
X, y = @load_boston
model = (@load EvoTreeRegressor)()
data = MMI.reformat(model, X, y)
f, c, r = MMI.fit(model, 2, data...);
model.λ = 0.1
julia> MMI.update(model, 2, f, c, data...);
ERROR: ArgumentError: Function `matrix` only supports AbstractMatrix or containers implementing the Tables interface.
Stacktrace:
[1] matrix(::FullInterface, ::Val{:other}, X::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}}; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
[2] matrix(::FullInterface, ::Val{:other}, X::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}})
@ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
[3] matrix(X::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}}; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27
[4] matrix
@ ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27 [inlined]
[5] reformat(#unused#::EvoTrees.EvoTreeRegressor{Float64, EvoTrees.Linear, Int64}, X::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}}, y::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
@ EvoTrees ~/.julia/packages/EvoTrees/oHLKA/src/MLJ.jl:24
[6] update(model::EvoTrees.EvoTreeRegressor{Float64, EvoTrees.Linear, Int64}, verbosity::Int64, fitresult::EvoTrees.GBTree{1, Float64, Int64}, cache::NamedTuple{(:params, :X, :Y_cpu, :pred_cpu, :𝑖_, :𝑗_, :𝑖, :𝑗, :δ, :δ², :𝑤, :edges, :X_bin, :train_nodes, :splits, :hist_δ, :hist_δ², :hist_𝑤), Tuple{EvoTrees.EvoTreeRegressor{Float64, EvoTrees.Linear, Int64}, Matrix{Float64}, Vector{Float64}, Vector{StaticArrays.SVector{1, Float64}}, Vector{Int64}, Vector{Int64}, Vector{Int64}, Vector{Int64}, Vector{StaticArrays.SVector{1, Float64}}, Vector{StaticArrays.SVector{1, Float64}}, Vector{StaticArrays.SVector{1, Float64}}, Vector{Vector{Float64}}, Matrix{UInt8}, Vector{EvoTrees.TrainNode{1, Float64, Int64}}, Vector{EvoTrees.SplitInfo{1, Float64, Int64}}, Vector{Matrix{StaticArrays.SVector{1, Float64}}}, Vector{Matrix{StaticArrays.SVector{1, Float64}}}, Vector{Matrix{StaticArrays.SVector{1, Float64}}}}}, A::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}}, y::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
@ EvoTrees ~/.julia/packages/EvoTrees/oHLKA/src/MLJ.jl:39
[7] top-level scope
@ REPL[183]:1
Happy to look into this.
This feels like a deal-breaker for people trying to migrate XGBoost -> EvoTrees.
What's the roadblock?
The last line of the following block throws an exception
using Pkg;
Pkg.add(["DataFrames", "CSV", "TabularDisplay", "CategoricalArrays"]);
using DataFrames, CSV, TabularDisplay, CategoricalArrays
Pkg.add(["MLJ", "EvoTrees", "MLJScientificTypes"])
using MLJ, EvoTrees, MLJScientificTypes
num_cols = [
"ClientPeriod",
"MonthlySpending",
"TotalSpent"
];
cat_cols = [
"Sex",
"IsSeniorCitizen",
"HasPartner",
"HasChild",
"HasPhoneService",
"HasMultiplePhoneNumbers",
"HasInternetService",
"HasOnlineSecurityService",
"HasOnlineBackup",
"HasDeviceProtection",
"HasTechSupportAccess",
"HasOnlineTV",
"HasMovieSubscription",
"HasContractPhone",
"IsBillingPaperless",
"PaymentMethod"
];
all_feature_cols = [num_cols; cat_cols];
target_col = "Churn";
#,types=Dict("Sex"=>CategoricalValue{String, UInt32})
# + tags=[]
df = DataFrame!(CSV.File("./train.csv",pool=0.1, missingstrings=[" "]))
categorical!(df,[cat_cols;target_col]);
describe(df,:eltype,:nunique, :nmissing)
dropmissing!(df);
describe(df,:eltype,:nunique, :nmissing)
X = df[!, all_feature_cols];
y = df[!,target_col];
mach_x = machine(ContinuousEncoder(), X)
fit!(mach_x)
X = MLJ.transform(mach_x, X)
tree_model = EvoTreeClassifier(max_depth=6, nrounds=2000,colsample=0.3)
mach = machine(tree_model, X, y)
train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split
fit!(mach, rows=train, verbosity=1)
pred_test = MLJ.predict(mach, selectrows(X, test))
The problem seems to be related to JuliaAI/MLJBase.jl#525
Here is the exception.
DomainError with Probabilities must be in [0,1].:
Stacktrace:
[1] _err_01()
@ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:42
[2] _check_probs_01(probs::Vector{Float32})
@ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:66
[3] _broadcast_getindex_evalf
@ ./broadcast.jl:648 [inlined]
[4] _broadcast_getindex
@ ./broadcast.jl:621 [inlined]
[5] getindex
@ ./broadcast.jl:575 [inlined]
[6] copy
@ ./broadcast.jl:922 [inlined]
[7] materialize
@ ./broadcast.jl:883 [inlined]
[8] UnivariateFinite(::MLJModelInterface.FullInterface, prob_given_class::OrderedCollections.LittleDict{CategoricalValue{Int64, UInt8}, AbstractVector{Float32}, Vector{CategoricalValue{Int64, UInt8}}, Vector{AbstractVector{Float32}}}; kwargs::Base.Iterators.Pairs{Symbol, Union{Missing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:pool, :ordered), Tuple{Missing, Bool}}})
@ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:127
[9] _UnivariateFinite(support::CategoricalVector{Int64, UInt8, Int64, CategoricalValue{Int64, UInt8}, Union{}}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}, N::Int64; augment::Bool, kwargs::Base.Iterators.Pairs{Symbol, Union{Missing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:pool, :ordered), Tuple{Missing, Bool}}})
@ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:245
[10] _UnivariateFinite(support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}, N::Int64; augment::Bool, pool::Missing, ordered::Bool)
@ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:287
[11] #_UnivariateFinite#37
@ ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:308 [inlined]
[12] UnivariateFinite(::MLJModelInterface.FullInterface, support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}; kwargs::Base.Iterators.Pairs{Symbol, Missing, Tuple{Symbol}, NamedTuple{(:pool,), Tuple{Missing}}})
@ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:212
[13] UnivariateFinite(support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}; kwargs::Base.Iterators.Pairs{Symbol, Missing, Tuple{Symbol}, NamedTuple{(:pool,), Tuple{Missing}}})
@ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:431
[14] predict(#unused#::EvoTreeClassifier{Float32, EvoTrees.Softmax, Int64}, fitresult::EvoTrees.GBTree{2, Float32, Int64}, A::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}})
@ EvoTrees ~/.julia/packages/EvoTrees/L5jFX/src/MLJ.jl:56
[15] predict(mach::Machine{EvoTreeClassifier{Float32, EvoTrees.Softmax, Int64}, true}, Xraw::DataFrame)
@ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/operations.jl:83
[16] top-level scope
@ In[22]:1
[17] eval
@ ./boot.jl:360 [inlined]
[18] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1094
This behavior creates a real problem when doing a hyper-parameter search as per https://alan-turing-institute.github.io/MLJ.jl/stable/#Lightning-tour-1
The data is attached
train.zip
Currently feature importances from EvoTrees are accessed via report(). We should update the interface to comply with the new method defined here: https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Feature-importances. In particular I suggest we do the following:
- implement MLJModelInterface.feature_importances(model::M, fitresult, report) to compute feature importances as desired;
- define MLJModelInterface.reports_feature_importances(::Type{<:M}) = true for each EvoTrees model, so MLJ knows we can access feature importances.
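A minimal sketch of these two definitions, using EvoTreeRegressor as an example (the method body is illustrative; it assumes the importances are already available in the report):
using EvoTrees
import MLJModelInterface
MLJModelInterface.reports_feature_importances(::Type{<:EvoTreeRegressor}) = true
function MLJModelInterface.feature_importances(::EvoTreeRegressor, fitresult, report)
    return report.feature_importances   # assumption: stored as "feature" => gain pairs
end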
@jeremiedb It seems the new model does not have MLJModelInterface.Model as a supertype:
using MLJModels
using EvoTrees
julia> ms = MLJModels.finaltypes(MLJModels.Model);
julia> filter(ms) do m
Base.parentmodule(m) == EvoTrees
end
4-element Vector{Type}:
EvoTreeRegressor
EvoTreeClassifier
EvoTreeCount
EvoTreeGaussian
EvoSplineRegressor sounds like a new model. Just updated the MLJ model registry and this appeared as an "orphan" because the package_name is "unknown":
julia> info("EvoSplineRegressor")
(name = "EvoSplineRegressor",
package_name = "unknown",
is_supervised = true,
abstract_type = MLJModelInterface.Deterministic,
deep_properties = (),
docstring = "```\nEvoSplineRegressor(; kwargs...)\n```\n\nA model t...",
fit_data_scitype =
Tuple{Union{ScientificTypesBase.Table{<:Union{AbstractVector{<:ScientificTypesBase.Continuous}, AbstractVector{<:ScientificTypesBase.Count}, AbstractVector{<:ScientificTypesBase.OrderedFactor}}}, AbstractMatrix{ScientificTypesBase.Continuous}}, AbstractVector{<:ScientificTypesBase.Continuous}},
human_name = "evo spline regressor",
hyperparameter_ranges = (nothing,
nothing,
nothing,
nothing,
nothing,
nothing,
nothing,
nothing,
nothing),
hyperparameter_types = ("Int64",
"Symbol",
"Int64",
"Symbol",
"Any",
"Any",
"Union{Nothing, Dict}",
"Any",
"Symbol"),
hyperparameters =
(:nrounds, :opt, :batchsize, :act, :eta, :L2, :knots, :rng, :device),
implemented_methods = [:fit, :predict, :update],
inverse_transform_scitype = ScientificTypesBase.Unknown,
is_pure_julia = false,
is_wrapper = false,
iteration_parameter = :nrounds,
load_path = "EvoLinear.EvoSplineRegressor",
package_license = "unknown",
package_url = "unknown",
package_uuid = "unknown",
predict_scitype = AbstractVector{<:ScientificTypesBase.Continuous},
prediction_type = :deterministic,
reporting_operations = (),
reports_feature_importances = false,
supports_class_weights = false,
supports_online = false,
supports_training_losses = false,
supports_weights = false,
transform_scitype = ScientificTypesBase.Unknown,
input_scitype =
Union{ScientificTypesBase.Table{<:Union{AbstractVector{<:ScientificTypesBase.Continuous}, AbstractVector{<:ScientificTypesBase.Count}, AbstractVector{<:ScientificTypesBase.OrderedFactor}}}, AbstractMatrix{ScientificTypesBase.Continuous}},
target_scitype = AbstractVector{<:ScientificTypesBase.Continuous},
output_scitype = ScientificTypesBase.Unknown)
Maybe this issue should be transferred to EvoLinear.jl.
@JuliaRegistrator register()
julia> rgs = (@load EvoTreeRegressor)();
[ Info: For silent loading, specify `verbosity=0`.
import EvoTrees ✔
julia> X, y = make_regression();
julia> mach = machine(rgs, X, y) |> fit!
[ Info: Training Machine{EvoTreeRegressor{Float64,…},…} @083.
Machine{EvoTreeRegressor{Float64,…},…} @083 trained 1 time; caches data
  args:
    1: Source @339 ⏎ `Table{AbstractVector{Continuous}}`
    2: Source @776 ⏎ `AbstractVector{Continuous}`
julia> typeof(predict(mach, rows=1:3))
Matrix{Float64} (alias for Array{Float64, 2})
julia> predict(mach, rows=1:3)
3×1 Matrix{Float64}:
0.1765737240681419
-0.31087790314425967
0.4214986666800992
The MLJ API requires that predict return the same kind of object as y, i.e. a vector. In particular, the current implementation prevents any kind of evaluation of the model:
julia> rms(predict(mach, rows=:), y)
ERROR: MethodError: no method matching (::RootMeanSquaredError)(::Matrix{Float64}, ::Vector{Float64})
Closest candidates are:
(::RootMeanSquaredError)(::AbstractVector{var"#s1028"} where var"#s1028"<:Real, ::AbstractVector{var"#s1027"} where var"#s1027"<:Real) at /Users/anthony/.julia/packages/MLJBase/j0qGA/src/measures/continuous.jl:78
(::RootMeanSquaredError)(::AbstractVector{var"#s1028"} where var"#s1028"<:Real, ::AbstractVector{var"#s1027"} where var"#s1027"<:Real, ::AbstractVector{var"#s1026"} where var"#s1026"<:Real) at /Users/anthony/.julia/packages/MLJBase/j0qGA/src/measures/continuous.jl:88
Stacktrace:
[1] top-level scope
@ REPL[21]:1
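A possible one-line fix on the interface side (a sketch; Xmatrix stands for whatever the MLJ glue passes to the internal predictor):
# inside EvoTrees' MLJ predict method:
pred = EvoTrees.predict(fitresult, Xmatrix)   # currently an n×1 Matrix
vec(pred)                                     # return a Vector, matching y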
This is holding up: JuliaAI/DataScienceTutorials.jl#162
using EvoTrees
using EvoTrees: sigmoid, logit
using MLJBase
using RDatasets
iris = dataset("datasets", "iris")
iris[!, :is_setosa] = iris[!, :Species] .== "setosa"
features = setdiff(names(iris), ["Species", "is_setosa"])
Y, X, _ = unpack(iris, ==(:is_setosa), in(Symbol.(features)), colname -> true)
train, test = partition(eachindex(Y), 0.7, shuffle=true); # 70:30 split
tree_model = EvoTreeClassifier(
    loss=:linear, metric=:mse,
    nrounds=100, nbins=100,
    λ=0.5, γ=0.1, η=0.1,
    max_depth=6, min_weight=1.0,
    rowsample=0.5, colsample=1.0)
mach = machine(tree_model, X, Y)
results in this warning
Warning: The number and/or types of data arguments do not match what the specified model supports. Suppress this type check by specifying `scitype_check_level=0`.
│
│ Run `@doc EvoTreeClassifier{Float64, EvoTrees.Softmax, Int64}` to learn more about your model's requirements.
│ Commonly, but non exclusively, supervised models are constructed using the syntax `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are constructed with `machine(model, X)`. Here `X` are features, `y` a target, and `w` sample or class weights.
Why could this be happening?
for some ungodly reason we need to let other people use our model and they don't use Julia everywhere
@jeremiedb Saw your recent doc PR and thought I'd flag this. I understand you may not have the bandwidth just now, but should be aware there is now an "official" format for these strings, so you don't work against it unwittingly.
This will allow your classifier to buy into this performance improvement. Apart from the [compat] update, the only other step should be to replace the following line
Line 46 in 4235aa2
with
return MLJModelInterface.UnivariateFinite(fitresult.levels, pred)
which returns an instance of an abstract vector of eltype <: UnivariateFinite instead of a vanilla vector of the same eltype. You should not need to change any tests - this object should have all the same behaviour as the old one.
FYI: I am releasing today an MLJModels update which will also have MLJModelInterface 0.3 as a lower bound (and performance buy-in for all the classifiers with implementations there).
Let me know if you have questions.
Training losses don't seem to be exposed in the MLJ interface. This would be nice. I'm close to finishing an iterative control wrapper for MLJ models, and if models report training losses, one can apply "training progress" modified stopping criteria, such as PQ.
Here are the changes since 0.1.8, which I doubt break anything in EvoTrees:
- Bring in is_same_except and similar gizmos from MLJBase; see JuliaAI/MLJBase.jl#178
- Let all models share the same traits (#19, JuliaAI/MLJBase.jl#163)
- Add save and restore stubs for serialisation
If you wanted to make use of is_same_except, you would need the more restrictive
[compat] MLJModelInterface = "^0.2"
running
mach = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y, cache=false)
mach1 = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y1, cache=false)
mach2 = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y2, cache=false)
...
could add several GB to the GPU memory pool usage after each line is run. Is it possible to free everything used in GPU training, given that I only need the CPU for prediction?
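A possible mitigation sketch, assuming the fitted GPU machines are no longer referenced elsewhere:
mach = mach1 = mach2 = nothing   # drop references to the fitted GPU models
GC.gc(true)                      # let Julia collect the underlying CuArrays
CUDA.reclaim()                   # hand freed pool memory back to the driver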
Is there any chance you could add something similar to lightgbm.LGBMRanker()? There doesn't seem to be any way of doing this easily in Julia without resorting to PyCall and CatBoost...
I have two models stacked, one of which is an ensemble of EvoTreeClassifier models, and I am working with MLJ.
I want to know the feature importance according to the ensemble of EvoTreeClassifiers, but I am getting the following error:
julia> f = fitted_params(tuned_stack)
julia> features_gain = importance(f.best_fitted_params.model2.fitresult, wvn_table)
ERROR: MethodError: no method matching importance(::MLJ.WrappedEnsemble{EvoTrees.GBTree{2,Float32,Int64},EvoTreeClassifier{Float32,EvoTrees.Softmax,Int64}}, ::Array{Any,1})
Closest candidates are:
importance(::EvoTrees.GBTree, ::AbstractArray{T,1} where T)
at C:\Users\ivica\.julia\packages\EvoTrees\9NdxH\src\importance.jl:12
For clarity, the output of f.best_fitted_params.model2.fitresult is
julia> f.best_fitted_params.model2.fitresult
WrappedEnsemble{GBTree{,…},…} @546
I understand that I have to feed an object of type ::EvoTrees.GBTree to the importance() function... How can I do so when my model is an ensemble of ::EvoTrees.GBTrees?
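A workaround sketch, assuming (per MLJ's ensemble internals) that the fitted atoms are stored in the WrappedEnsemble's ensemble field:
using Statistics: mean
wens = f.best_fitted_params.model2.fitresult   # the WrappedEnsemble
imps = [Dict(importance(gbt, wvn_table)) for gbt in wens.ensemble]   # one Dict per atom
avg  = Dict(k => mean(get(d, k, 0.0) for d in imps) for k in keys(first(imps)))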
As discussed in #100
Hi and thank you for your package!
I love how flexible it is.
I've had good results with Cubist models, in which terminal leaves contain linear regression models, as opposed to simple averages.
Would it be hard to include this type of option in EvoTrees?
How do you deploy? Consider adding save/load functions, or the ability to export to a format used by other packages like LightGBM/XGBoost, to leverage dmlc/treelite (a model compiler for decision tree ensembles).
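In the meantime, the Serialization stdlib works for same-version deployments (a sketch; not a stable on-disk format across package versions):
using Serialization
serialize("model.bin", model)       # write the fitted model to disk
model2 = deserialize("model.bin")   # load; requires matching Julia/EvoTrees versions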