
evovest / EvoTrees.jl


Boosted trees in Julia

Home Page: https://evovest.github.io/EvoTrees.jl/dev/

License: Apache License 2.0

Julia 100.00%
boosted-trees decision-tree gbrt gradient-boosting gradientboosting julia logistic machine-learning poisson quantile regression

EvoTrees.jl's People

Contributors

ablaom, amyhxqin, devmotion, dilumaluthge, egenn, github-actions[bot], jeremiedb, john-waczak, juliatagbot, moelf, royiavital, svilupp, tlienart


EvoTrees.jl's Issues

Feature Request

I was wondering if we could have a feature where the learning rate is reduced by some percentage (a user-defined parameter) once the eval metric worsens by some amount.
So instead of early stopping after, say, 20 rounds without improvement, the learning rate would be reduced by 90%.
This should allow the model to start learning again.
The idea is to generate more trees in the low-loss region of model space.
Consistently reducing the learning rate should let us move more slowly through this space and harvest many more models to average over.
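
A minimal sketch of the proposed schedule as a standalone helper (illustrative only; the function name, defaults, and thresholds are hypothetical, and this is not an existing EvoTrees API):

# Given the eval-metric history (lower is better), shrink the learning rate when
# no improvement has been seen for `patience` rounds, instead of stopping early.
function maybe_reduce_eta(eta, eval_history; patience=20, factor=0.1)
    length(eval_history) <= patience && return eta
    best_recent = minimum(eval_history[end-patience+1:end])
    best_before = minimum(eval_history[1:end-patience])
    return best_recent >= best_before ? eta * factor : eta
end

The surrounding training loop would then keep boosting with the reduced eta whenever the returned value changes, rather than terminating.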

Increasing max_depth causes memory leak

I have been able to train an EvoTreeRegressor with the default parameters successfully. When I try to increase the max_depth parameter beyond 10, however, my memory usage suddenly spikes and Julia dies.

Here's a snippet from the REPL

julia> evo = EvoTreeRegressor(max_depth=15, rng=42)
EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 10,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 15,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(42),
    device = "cpu")

julia> mach = machine(evo, Xtrain, CDOM_train)
Machine{EvoTreeRegressor{Float64,…},…} trained 0 times; caches data
  args: 
    1:  Source @710 ⏎ `Table{AbstractVector{Continuous}}`
    2:  Source @134 ⏎ `AbstractVector{Continuous}`


julia> fit!(mach, verbosity=2)
[ Info: Training Machine{EvoTreeRegressor{Float64,…},…}.

Process julia killed
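
For context, a rough back-of-the-envelope estimate of why memory grows so quickly with depth, assuming (this is an assumption about the internals, not something confirmed in the issue) that gradient histograms of size nbins × nfeatures are pre-allocated for every node of the tree:

# Rough, illustrative estimate only: three Float64 histograms (gradient, hessian,
# weight) of size nbins × nfeatures for every node of a complete binary tree.
nbins, nfeats, max_depth = 64, 100, 15
nnodes = 2^max_depth - 1                    # nodes in a complete binary tree
bytes_per_node = 3 * nbins * nfeats * 8     # three histograms × 8 bytes per Float64
total_gib = nnodes * bytes_per_node / 1024^3
println("≈ $(round(total_gib, digits=1)) GiB of histograms per tree")   # ≈ 4.7 GiB

Since each extra level roughly doubles the node count, going from max_depth = 10 to 15 multiplies that footprint by about 32, which is consistent with depth 10 fitting in RAM and depth 15 killing the process.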

Expose RNG (again)

I see that the new MLJ models only expose a seed rather than the RNG. Is there a reason for this restriction?

To generate multiple learning curves for an MLJ model, one needs access to the RNG.

For convenience, setting the rng field to an integer i could instantiate a MersenneTwister(i). (In MLJ, most rng fields or keywords accept either an integer or an AbstractRNG.)
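
For reference, the convenience being suggested could look something like this (an illustrative sketch; the helper name is hypothetical and this is not existing EvoTrees code):

using Random

# Accept either an AbstractRNG or an integer seed, as most MLJ rng fields do.
_resolve_rng(rng::Random.AbstractRNG) = rng
_resolve_rng(seed::Integer) = Random.MersenneTwister(seed)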

Feature Importance Doesn't Allow Feature Names

Hi, I've noticed that since the recent PR, the feature importance function no longer allows you to pass feature names as an argument. Consider the example

using EvoTrees
using Statistics
using StatsBase: sample

# prepare a dataset
features = rand(Int(1.25e6), 100)
# features = rand(100, 10)
X = features
Y = rand(size(X, 1))
𝑖 = collect(1:size(X, 1))

# train-eval split
𝑖_sample = sample(𝑖, size(𝑖, 1), replace=false)
train_size = 0.8
𝑖_train = 𝑖_sample[1:floor(Int, train_size * size(𝑖, 1))]
𝑖_eval = 𝑖_sample[floor(Int, train_size * size(𝑖, 1))+1:end]

x_train, x_eval = X[𝑖_train, :], X[𝑖_eval, :]
y_train, y_eval = Y[𝑖_train], Y[𝑖_eval]

config = EvoTreeRegressor(
    loss=:linear, 
    nrounds=100, 
    nbins=100,
    lambda=0.5, 
    gamma=0.1, 
    eta=0.1,
    max_depth=6, 
    min_weight=1.0,
    rowsample=0.5, 
    colsample=1.0)

model = fit_evotree(config; x_train=x_train, y_train=y_train, x_eval=x_eval, y_eval=y_eval, print_every_n=1)

display(importance(model))

which gives an output

4-element Vector{Pair{String, Float64}}:
 "feat_3" => 0.26565039212451985
 "feat_4" => 0.2589711676696925
 "feat_1" => 0.24700503862705744
 "feat_2" => 0.22837340157873026

Could you please correct this to add another method that allows feature names to be entered into importance as it was previously? Thanks
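
In the meantime, a possible workaround is to map the default feat_i labels back to user-supplied names, along these lines (an illustrative sketch that assumes the "feat_i" naming shown above; the feature_names vector is hypothetical):

# Map the default "feat_i" labels returned by importance() to custom names.
feature_names = ["var_$i" for i in 1:size(x_train, 2)]   # replace with real names
gain = importance(model)
named_gain = [feature_names[parse(Int, last(split(first(p), "_")))] => last(p) for p in gain]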

EvoTrees consumes too much memory and crashes for sparse matrices

Let us consider a real ML problem with 600k observations (I cannot show the feature names). Most features are categorical, with approximately 1,500 categories per feature on average.

Let us approximate the real features with random one-hot-encoded features represented by sparse matrices, and apply both XGBoost and EvoTrees. XGBoost handles this very sparse data well, whereas EvoTrees allocates far too much memory and crashes with an OOM error.

The notebook is attached
julia_evotrees.zip

The code to reproduce, as a plain script:

# -*- coding: utf-8 -*-
# ---
# jupyter:
#   jupytext:
#     formats: ipynb,jl:light
#     text_representation:
#       extension: .jl
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.11.1
#   kernelspec:
#     display_name: Julia 1.6.0-rc3
#     language: julia
#     name: julia-1.6
# ---

# + tags=[]
using Pkg; Pkg.activate(".");

# + tags=[]
Pkg.add(["Statistics", "StatsBase", "Revise", "EvoTrees", "BenchmarkTools", "CUDA", "SparseArrays"]);

# + tags=[]
using Statistics
using StatsBase
using XGBoost
using Revise
using EvoTrees
using BenchmarkTools
using CUDA
using SparseArrays

# + tags=[]
nrounds = 200;
nthread = Threads.nthreads();

# + tags=[]
# xgboost params
params_xgb = ["max_depth" => 5,
         "eta" => 0.05,
         "objective" => "reg:squarederror",
         "print_every_n" => 5,
         "subsample" => 0.5,
         "colsample_bytree" => 0.5,
         "tree_method" => "hist",
         "max_bin" => 64]
metrics = ["rmse"]

# + tags=[]
# EvoTrees params
params_evo = EvoTreeRegressor(T=Float32,
        loss=:linear, metric=:mse,
        nrounds=nrounds, α=0.5,
        λ=0.0, γ=0.0, η=0.05,
        max_depth=6, min_weight=1.0,
        rowsample=0.5, colsample=0.5, nbins=64)

# + tags=[]
# Draw n distinct indices from 0:K-1, returned in increasing order (selection sampling).
function random_select(n::Int64,K::Int64)
    @assert 0<=n<=K

    sample=Vector{Int64}(undef, n)
    t=Int64(0)
    m=Int64(0)

    while m<n
        if (K-t)*rand()>=n-m
            t+=1
        else
            m+=1
            sample[m]=t
            t+=1
        end
    end
    sample
end


# + tags=[]
# Build a sparse N×M matrix with n ones placed at positions drawn uniformly at random.
function create_sparseMatrix(n::Int64,N::Int64,M::Int64)
    @assert (0<=N)&&(0<=M)
    @assert 0<=n<=N*M

    nonZero = random_select(n,N*M)

    # column major: k=i+j*N
    I = map(k->mod(k,N),nonZero)
    J = map(k->div(k,N),nonZero)

    sparse(I.+1,J.+1,ones(n),N,M)
end
# -

# Same idea as above in one line: a sparse N×M matrix with K random ones.
sparseones(N,M,K) = sparse(
  (x->(first.(x).+1,last.(x).+1))(divrem.(sample(0:N*M-1,K,replace=false),M))...,
  ones(K),N,M
)

# + tags=[]
nobs = Int(600000)
num_feat = Int(50);
n_cats_per_feature = Int(1500);
@info "testing with: $nobs observations | $num_feat features."

# + tags=[]
X = sparseones(nobs, num_feat*n_cats_per_feature, Int64(nobs*num_feat));

# + tags=[]
Y = rand(size(X, 1));

# + tags=[]
@info "xgboost train:"
@time m_xgb = xgboost(X, nrounds, label=Y, param=params_xgb, metrics=metrics, nthread=nthread, silent=1);
# @btime xgboost($X, $nrounds, label=$Y, param=$params_xgb, metrics=$metrics, silent=1);

# + tags=[]
@info "xgboost predict:"
@time pred_xgb = XGBoost.predict(m_xgb, X);
# @btime XGBoost.predict($m_xgb, $X);

# + tags=[]
@info "evotrees train CPU:"
params_evo.device = "cpu"
@time m_evo = fit_evotree(params_evo, X, Y);
# @btime fit_evotree($params_evo, $X, $Y);
# -

@info "evotrees predict CPU:"
@time pred_evo = EvoTrees.predict(m_evo, X);
#@btime EvoTrees.predict($m_evo, $X);

# + tags=[]
CUDA.allowscalar(false)
@info "evotrees train GPU:"
params_evo.device = "gpu"
@time m_evo_gpu = fit_evotree(params_evo, X, Y);
#@btime fit_evotree($params_evo, $X, $Y);

# + tags=[]
@info "evotrees predict GPU:"
@time pred_evo = EvoTrees.predict(m_evo_gpu, X);
#@btime EvoTrees.predict($m_evo_gpu, $X);
# -
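
A plausible contributor to the OOM, assuming (not confirmed here) that EvoTrees bins every column into a dense UInt8 matrix internally: the one-hot design above would densify to roughly

# Rough size of a dense nobs × nfeat UInt8 bin matrix for the setup above.
nobs = 600_000
nfeat = 50 * 1_500                          # one-hot expanded columns
gib = nobs * nfeat * sizeof(UInt8) / 1024^3
println("≈ $(round(gib, digits=1)) GiB")    # ≈ 41.9 GiB

which would exhaust memory before any tree is even grown.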


Crash occurred when fitting

Hi and thanks for this package!

I noticed a bug when running the following code. I get a black screen for about 5 minutes, then the attached screen appears. We tried the same code under a Linux distribution (also using VS Code) and everything went fine, so I guess it's a Windows 10-specific issue. Please feel free to ask if you need more info.

(screenshot: vs_crash)

using MLJ, Plots
using EvoTrees
# rnd_search, StrCV, probabilistic_accuracy, X, y and train_indexes are defined
# earlier in the session (not shown here).
println(stdout, "Tuning EvoTrees...")
MLJ.@load EvoTreeClassifier
evotree_model_base = EvoTreeClassifier(nrounds=100)

# Step 1 of our tuning
evotree_step1 = [  range(evotree_model_base, :max_depth; lower=3, upper=10, unit=1),
                            range(evotree_model_base, :min_weight; lower=1, upper=6, unit=1)   ]
tuned_evotree_model = TunedModel(model=evotree_model_base,
                                            tuning=rnd_search,
                                            resampling=StrCV,
                                            repeats=3,
                                            # n=5,
                                            range=evotree_step1,
                                            measure=probabilistic_accuracy,
                                            acceleration=CPUThreads()
                                            )
tuned_evotree = machine(tuned_evotree_model, X, y)

println(stdout, "EvoTrees: Step 1")
@time fit!(tuned_evotree, verbosity=1, rows=train_indexes)

savefig(plot(tuned_evotree), "HeatMap_EvoTrees_Step1.png")
evotree_model = fitted_params(tuned_evotree).best_model

supports_online

Is there a built-in API for adding data online? Or is there a plan to support the MLJ supports_online trait?

Categorical and mixed features types

Support non-one-hot-encoded categorical features: features carrying item info as an Int (1 to N levels).
Consider changing from a Matrix to a DataFrames input structure to handle mixed features.
Consider supporting a mix of input structures (DataFrames + SparseMatrix) for efficient handling of a mixture of dense (continuous and categorical) and sparse features.

Unable to plot using the README example

First, thanks for the package - super user-friendly and blazing fast.

I'm trying to plot the tree as in the tutorial, and it doesn't seem to work. If I do

plot(model)

It gives me a single box with the bias (which is cool). But if I try something like

plot(model, 2)

what I get is a stack trace (which is less cool):

julia> plot(model, 3)
ERROR: type UnionAll has no field layout
Stacktrace:
  [1] getproperty(x::Type, f::Symbol)
    @ Base ./Base.jl:28
  [2] macro expansion
    @ ~/.julia/packages/EvoTrees/pYJaO/src/plot.jl:108 [inlined]
  [3] apply_recipe(plotattributes::AbstractDict{Symbol, Any}, model::EvoTrees.GBTree, n::Any, var_names::Any)
    @ EvoTrees ~/.julia/packages/RecipesBase/3fzVq/src/RecipesBase.jl:283
  [4] apply_recipe(plotattributes::AbstractDict{Symbol, Any}, model::EvoTrees.GBTree, n::Any)
    @ EvoTrees ~/.julia/packages/RecipesBase/3fzVq/src/RecipesBase.jl:277
  [5] _process_userrecipes!(plt::Any, plotattributes::Any, args::Any)
    @ RecipesPipeline ~/.julia/packages/RecipesPipeline/Bxu2O/src/user_recipe.jl:36
  [6] recipe_pipeline!(plt::Any, plotattributes::Any, args::Any)
    @ RecipesPipeline ~/.julia/packages/RecipesPipeline/Bxu2O/src/RecipesPipeline.jl:70
  [7] _plot!(plt::Plots.Plot, plotattributes::Any, args::Any)
    @ Plots ~/.julia/packages/Plots/5kcBO/src/plot.jl:208
  [8] plot(::Any, ::Vararg{Any, N} where N; kw::Any)
    @ Plots ~/.julia/packages/Plots/5kcBO/src/plot.jl:91
  [9] plot(::Any, ::Any)
    @ Plots ~/.julia/packages/Plots/5kcBO/src/plot.jl:85
 [10] top-level scope
    @ REPL[134]:1

The issue is in the call to Buchheim, and if I replace it by

tree_layout = length(adj) == 1 ? [[0.0,0.0]] : NetworkLayout.Buchheim()(adj)

it works - I can submit a PR if you want.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

indexing using GPU

I took the code from the README. If you use the GPU, it doesn't work once scalar indexing is disallowed.

using EvoTrees
using EvoTrees: sigmoid, logit
using StatsBase: sample

# prepare a dataset
features = rand(10000*20) .* 20 .- 10
X = reshape(features, (10000, 20))
Y = sin.(features) .* 0.5 .+ 0.5
Y = logit(Y) + randn(size(Y))
Y = sigmoid(Y)
i = collect(1:size(X, 1))

# train-eval split
i_sample = sample(i, size(i, 1), replace = false)
train_size = 0.8
i_train = i_sample[1:floor(Int, train_size * size(i, 1))]
i_eval = i_sample[floor(Int, train_size * size(i, 1))+1:end]

X_train, X_eval = X[i_train, :], X[i_eval, :]
Y_train, Y_eval = Y[i_train], Y[i_eval]

params1 = EvoTreeRegressor(
    loss=:linear, metric=:mse,
    nrounds=100, nbins = 100,
    λ = 0.5, γ=0.1, η=0.1,
    max_depth = 6, min_weight = 1.0,
    rowsample=0.5, colsample=1.0, device="gpu")

using CUDA
CUDA.allowscalar(false)
@time model = fit_evotree(params1, X_train, Y_train, X_eval = X_eval, Y_eval = Y_eval, print_every_n = 25)

Unable to reproduce plot from Readme tutorial

Hi and thanks for this package :)

I was trying to plot my decision tree using the plot(model, n) method outlined at the end of the Readme tutorial, but got the following error:

ERROR: Cannot convert EvoTrees.GBTree{1,Float32,Int64} to series data for plotting.

I came back to this repo and copy-pasted all the code from the Readme file, and still got the same error, so it should be fairly easy to replicate.

I am using EvoTrees v0.5.3, Plots v1.6.12 and Julia v1.4.2 on VSCode v1.52.1.

GPU saved model not possible open without CUDA

Hello,

I'm not sure if I missed something, but I get an error when opening a model file that was built with the GPU on a non-CUDA machine.

I am using Julia 1.8.2, EvoTrees 0.12.3, and Windows 10 on both machines.
On one machine I built the model with the GPU (an NVIDIA card, with CUDA.jl and the CUDA toolkit installed).
The model is saved with the JLSO package.

Later, when I try to open this model on the other machine, which has no CUDA toolkit, I get this error:

[warn | JLSO]: Could not find the CUDA driver library. Please make sure you have installed the NVIDIA driver for your GPU.
If you're sure it's installed, look for nvcuda.dll in your system and make sure it's discoverable by the linker.
Typically, that involves adding an entry to PATH.

There is of course a workaround: don't use the GPU.
But it would be nice to have what, e.g., Flux.jl has: cpu(model) <--> gpu(model) to "translate" a model from the "GPU world" to the "CPU-only world".

A typical use case is to build the model on a powerful GPU machine and then use it for the prediction phase on a regular, weaker machine.
Is there something I missed? Can I use a model built with the GPU on a non-CUDA machine?

Cheers, Lubo

Add a seed parameter

Add a seed parameter to provide reproducibility when sampling observations and features.

Inconsistencies in input scitype declarations and recently added document strings for MLJ API

In preparing PR #158 I discovered that the new document strings for each model include this section:


Training model

In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X, y) where

  • X: any table of input features (eg, a DataFrame) whose columns
    each have one of the following element scitypes: Continuous,
    Count, or <:OrderedFactor; check column scitypes with schema(X)

However, this is at odds with the input_scitype declarations, which allow tables and vectors but require all columns to be Continuous.

My guess is that the requirement stated in the doc-string is what actually works, and it is just the input_scitype declarations that need updating. @jeremiedb Can you confirm?

Support for Julia 1.3

Wondering if there is a good reason this was dropped?

This is causing a minor issue, perhaps related to #64, in that the MLJ Model registry is still generated using julia 1.3. This means that the model metadata that goes into the registry for EvoTrees.jl is from the last version supporting julia 1.3, which excludes, for example, the fact that iteration_parameter is now :rounds instead of nothing.

There may be another solution which involves generating the registry using a later julia version, but I was trying to delay this until the next LTS is announced.

How to train over input that is >> larger than RAM?

I wonder if there's a way to iteratively train over chunks of input data (or even row by row), manually. We deal with data much larger than RAM that also doesn't fit the table interface; in short, each "row" can contain many variables, some of which are vectors of unfixed length, so we need to compute the input to EvoTrees on the fly.

Missing values (or NaN) compatibility

How can an EvoTrees model be made to handle data with missing values? Is there any solution that makes EvoTrees compatible with missing values (or NaN), as the XGBoost model is?

Merge with MemoryConstrainedTreeBoosting.jl?

Maybe we can merge projects?

EvoTrees:

  • More loss functions supported
  • Cleaner code
  • Working on integrating with Julia ecosystem

MemoryConstrainedTreeBoosting:

  • 4-5x faster on CPU
  • Early stopping

MemoryConstrainedTreeBoosting.jl is a library I've been working on for a couple years that would allow me to control the loading and binning of data, so I could (a) do feature engineering in Julia and (b) use all the memory on my machine for the binned data. I've also spent a lot of time on speed because 10% faster ≈ training done 1 day sooner for my data sets. I think it is quite fast. I didn't bother documenting the library until today, however.

With our powers combined...!

A benchmark below. 4 threads on my 2013 quad-core i7-4960HQ.


 pkg> add https://github.com/brianhempel/MemoryConstrainedTreeBoosting.jl
using Statistics
using StatsBase:sample
using Revise
using EvoTrees

nrounds = 200

# EvoTrees params
params_evo = EvoTreeRegressor(T=Float32,
        loss=:logistic, metric=:logloss,
        nrounds=nrounds,
        λ=0.5, γ=0.0, η=0.05,
        max_depth=6, min_weight=1.0,
        rowsample=1.0, colsample=0.5, nbins=64)

# MemoryConstrainedTreeBoosting params
params_mctb = (
        weights                 = nothing,
        bin_count               = 64,
        iteration_count         = nrounds,
        min_data_weight_in_leaf = 1.0,
        l2_regularization       = 0.5,
        max_leaves              = 32,
        max_depth               = 6,
        max_delta_score         = 1.0e10, # Before shrinkage.
        learning_rate           = 0.05,
        feature_fraction        = 0.5, # Per tree.
        bagging_temperature     = 0.0,
      )

nobs = Int(1e6)
num_feat = Int(100)
@info "testing with: $nobs observations | $num_feat features."
X = rand(Float32, nobs, num_feat)
Y = Float32.(rand(Bool, size(X, 1)))


@info "evotrees train CPU:"
params_evo.device = "cpu"
@time m_evo = fit_evotree(params_evo, X, Y);
@time fit_evotree(params_evo, X, Y);
@info "evotrees predict CPU:"
@time pred_evo = EvoTrees.predict(m_evo, X);
@time EvoTrees.predict(m_evo, X);


import MemoryConstrainedTreeBoosting

@info "MemoryConstrainedTreeBoosting train CPU:"
@time bin_splits, trees = MemoryConstrainedTreeBoosting.train(X, Y; params_mctb...);
@time MemoryConstrainedTreeBoosting.train(X, Y; params_mctb...);
@info "MemoryConstrainedTreeBoosting predict CPU, JITed:"
save_path = tempname()
MemoryConstrainedTreeBoosting.save(save_path, bin_splits, trees)
unbinned_predict = MemoryConstrainedTreeBoosting.load_unbinned_predictor(save_path)
@time pred_mctb = unbinned_predict(X)
@time unbinned_predict(X)
$ JULIA_NUM_THREADS=4 julia --project=. experiments/benchmarks_v2.jl
[ Info: testing with: 1000000 observations | 100 features.
[ Info: evotrees train CPU:
 98.929771 seconds (64.89 M allocations: 21.928 GiB, 2.12% gc time)
 83.160324 seconds (187.35 k allocations: 18.400 GiB, 1.69% gc time)
[ Info: evotrees predict CPU:
  2.458015 seconds (4.50 M allocations: 246.320 MiB, 38.75% compilation time)
  1.598223 seconds (4.59 k allocations: 4.142 MiB)
[ Info: MemoryConstrainedTreeBoosting train CPU:
  20.320708 seconds (16.04 M allocations: 2.480 GiB, 1.48% gc time, 0.01% compilation time)
  15.954224 seconds (3.10 M allocations: 1.714 GiB, 2.66% gc time)
[ Info: MemoryConstrainedTreeBoosting predict CPU, JITed:
 14.364365 seconds (11.80 M allocations: 692.582 MiB, 25.95% compilation time)
  0.778851 seconds (40 allocations: 30.520 MiB)

MLJModelInterface.fit does not accept tables?

Hello,

Thank you for the work here!

Apologies if this is not the right place for the following question. As I understand it, the MLJModelInterface.fit method for EvoTypes does not accept general tables (the machine interface works well because it calls the reformat function beforehand):

using EvoTrees
using MLJBase

n = 100
X = MLJBase.table(rand(n, 3))
y = rand(n)

evo = EvoTreeRegressor()
MLJBase.fit(evo, 1, X, y)

From the MLJ docs I thought that should be the case, or am I understanding it wrong?
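
For reference, the pattern the machine interface uses internally, and a possible workaround here, is to call reformat explicitly before fit (a short sketch, mirroring the approach used in the update issue below):

using MLJModelInterface
const MMI = MLJModelInterface

# Convert the table into the internal (matrix) representation fit expects.
data = MMI.reformat(evo, X, y)
MLJBase.fit(evo, 1, data...)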

Bug in `MLJModelInterface.update`

Ran into a bug when trying to tune an EvoTreeRegressor in MLJ. Isolated it to this:

using MLJ
using EvoTrees
using MLJModelInterface
const MMI = MLJModelInterface

X, y = @load_boston
model = (@load EvoTreeRegressor)()
data = MMI.reformat(model, X, y)
f, c, r = MMI.fit(model, 2, data...);
model.λ = 0.1

julia> MMI.update(model, 2, f, c, data...);
ERROR: ArgumentError: Function `matrix` only supports AbstractMatrix or containers implementing the Tables interface.
Stacktrace:
 [1] matrix(::FullInterface, ::Val{:other}, X::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}}; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
 [2] matrix(::FullInterface, ::Val{:other}, X::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}})
   @ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
 [3] matrix(X::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}}; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27
 [4] matrix
   @ ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27 [inlined]
 [5] reformat(#unused#::EvoTrees.EvoTreeRegressor{Float64, EvoTrees.Linear, Int64}, X::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}}, y::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
   @ EvoTrees ~/.julia/packages/EvoTrees/oHLKA/src/MLJ.jl:24
 [6] update(model::EvoTrees.EvoTreeRegressor{Float64, EvoTrees.Linear, Int64}, verbosity::Int64, fitresult::EvoTrees.GBTree{1, Float64, Int64}, cache::NamedTuple{(:params, :X, :Y_cpu, :pred_cpu, :𝑖_, :𝑗_, :𝑖, :𝑗, :δ, :δ², :𝑤, :edges, :X_bin, :train_nodes, :splits, :hist_δ, :hist_δ², :hist_𝑤), Tuple{EvoTrees.EvoTreeRegressor{Float64, EvoTrees.Linear, Int64}, Matrix{Float64}, Vector{Float64}, Vector{StaticArrays.SVector{1, Float64}}, Vector{Int64}, Vector{Int64}, Vector{Int64}, Vector{Int64}, Vector{StaticArrays.SVector{1, Float64}}, Vector{StaticArrays.SVector{1, Float64}}, Vector{StaticArrays.SVector{1, Float64}}, Vector{Vector{Float64}}, Matrix{UInt8}, Vector{EvoTrees.TrainNode{1, Float64, Int64}}, Vector{EvoTrees.SplitInfo{1, Float64, Int64}}, Vector{Matrix{StaticArrays.SVector{1, Float64}}}, Vector{Matrix{StaticArrays.SVector{1, Float64}}}, Vector{Matrix{StaticArrays.SVector{1, Float64}}}}}, A::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}}, y::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
   @ EvoTrees ~/.julia/packages/EvoTrees/oHLKA/src/MLJ.jl:39
 [7] top-level scope
   @ REPL[183]:1

Happy to look into this.

GPU for classifiers

This feels like a deal breaker for people trying to migrate from XGBoost to EvoTrees.

What's the roadblock?

EvoTree's predict doesn't return a valid probability when max_depth is relatively high

The last line of the following block throws an exception

using Pkg;
Pkg.add(["DataFrames", "CSV", "TabularDisplay", "CategoricalArrays"]);
using DataFrames, CSV, TabularDisplay, CategoricalArrays
Pkg.add(["MLJ", "EvoTrees", "MLJScientificTypes"])
using MLJ, EvoTrees, MLJScientificTypes


num_cols = [
    "ClientPeriod",
    "MonthlySpending",
    "TotalSpent"
];



cat_cols = [
    "Sex",
    "IsSeniorCitizen",
    "HasPartner",
    "HasChild",
    "HasPhoneService",
    "HasMultiplePhoneNumbers",
    "HasInternetService",
    "HasOnlineSecurityService",
    "HasOnlineBackup",
    "HasDeviceProtection",
    "HasTechSupportAccess",
    "HasOnlineTV",
    "HasMovieSubscription",
    "HasContractPhone",
    "IsBillingPaperless",
    "PaymentMethod"
];
all_feature_cols = [num_cols; cat_cols];
target_col = "Churn";

#,types=Dict("Sex"=>CategoricalValue{String, UInt32})
# + tags=[]
df = DataFrame!(CSV.File("./train.csv",pool=0.1, missingstrings=[" "]))
categorical!(df,[cat_cols;target_col]);
describe(df,:eltype,:nunique, :nmissing)
dropmissing!(df);
describe(df,:eltype,:nunique, :nmissing)

X = df[!, all_feature_cols];
y = df[!,target_col];

mach_x = machine(ContinuousEncoder(), X)
fit!(mach_x)
X = MLJ.transform(mach_x, X)

tree_model = EvoTreeClassifier(max_depth=6, nrounds=2000,colsample=0.3)
mach = machine(tree_model, X, y)

train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split

fit!(mach, rows=train, verbosity=1)
pred_test = MLJ.predict(mach, selectrows(X, test))

The problem seems to be related to JuliaAI/MLJBase.jl#525

Here is the exception.

DomainError with Probabilities must be in [0,1].:


Stacktrace:
  [1] _err_01()
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:42
  [2] _check_probs_01(probs::Vector{Float32})
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:66
  [3] _broadcast_getindex_evalf
    @ ./broadcast.jl:648 [inlined]
  [4] _broadcast_getindex
    @ ./broadcast.jl:621 [inlined]
  [5] getindex
    @ ./broadcast.jl:575 [inlined]
  [6] copy
    @ ./broadcast.jl:922 [inlined]
  [7] materialize
    @ ./broadcast.jl:883 [inlined]
  [8] UnivariateFinite(::MLJModelInterface.FullInterface, prob_given_class::OrderedCollections.LittleDict{CategoricalValue{Int64, UInt8}, AbstractVector{Float32}, Vector{CategoricalValue{Int64, UInt8}}, Vector{AbstractVector{Float32}}}; kwargs::Base.Iterators.Pairs{Symbol, Union{Missing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:pool, :ordered), Tuple{Missing, Bool}}})
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:127
  [9] _UnivariateFinite(support::CategoricalVector{Int64, UInt8, Int64, CategoricalValue{Int64, UInt8}, Union{}}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}, N::Int64; augment::Bool, kwargs::Base.Iterators.Pairs{Symbol, Union{Missing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:pool, :ordered), Tuple{Missing, Bool}}})
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:245
 [10] _UnivariateFinite(support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}, N::Int64; augment::Bool, pool::Missing, ordered::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:287
 [11] #_UnivariateFinite#37
    @ ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:308 [inlined]
 [12] UnivariateFinite(::MLJModelInterface.FullInterface, support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}; kwargs::Base.Iterators.Pairs{Symbol, Missing, Tuple{Symbol}, NamedTuple{(:pool,), Tuple{Missing}}})
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:212
 [13] UnivariateFinite(support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}; kwargs::Base.Iterators.Pairs{Symbol, Missing, Tuple{Symbol}, NamedTuple{(:pool,), Tuple{Missing}}})
    @ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:431
 [14] predict(#unused#::EvoTreeClassifier{Float32, EvoTrees.Softmax, Int64}, fitresult::EvoTrees.GBTree{2, Float32, Int64}, A::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}})
    @ EvoTrees ~/.julia/packages/EvoTrees/L5jFX/src/MLJ.jl:56
 [15] predict(mach::Machine{EvoTreeClassifier{Float32, EvoTrees.Softmax, Int64}, true}, Xraw::DataFrame)
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/operations.jl:83
 [16] top-level scope
    @ In[22]:1
 [17] eval
    @ ./boot.jl:360 [inlined]
 [18] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

This behavior creates a real problem when doing a hyper-parameter search as per https://alan-turing-institute.github.io/MLJ.jl/stable/#Lightning-tour-1

The data is attached
train.zip

Update MLJ feature importance access to current standard

Currently feature importances from EvoTrees are accessed via report(). We should update the interface to comply with the new method defined here: https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Feature-importances. In particular I suggest we do the following:

  • remove automatically computed feature importances from report()
  • dispatch MLJModelInterface.feature_importances(model::M, fitresult, report) to compute feature importances as desired
  • dispatch MLJModelInterface.reports_feature_importances(::Type{<:M}) = true for each EvoTrees model so MLJ knows we can access feature importances.
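
A minimal sketch of the suggested dispatches, shown for the regressor only (illustrative; the report field name feature_importances is hypothetical):

using MLJModelInterface
const MMI = MLJModelInterface

# Tell MLJ that this model type exposes feature importances.
MMI.reports_feature_importances(::Type{<:EvoTreeRegressor}) = true

# Return importances as a vector of Symbol => value pairs, per the MLJ docs.
function MMI.feature_importances(::EvoTreeRegressor, fitresult, report)
    return [Symbol(name) => gain for (name, gain) in report.feature_importances]
end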

MLJ model registry not catching new linear model

@jeremiedb It seems the new model does not have MLJModelInterface.Model as a supertype:

using MLJModels
using EvoTrees
julia> ms = MLJModels.finaltypes(MLJModels.Model);

julia> filter(ms) do m
           Base.parentmodule(m) == EvoTrees
       end
4-element Vector{Type}:
 EvoTreeRegressor
 EvoTreeClassifier
 EvoTreeCount
 EvoTreeGaussian

MLJ interface does not see `package_name` for `EvoSplineRegressor`

EvoSplineRegressor sounds like a new model. I just updated the MLJ model registry and it appeared as an "orphan" because the package_name is "unknown":

julia> info("EvoSplineRegressor")
(name = "EvoSplineRegressor",
 package_name = "unknown",
 is_supervised = true,
 abstract_type = MLJModelInterface.Deterministic,
 deep_properties = (),
 docstring = "```\nEvoSplineRegressor(; kwargs...)\n```\n\nA model t...",
 fit_data_scitype =
     Tuple{Union{ScientificTypesBase.Table{<:Union{AbstractVector{<:ScientificTypesBase.Continuous}, AbstractVector{<:ScientificTypesBase.Count}, AbstractVector{<:ScientificTypesBase.OrderedFactor}}}, AbstractMatrix{ScientificTypesBase.Continuous}}, AbstractVector{<:ScientificTypesBase.Continuous}},
 human_name = "evo spline regressor",
 hyperparameter_ranges = (nothing,
                          nothing,
                          nothing,
                          nothing,
                          nothing,
                          nothing,
                          nothing,
                          nothing,
                          nothing),
 hyperparameter_types = ("Int64",
                         "Symbol",
                         "Int64",
                         "Symbol",
                         "Any",
                         "Any",
                         "Union{Nothing, Dict}",
                         "Any",
                         "Symbol"),
 hyperparameters =
     (:nrounds, :opt, :batchsize, :act, :eta, :L2, :knots, :rng, :device),
 implemented_methods = [:fit, :predict, :update],
 inverse_transform_scitype = ScientificTypesBase.Unknown,
 is_pure_julia = false,
 is_wrapper = false,
 iteration_parameter = :nrounds,
 load_path = "EvoLinear.EvoSplineRegressor",
 package_license = "unknown",
 package_url = "unknown",
 package_uuid = "unknown",
 predict_scitype = AbstractVector{<:ScientificTypesBase.Continuous},
 prediction_type = :deterministic,
 reporting_operations = (),
 reports_feature_importances = false,
 supports_class_weights = false,
 supports_online = false,
 supports_training_losses = false,
 supports_weights = false,
 transform_scitype = ScientificTypesBase.Unknown,
 input_scitype =
     Union{ScientificTypesBase.Table{<:Union{AbstractVector{<:ScientificTypesBase.Continuous}, AbstractVector{<:ScientificTypesBase.Count}, AbstractVector{<:ScientificTypesBase.OrderedFactor}}}, AbstractMatrix{ScientificTypesBase.Continuous}},
 target_scitype = AbstractVector{<:ScientificTypesBase.Continuous},
 output_scitype = ScientificTypesBase.Unknown)

Maybe this issue should be transferred to EvoLinear.jl.

MLJ facing regressor is predicting 1 x n matrix instead of vector

julia> rgs = (@load EvoTreeRegressor)();
[ Info: For silent loading, specify `verbosity=0`. 
import EvoTrees ✔

julia> X, y = make_regression();

julia> mach = machine(rgs, X, y) |> fit!
[ Info: Training Machine{EvoTreeRegressor{Float64,…},…} @083.
Machine{EvoTreeRegressor{Float64,…},…} @083 trained 1 time; caches data
  args: 
    1:	Source @339 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @776 ⏎ `AbstractVector{Continuous}`


julia> typeof(predict(mach, rows=1:3))
Matrix{Float64} (alias for Array{Float64, 2})

julia> predict(mach, rows=1:3)
3×1 Matrix{Float64}:
  0.1765737240681419
 -0.31087790314425967
  0.4214986666800992

The MLJ API requires that predict return the same kind of object as y, i.e., a vector. In particular, the current implementation prevents any kind of evaluation of the model:

julia> rms(predict(mach, rows=:), y)
ERROR: MethodError: no method matching (::RootMeanSquaredError)(::Matrix{Float64}, ::Vector{Float64})
Closest candidates are:
  (::RootMeanSquaredError)(::AbstractVector{var"#s1028"} where var"#s1028"<:Real, ::AbstractVector{var"#s1027"} where var"#s1027"<:Real) at /Users/anthony/.julia/packages/MLJBase/j0qGA/src/measures/continuous.jl:78
  (::RootMeanSquaredError)(::AbstractVector{var"#s1028"} where var"#s1028"<:Real, ::AbstractVector{var"#s1027"} where var"#s1027"<:Real, ::AbstractVector{var"#s1026"} where var"#s1026"<:Real) at /Users/anthony/.julia/packages/MLJBase/j0qGA/src/measures/continuous.jl:88
Stacktrace:
 [1] top-level scope
   @ REPL[21]:1

cc @jeremiedb @lhnguyen-vn

This is holding up: JuliaAI/DataScienceTutorials.jl#162

Getting a warning when using multiple features

using EvoTrees
using EvoTrees: sigmoid, logit
using MLJBase
using RDatasets

iris = dataset("datasets", "iris")
iris[!, :is_setosa] = iris[!, :Species] .== "setosa"
features = setdiff(names(iris), ["Species", "is_setosa"])

Y, X, _ = unpack(iris, ==(:is_setosa), in(Symbol.(features)), colname -> true)
train, test = partition(eachindex(Y), 0.7, shuffle=true); # 70:30 split
tree_model = EvoTreeClassifier(
    loss=:linear, metric=:mse,
    nrounds=100, nbins = 100,
    λ = 0.5, γ=0.1, η=0.1,
    max_depth = 6, min_weight = 1.0,
    rowsample=0.5, colsample=1.0)
mach = machine(tree_model, X, Y)

results in this warning

Warning: The number and/or types of data arguments do not match what the specified model supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc EvoTreeClassifier{Float64, EvoTrees.Softmax, Int64}` to learn more about your model's requirements.
│ Commonly, but non exclusively, supervised models are constructed using the syntax `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are constructed with `machine(model, X)`. Here `X` are features, `y` a target, and `w` sample or class weights.

Why could this be happening?

Create MLJ-compliant doc strings

@jeremiedb Saw your recent doc PR and thought I'd flag this. I understand you may not have the bandwidth just now, but you should be aware that there is now an "official" format for these strings, so you don't work against it unwittingly.

Raise lower bound: [compat] MLJModelInterface = "^0.3"

This will allow your classifier to buy into this performance improvement. Apart from the [compat] update, the only other step should be to replace the following line

return [MLJModelInterface.UnivariateFinite(fitresult.levels, pred[i,:]) for i in 1:size(pred,1)]

with

    return MLJModelInterface.UnivariateFinite(fitresult.levels, pred)

which returns an instance of an abstract vector of eltype <: UnivariateFinite instead of a vanilla vector of the same eltype. You should not need to change any tests - this object should have all the same behaviour as the old one.

FYI: I am releasing today an MLJModels update which will also have MLJModelInterface 0.3 as a lower bound (and performance buy-in for all the classifiers with implementations there).

Let me know if you have questions.

Support new MLJ iteration interface

Training losses don't seem to be exposed in the MLJ interface. This would be nice. I'm close to finishing up an iterative control wrapper for MLJ models, and if models report training losses, one can apply "training progress"-based stopping criteria, such as PQ.

how to free gpu memory after training with MLJ interface

running

mach = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y, cache=false)
mach1 = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y1, cache=false)
mach2 = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y2, cache=false)
...

can add several GB to GPU memory pool usage after each line is run. Is it possible to free everything used during GPU training, since I only need the CPU for prediction?
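
One pattern that sometimes helps between runs, using plain CUDA.jl calls rather than anything EvoTrees-specific (a suggestion, not a guaranteed fix):

using CUDA

mach = nothing      # drop references to the machine holding GPU-backed buffers
GC.gc(true)         # force a full garbage collection so those buffers become collectable
CUDA.reclaim()      # ask CUDA.jl's memory pool to return freed blocks to the driver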

Verbosity vs Initial tracking info

Hello,

running Julia 1.8.2, EvoTrees 0.12.2

In fit_evotree, shouldn't the printing of the "Initial tracking info" be hidden behind an "if verbosity > 0" check?

Cheers
Lubo


method error with importance()

I have two models stacked, one of which is an ensemble of EvoTreeClassifier models, and I am working with MLJ.

I want to know the feature importance according to the ensemble of EvoTreeClassifiers, but I am getting the following error:

julia> f = fitted_params(tuned_stack)
julia> features_gain = importance(f.best_fitted_params.model2.fitresult, wvn_table)
ERROR: MethodError: no method matching importance(::MLJ.WrappedEnsemble{EvoTrees.GBTree{2,Float32,Int64},EvoTreeClassifier{Float32,EvoTrees.Softmax,Int64}}, ::Array{Any,1})
Closest candidates are:
  importance(::EvoTrees.GBTree, ::AbstractArray{T,1} where T) 
at C:\Users\ivica\.julia\packages\EvoTrees\9NdxH\src\importance.jl:12

For clarity, the output of f.best_fitted_params.model2.fitresult is

julia> f.best_fitted_params.model2.fitresult
WrappedEnsemble{GBTree{,},} @546

I understand that I have to feed an object of type EvoTrees.GBTree to the importance() function. How can I do so when my model is an ensemble of EvoTrees.GBTrees?
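
A hedged sketch of one way to do it, assuming (this is an assumption about MLJ internals, not a documented API) that the WrappedEnsemble stores the individual fitted GBTrees in an ensemble field: compute per-model importances and average them.

using Statistics

wrapped = f.best_fitted_params.model2.fitresult

# Average gain-based importances over the fitted GBTrees of the ensemble.
per_model = [Dict(importance(gbt, wvn_table)) for gbt in wrapped.ensemble]
avg_gain  = Dict(k => mean(get(d, k, 0.0) for d in per_model) for k in keys(first(per_model)))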

Perf improvements

As discussed in #100

  • Identify the largest child node as the one obtained by histogram subtraction, rather than always subtracting to get the right child (see the sketch below)
  • Threaded gradient update & eval
  • Improved binning pre-processing
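
For the first bullet, a self-contained toy illustration of the histogram-subtraction idea (purely illustrative; this is not EvoTrees' internal code): a parent's histogram equals the sum of its two children's histograms, so only one child needs a fresh accumulation pass.

# Toy illustration of histogram subtraction on one binned feature.
nbins = 4
x = rand(1:nbins, 100)                        # binned feature values
g = randn(100)                                # per-row gradients
left = x .<= 2                                # rows routed to the left child

hist_parent = [sum(g[x .== b]) for b in 1:nbins]
hist_left   = [sum(g[left .& (x .== b)]) for b in 1:nbins]
hist_right  = hist_parent .- hist_left        # no second pass over the right child's rows

Choosing the larger child as the one obtained by subtraction minimizes the number of rows that actually have to be scanned.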

Feature request: Terminal leaves (Cubist)

Hi and thank you for your package!
I love how flexible it is.

I've had good results with Cubist models, in which terminal leaves contain linear regression models as opposed to simple averages.

Would it be hard to include this type of option in EvoTrees?
