modeloriented / dalex Goto Github PK

moDel Agnostic Language for Exploration and eXplanation

License: GNU General Public License v3.0

R 25.24% Python 74.76%

machine-learning interpretability data-science xai iml model-visualization dalex explanations explainable-ai explainable-artificial-intelligence

dalex's Introduction

moDel Agnostic Language for Exploration and eXplanation

Overview

Unverified black box model is the path to the failure. Opaqueness leads to distrust. Distrust leads to ignoration. Ignoration leads to rejection.

The DALEX package xrays any model and helps to explore and explain its behaviour, helps to understand how complex models are working. The main function explain() creates a wrapper around a predictive model. Wrapped models may then be explored and compared with a collection of local and global explainers. Recent developents from the area of Interpretable Machine Learning/eXplainable Artificial Intelligence.

The philosophy behind DALEX explanations is described in the Explanatory Model Analysis e-book. The DALEX package is a part of DrWhy.AI universe.

If you work with scikit-learn, keras, H2O, tidymodels, xgboost, mlr or mlr3 in R, you may be interested in the DALEXtra package, which is an extension of DALEX with easy to use explain_*() functions for models created in these libraries.

Additional overview of the dalex Python package is available.

Installation

The DALEX R package can be installed from CRAN

install.packages("DALEX")

The dalex Python package is available on PyPI and conda-forge

pip install dalex -U

conda install -c conda-forge dalex

Learn more

Machine Learning models are widely used and have various applications in classification or regression tasks. Due to increasing computational power, availability of new data sources and new methods, ML models are more and more complex. Models created with techniques like boosting, bagging of neural networks are true black boxes. It is hard to trace the link between input variables and model outcomes. They are use because of high performance, but lack of interpretability is one of their weakest sides.

In many applications we need to know, understand or prove how input variables are used in the model and what impact do they have on final model prediction. DALEX is a set of tools that help to understand how complex models are working.

Resources

Gentle introduction to DALEX with examples in R and Python

R package

Introduction to Responsible Machine Learning @ useR! 2021
DALEX + mlr3 @ BioColl 2021 & @ Open-Forest-Training 2021
Materials from Explanatory Model Analysis Workshop @ eRum 2020, cheatsheet
How to use DALEX with: keras, parsnip, caret, mlr, H2O, xgboost
Compare GBM models created in different languages: gbm and CatBoost in R / gbm in h2o / gbm in Python
DALEX for fraud detection
DALEX for teaching
XAI in the jungle of competing frameworks for machine learning

Python package

Introduction to the dalex package: Titanic: tutorial and examples
Key features explained: FIFA20: explain default vs tuned model with dalex
How to use dalex with: xgboost, tensorflow
More explanations: residuals, shap, lime
Introduction to the Fairness module in dalex
Introduction to the Arena: interactive dashboard for model exploration
Code in the form of jupyter notebook
Changelog: NEWS

Talks about DALEX

Citation

If you use DALEX in R or dalex in Python, please cite our JMLR papers:

@article{JMLR:v19:18-416,
  author  = {Przemyslaw Biecek},
  title   = {DALEX: Explainers for Complex Predictive Models in R},
  journal = {Journal of Machine Learning Research},
  year    = {2018},
  volume  = {19},
  number  = {84},
  pages   = {1-5},
  url     = {http://jmlr.org/papers/v19/18-416.html}
}

@article{JMLR:v22:20-1473,
  author  = {Hubert Baniecki and
             Wojciech Kretowicz and
             Piotr Piatyszek and 
             Jakub Wisniewski and 
             Przemyslaw Biecek},
  title   = {dalex: Responsible Machine Learning 
             with Interactive Explainability and Fairness in Python},
  journal = {Journal of Machine Learning Research},
  year    = {2021},
  volume  = {22},
  number  = {214},
  pages   = {1-7},
  url     = {http://jmlr.org/papers/v22/20-1473.html}
}

Why

76 years ago Isaac Asimov devised Three Laws of Robotics: 1) a robot may not injure a human being, 2) a robot must obey the orders given it by human beings and 3) A robot must protect its own existence. These laws impact discussion around Ethics of AI. Today’s robots, like cleaning robots, robotic pets or autonomous cars are far from being conscious enough to be under Asimov’s ethics.

Today we are surrounded by complex predictive algorithms used for decision making. Machine learning models are used in health care, politics, education, judiciary and many other areas. Black box predictive models have far larger influence on our lives than physical robots. Yet, applications of such models are left unregulated despite many examples of their potential harmfulness. See Weapons of Math Destruction by Cathy O'Neil for an excellent overview of potential problems.

It's clear that we need to control algorithms that may affect us. Such control is in our civic rights. Here we propose three requirements that any predictive model should fulfill.

Prediction's justifications. For every prediction of a model one should be able to understand which variables affect the prediction and how strongly. Variable attribution to final prediction.
Prediction's speculations. For every prediction of a model one should be able to understand how the model prediction would change if input variables were changed. Hypothesizing about what-if scenarios.
Prediction's validations For every prediction of a model one should be able to verify how strong are evidences that confirm this particular prediction.

There are two ways to comply with these requirements. One is to use only models that fulfill these conditions by design. White-box models like linear regression or decision trees. In many cases the price for transparency is lower performance. The other way is to use approximated explainers – techniques that find only approximated answers, but work for any black box model. Here we present such techniques.

Acknowledgments

Work on this package was financially supported by the NCN Opus grant 2016/21/B/ST6/02176 and NCN Opus grant 2017/27/B/ST6/01307.

dalex's People

Contributors

Stargazers

Watchers

Forkers

hemanth-sindhanuru benzei oppa3109 ms1948 kashenfelter aleksandradabrowska radovankavicky gapdata jeancroy nanaakwasiabayieboateng gse-cc-git baifengbai brainprint mtoto gridl micseb smartgamer nemochina2008 ledell vmeli aboubek vishalbelsare ml-lab kevinykuo day15 12tafran jmeller krystian8207 elinama arlugones vivek2319 sakampavankumar statmixedml bkbonde navdeep-g shubhampachori12110095 yushu-liu awesome-archive zhongxingpeng teng-gao szczybur imarcello minghao2016 stjordanis kmatusz wskwon cyklang mejihero crackend nkhuyu konradbachusz-zz mutual-ai justin2061 ywang021 chengjingfeng guokai8 gusbruschi13 haroon123 hubbucket-team rafrodriguez elfedorova terrynema04 agosiewska abenavs gakkilovemath sirnyls razielar dawidkopczyk cristian-dinu-69 bedantaguru wojciechkretowicz a3digit robertus100 sarahiaguilar bbqgonewrong justinmshea rchddeg satopan piotrpiatyszek pfraczek ankitshah009 kasiapekala jakwisn bramamoorthy serviolimareina maksymiuks milicanikolic yadevi sandy4321 nsood-ai rab657 sailfish009 yawomkobara migueldb 321hg shalevy1 minalspatil royalts mehrdad-moradi haiyuni

dalex's Issues

generic `predict()` for `explainer` object

It would be useful to have predict.explainer() function that calls predict_function() from explainer.

Adding support for h2o models?

Would DALEX is going to support h2o models?

Currently, H2OFrame is not supported when running single_variable.

Model.frame shouldn't be default training data

Hi.

If data parameter is not passed to explain function, it is extracted by default from model as model.frame(data) (if possible). The assumption here is that data should be training data used by model.
In some cases it's not true. Let's consider

model <- glm(log(qsec) ~ exp(drat) + hp, data = mtcars)

In here model.frame(model) stores transformed variables, ie.

colnames(model.frame(model)) == c('log(qsec)', 'exp(drat)', 'hp')

I think the best way is to use by default:

eval(stats::getCall(model)$data)

As it uses envir = parent.frame() by default it should source training data that was used in model call.

Error with single_prediction function: Error in UseMethod("broken")

When I try to run the vignette examples for the single_prediction() function I see the following error for the random forest model:

Error in UseMethod("broken") :
no applicable method for 'broken' applied to an object of class "c('randomForest.formula', 'randomForest')"

I have DALEX package version 0.1 and breakDown 0.1.3

Add travis support

create roadmap for DALEX developement

prepare a plan for next releases of DALEX
set some functions as deprecated
set goals to limit number of dependencies
define how DALEX will switch to ingredients and iBreakDown
add CONTRIBUTING.md file (maybe based on survxai)

New theme_drwhy

As in the ingredients and iBreakDown

Add a licence file please

Thank you for your excellent work.

I am currently thinking about using it / recommending it to clients / contributing to it, but the fact that no explicit licence is chosen in this repo yet makes that organisationally more difficult for me (and other companies) than it could be.

Clients also have issues (reasonable and unreasonable ones) with the GPL licence. So maybe it could be something else?

https://help.github.com/articles/adding-a-license-to-a-repository/

Best regards,
Frank

Constraints in the features contribution calculation?

Hi,

Is there a way to enforce constraints in the features contribution calculation resulting from prediction_breakdown , for example to enforce some features to have positive contribution?

Please note I'm already using monotonicity constraints in the xgboost training.

Thanks

Problems with variable_response for xgboost model when variable is a factor (subscripts out of bonds)

I got an error message when trying to extract the variable response for factor with a xgboost model.

library(DALEX)
library(breakDown)
library(xgboost)

data(HR_data)

model_matrix_train <- model.matrix(left ~ . - 1, HR_data)
data_train <- xgb.DMatrix(model_matrix_train, label = HR_data$left)
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2,
objective = "binary:logistic", eval_metric = "auc")

HR_xgb_model <- xgb.train(param, data_train, nrounds = 50)
HR_xgb_model

predict_logit <- function(model, x) {
raw_x <- predict(model, x)
exp(raw_x)/(1 + exp(raw_x))
}
logit <- function(x) exp(x)/(1+exp(x))

explainer_xgb <- explain(HR_xgb_model,
data = model_matrix_train,
y = HR_data$left,
predict_function = predict_logit,
link = logit,
label = "xgboost")
explainer_xgb

x_rv <- variable_response(explainer_xgb, variable = "salary", type = "factor")

x_rv <- variable_response(explainer_xgb, variable = "salary", type = "factor")
Error in explainer$data[, variable] : subscript out of bounds

CRAN version

0.2.4 will go to CRAN since the HR data is required for ceterisParibus plots.
Any other fixes should go with this version?

remove hard dependencies from pdp, ALEPlot and factorMerger

move these packages to Suggested

variable_response for a factor with 2 levels

I have a problem to understand a variable_response plot when the explanatory variable is a factor and has 2 levels.

library(DALEX)
library(carData)
library(randomForest)

data("Leinhardt")
df <- Leinhardt %>% select(infant, income, region, oil)
df <- na.omit(df)

rf2 <- randomForest(infant ~ income + region + oil, data = df)

rf2_exp <- DALEX::explain(rf2, data = df, y = df$infant, label = "rf")

rf2_rv <- variable_response(rf2_exp, variable = "oil", type = "factor")

plot(rf2_rv)

explain() does not work

I use the following code :
wineLmModel <- lm(quality ~ pH + residual.sugar + sulphates + alcohol, data = wine) wineLmExplainer <- explain(wineLmModel)

Error in UseMethod("explain") :
no applicable method for 'explain' applied to an object of class "lm"

Merging Path Plots Are too Slowly

when the variable type is categorical to plot use single_variable is become slowly !

Feature request: ALE Second Order Plots to investigate interactions

Hi,

Firstly, thank you very much for the package and the extensive tutorials.

In the library iml you are able to interrogate your model for interactions based on the amount of variance explained. I would be interested in being able to review an ALE plot in relation to these interactions with respect to the model output. I see in the vignette on page 10 for the package ALEPlot that they seem to have implemented this but I have been unsuccessful in getting it to function consistently with a 2 class classification problem using caret.

Are there plans to implement anything similar in DALEX

Thank you very much for your time

Old functions names in cheatsheets

There are: single_variable(), single_prediction(), and variable_dropout()
instead of variable_response(), prediction_breakdown(), and variable_importance().

higher dimensions for Y

DALEX should support cases in which the predict function returns more than a single column
Think about multi class classification of multivariate regression.
It should be handled in the same way as multiple models

n.trees error with variable_importance() on gbm model

I get an n.trees error when trying to use the variable_importance() function on an explainer for a gbm model created with the gbm package:

library(gbm)
library(DALEX)

mod <- gbm(m2.price~.,data = apartments, distribution = "gaussian")

exp.mod <- explain(mod, data = apartmentsTest[,2:6],
                   y = apartmentsTest$m2.price)

vi <- variable_importance(exp.mod, loss_function = loss_root_mean_square)

Error in paste("Using", n.trees, "trees...\n") : 
  argument "n.trees" is missing, with no default

vi <- variable_importance(exp.mod, loss_function = loss_root_mean_square, n.trees = 2000)

Error in paste("Using", n.trees, "trees...\n") : 
  argument "n.trees" is missing, with no default

Values in plots for variable_dropout are not sorted

In the plots for variable_dropout_explainers for multiple models, the variables are not sorted by the Drop-out loss value. This can be seen in the example from the cheatsheet.

Perhaps it would be worth sorting out them?
When a single model is plotted, the values are sorted.

Error in `[.data.frame`(new_observation, colnames(ny)) : undefined columns selected

I use a Mc Os with S.O. High Sierra: Version 10.13.3
Rstudio: Version 1.1.4.23
library(breakDown) Version 0.1.5
library(DALEX) Version 0.1.1

if I run your example:

library("breakDown")
new.wine <- data.frame(citric.acid = 0.35,sulphates = 0.6,alcohol = 12.5,pH = 3.36, residual.sugar = 4.8)
wine_lm_model4 <- lm(quality ~ pH + residual.sugar + sulphates + alcohol, data = wine)
wine_lm_explainer4 <- explain(wine_lm_model4, data = wine, label = "model_4v")
wine_lm_predict4 <- single_prediction(wine_lm_explainer4, observation = new.wine)
plot(wine_lm_predict4)
all works fine.

But if I use my data
....
a=read.xls('new_longitudinali.xls',sheet=2)
BG=as.numeric(a$Modello)
b=data.frame(a[2:12],BG)
attach(b)
new.b <- data.frame(ArGoMe <=129.6,
CoGo<=47.4,
GoGn <=72.3,
Nme <=104.5,
NSAr <= 121.5 ,
PPMP <= 27.1,
PPSN <= 8.7,
Sar <= 29.2,
SN <= 63.8,
SNA <= 79.6,
SNB <=79.9,
)
b_lm_model<- lm(BG ~ArGoMe+CoGo+GoGn+Nme+NSAr+PPMP+PPSN+Sar+SN+SNA+SNB , data = b)
b_lm_explainer <- explain(b_lm_model, data = b, label = "model_4v")
b_lm_predict <- single_prediction(b_lm_explainer, observation = new.b)

the “single_prediction” function give me back the following error:

Error in [.data.frame(new_observation, colnames(ny)) :
undefined columns selected

How can I solve it ?

Thanks in advance for your help.

Adding plot functionality to model_performance

When plotting from model_performance function, would it be possible to add functionality to limit x-axis values, as well as facets by some model factors to try to drill down into specific factors that drive the overall residuals?

Apologies in advance if these functionalities already exists. #Beginnerhere

Function names

Consider following changes in names
single_variable -> variable_response
single_prediction -> prediction_decomposition
variable_importance -> variable_leverage

Incorrect value for `type` parameter in `variable_importance()`

While passing a wrong parameter value, raw drop losses are calculated.

It would be helpful if variable_importance() return an error or a warning with information that type = "raw" was taken.

library(randomForest)

model_regr_rf <- randomForest(m2.price~., data = apartments, ntree = 50)
explainer_regr_rf <- explain(model_regr_rf, data = apartmentsTest[1:1000, ], y = apartmentsTest$m2.price[1:1000])

variable_importance(explainer_regr_rf, type="anything")

`variable_response()` doesn't use `predict_function`.

No matter which predict_function was passed to explain(), the results of the PDP plots in variable_response are the same.

 library(breakDown)
 library(randomForest)
 data(HR_data)
 HR_rf_model <- randomForest(factor(left)~., data = breakDown::HR_data, ntree = 100)
 explainer_rf  <- explain(HR_rf_model, data = HR_data,
                          predict_function = function(model, x) predict(model, x, type = "prob")[,2])
 expl_rf  <- variable_response(explainer_rf, variable = "satisfaction_level", type = "pdp")
 plot(expl_rf)
 
 
 explainer_rf_constant  <- explain(HR_rf_model, data = HR_data,
                          predict_function = function(model, x) return(0.5))
 expl_rf_constant  <- variable_response(explainer_rf_constant, variable = "satisfaction_level", type = "pdp")
 plot(expl_rf_constant)

Plots are the same, but should be different for different predict functions.
I think the solution is passing predict_function to pdp:: partial() by pred.fun parameter.

Same problem for ALE plots.

How to interpret `baseline` in `variable_importance()` plots?

How do we interpret the baseline in variable importance? Noticed that when the variable importances for different models are plotted simultaneously that the baseline numbers don't agree, even with type = "ratio", n_sample = -1.

Function names in DALEX 0.2.0

Current names are chaotic.
Here are propositions for new names. Old names will stay as deprecated.

variable_dropout() -> variable_importance()
single_variable() -> variable_response()
single_prediction() -> prediction_breakdown()

New names are more consistent with planned: outlier_detection(), model_performance()

single_variable with neuralnet( ) model

The package is great for teaching purposes. Sadly it seems (a priori!) that the function single_variable() doesn't work with neuralnet( ) model.

Here a reproducible example taken from the vignette adding a neuralnet( ) model:

set.seed(13)
N <- 250
X1 <- runif(N)
X2 <- runif(N)
X3 <- runif(N)
X4 <- runif(N)
X5 <- runif(N)

f <- function(x1, x2, x3, x4, x5) {
((x1-0.5)2)^2-0.5 + sin(x210) + x3^6 + (x4-0.5)2 + abs(2x5-1)
}
y <- f(X1, X2, X3, X4, X5)

library(randomForest)
library(DALEX)
library(e1071)
library(rms)
library(neuralnet)

df <- data.frame(y, X1, X2, X3, X4, X5)

model_rf<-randomForest(y~., df)
model_svm<-svm(y~., df)
model_lm<-lm(y~., df)
model_nn<-neuralnet(y~X1+X2+X3+X4+X5,df,hidden=1)

dd <- datadist(df)
options(datadist="dd")
model_rms <- ols(y ~ rcs(X1) + rcs(X2) + rcs(X3) + rcs(X4) + rcs(X5), df)

ex_rf<-explain(model_rf)
ex_svm<-explain(model_svm)
ex_lm<-explain(model_lm)
ex_nn<-explain(model_nn)
ex_rms<-explain(model_rms, label = "rms", data = df[, -1], y = df$y)
ex_tr<-explain(model_lm, data = df[,-1],
predict_function = function(m, x) f(x[,1], x[,2], x[,3], x[,4], x[,5]),
label = "True Model")

library(ggplot2)
plot(single_variable(ex_rf, "X1"),
single_variable(ex_svm, "X1"),
single_variable(ex_lm, "X1"),
single_variable(ex_nn, "X1"),
single_variable(ex_rms, "X1"),
single_variable(ex_tr, "X1")) +
ggtitle("Responses for X1. Truth: y ~ (2*x1 - 1)^2")

bug when data was not provided in the explain()

The description of the argument data in explain() function is:

data - data.frame or marix - data that was used for fitting. If not provided then will be extracted from model fit

But if is not provided explainer$data is NULL.
And then function single_variable gives an error.

Example code:

library(DALEX)

input <- mtcars[,c("am","cyl","hp","wt")]
model.glm = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

explainer_glm <- explain(model.glm)
expl_glm <- single_variable(explainer_glm, variable = "wt", type="pdp")

The code above returns the error:
Error in partial.default(explainer$model, pred.var = variable, train = explainer$data, :
wt not found in the training data.

But it works with provided argument data.

explainer_glm <- explain(model.glm, data = input)
expl_glm <- single_variable(explainer_glm, variable = "wt", type="pdp")

Test frame as tibble - wrong calculations for variable importance and confusing warning messages

If you do have test frame as tibble (easy to get when using tidyverse)

For calculation of variable importance you get the same values for full_model and all variables except baseline, which is obviously wrong.
variable dropout_loss label
1 full_model 284.9159 lm
2 construction.year 284.9159 lm
3 surface 284.9159 lm
4 floor 284.9159 lm
5 no.rooms 284.9159 lm
6 district 284.9159 lm
7 baseline 1261.6643 lm

For single_variable calculations you get following warning (only), however output is of limited value.
Warning message:
In if (class(explainer$data[, variable]) == "factor" & type != "factor") { :
the condition has length > 1 and only the first element will be used

Casting tibble to regular data.frame solves the issue. Having training data as tibble seems not to have an impact on calculations at all.

`apartmentsTest_tibble <- apartmentsTest %>% as_tibble()

model_liniowy <- lm(m2.price ~ construction.year + surface + floor + no.rooms + district, data = apartments)

explainer_lm <- explain(model_liniowy, data = apartmentsTest_tibble[,2:6], y = apartmentsTest_tibble$m2.price)

vi_lm <- variable_importance(explainer_lm, loss_function = loss_root_mean_square)
vi_lm

sv_lm <- single_variable(explainer_lm, variable = "construction.year", type = "pdp")`

Allow inverse sorting in plot.variable_importance_explainer

For measures that are maximized like AUC

Did the prescribed: devtools::install_github("pbiecek/DALEX") but Rstudio (latest version under Linux), will not complete the DALEX install. After the above command, R will just hang and nothing happens...

Did the prescribed:
devtools::install_github("pbiecek/DALEX")
but Rstudio (latest version under Linux),
will not complete the DALEX install.

After the above command,
R will just hang and nothing happens...
R will just hang there.

Waited for 5 minutes,
then suspended the installation.

What am I missing
in order to install DALEX ?.
(all other R packages have installed with no problems...).

Thanks!

v 0.1.1 is going to CRAN

Do we need to fix/add anything before this will be submitted?
Shapley values will get to DALEX in the next version
Support for mlr as well

Publications, proceedings and books about DALEX

Are there any scientific (peer-reviewed) publications, conference proceedings or books about DALEX and related packages of the same authors that I could cite in a scientific work? Is there a place where I could find a comprehensive list of those publications?

It would be nice if this list was also included on the website of DALEX.

"Vectorized" single_prediction

Wouldn't it be good to let "observation" argument be a list/a df of observations to explain?

the explain function in the dplyr package

The explain() method exists also in the dplyr package as the generic function.
DALEX will behave differently depending on which package is loaded first (DALEX / dplyr)
not clear how to solve this

Add usecase how to use DALEX with parsnip models

parsnip is a new model factory
https://tidymodels.github.io/parsnip/index.html

would be nice to have a vignette that shows how to use DALEX with parsnip models

Number of observations for PDP

Maybe adding small barplots under PDP curve indicating how many observations for a given x value are in the dataset would be useful?

plot.model_performance_explainer outliers' labels depend on the order of model input

Hi,

Following the example on https://pbiecek.github.io/DALEX/reference/plot.model_performance_explainer.html , if you rearrange the order of arguments from plot(mp_rf, mp_glm, mp_lm, geom = "boxplot", show_outliers = 1) to plot(mp_glm, mp_lm, mp_rf, geom = "boxplot", show_outliers = 1), you will get a graph where the outliers don't match the model.

It seems like we have to input the models best to worst in terms of root mean square of residuals for it for the outliers' label to match the model.

n.trees error with variable_response() function on gbm object

This issue is related to a previous one: #4

I get an n.trees error even when passing the n.trees argument to variable_response() with an explainer created on a gbm model object.

library(gbm)
library(DALEX)
library(breakDown)

# create a gbm model
model <- gbm(quality ~ pH + residual.sugar + sulphates + alcohol, data = wine,
             distribution = "gaussian",
             n.trees = 1000,
             interaction.depth = 4,
             shrinkage = 0.01,
             n.minobsinnode = 10,
             verbose = FALSE)

# make an explainer for the model
explainer_gbm <- explain(model, data = wine)

# single variable
exp_sgn <- variable_response(explainer_gbm, variable = "alcohol", n.trees = 1000)

Error in paste("Using", n.trees, "trees...\n") : 
  argument "n.trees" is missing, with no default](url)

Add tests

prediction_breakdown() plot result may be confusing with default `baseline` choice

New to this so please let me know if I'm misinterpreting the functionality!

Currently, the baseline argument to broken() is hardcoded to be "Intercept" and there is no way to modify it. This parameter should be exposed, and the default may lead to confusion because one would expect "final_prognosis" to be equal to the prediction, at least for models with an identity link function. Also, the plot method doesn't tell you what the baseline is, so it's difficult to tell the story of how we got to the prediction (since we don't see the prediction on the plot.)

Issue with parsnip fitted xgboost model

I have fitted an xgboost model.

Use of the object and its fit field in explain() function of DALEX is not possible.

xgb= boost_tree(mode = "regression") %>%
  set_engine(engine = "xgboost") %>%
  fit(formula = mpg ~ ., mtcars)

expl= explain(xgb, data = mtcars, y = mtcars$mpg)
variable_importance(expl)

[Error in xgb.DMatrix(newdata, missing = missing) : 
  xgb.DMatrix does not support construction from list](url)

expl= explain(xgb$fit, data = mtcars, y = mtcars$mpg)
variable_importance(expl)

Error in xgb.DMatrix(newdata, missing = missing) : 
  xgb.DMatrix does not support construction from list

Any tips to achieve interop?
(tidymodels/parsnip#127)

migration from BreakDown to BreakDown2 in DALEX v0.4

Advantages:

lower complexity O(p) instead of O(p^2) for additive attribution
identification of interactions
support for D3 visualisation

Disadvantages

change in the interface
lack of direct support for lm/glm models, only model agnostic approach will be available

Missspelling in article

In article https://pbiecek.github.io/DALEX/articles/DALEX_and_xgboost.html
there is a missspelling
model_martix_train
it shoul be model_matrix_train

prediction_breakdown for linear model with splines

Hi,
just wondering how easy it would be to allow prediction_breakdown to work with a linear model when you use a spline term in the predictor. Here is my example:

apart.lm <- lm(m2.price ~ ns(construction.year, df=5) + surface + floor + no.rooms + district,
               data=apartments)
aplm.ex <- explain(apart.lm, data=apartmentsTest[, 2:6], y = apartmentsTest$m2.price)
new_apartment <- apartmentsTest[1,]
aplm.bd <- prediction_breakdown(aplm.ex, observation=new_apartment)

Gives the error:

Error in `[.data.frame`(new_observation, colnames(ny)) : 
  undefined columns selected

I guess I could take the ns part out and use the generated basis function, but it would be convenient if you didn't have to do this, particularly if you wanted to compare, eg, models with different basis functions.

Robert

single_variable should support factor variables

with the use of factorMerger package!

Error in single_prediction with xgboost model

There is error while running single_prediction() function with xgboost model:

Error in new_observation[rep(1, nrow(data)), ] : incorrect number of dimensions

Does this function support xgb.Booster model types?

I have DALEX_0.1.8 and breakDown_0.1.4.

examples from rms package?

Hi,

Great package and contribution to the understanding of statistical / ML models. Would you be interested in including some examples that use the rms family of models? The rms family of models (e.g. ols) and associated predict methods simplify the process of integrating basis function expansion (e.g. restricted cubic basis functions) into linear models.

Here is an adaptation of one of your vignettes. Note that with the default number of rcs terms (4 knots) the linear model predictions are 99% identical to source data.

library(randomForest)
library(DALEX)
library(e1071)
library(rms)
library(ggplot2)


set.seed(13)
N <- 250
X1 <- runif(N)
X2 <- runif(N)
X3 <- runif(N)
X4 <- runif(N)
X5 <- runif(N)

f <- function(x1, x2, x3, x4, x5) {
  res <- ((x1-0.5)*2)^2-0.5 + sin(x2*10) + x3^6 + (x4-0.5)*2 + abs(2*x5-1)
  return(res)
}

y <- f(X1, X2, X3, X4, X5)

df <- data.frame(y, X1, X2, X3, X4, X5)

## important setup step required for use of rms functions
dd <- datadist(df)
options(datadist="dd")

model_rf <- randomForest(y~., df)
model_svm <- svm(y ~ ., df)

## add rcs terms to linear model
## this is a very convenient, objective way to account for non-linearity
## still a "linear" model because terms are linear combinations (additive)
model_lm <- ols(y ~ rcs(X1) + rcs(X2) + rcs(X3) + rcs(X4) + rcs(X5), df)

ex_rf <- explain(model_rf)
ex_svm <- explain(model_svm)
ex_tr <- explain(model_lm, data = df[,-1], 
                 predict_function = function(m, x) f(x[,1], x[,2], x[,3], x[,4], x[,5]), 
                 label = "True Model")

## seems that the `y` argument is required here
ex_lm <- explain(model_lm, data = df[, -1], y = df$y)


plot(single_variable(ex_rf, "X1"),
     single_variable(ex_svm, "X1"),
     single_variable(ex_lm, "X1"),
     single_variable(ex_tr, "X1")) +
  ggtitle("Responses for X1. Truth: y ~ (2*x1 - 1)^2")


plot(single_variable(ex_rf, "X2"),
     single_variable(ex_svm, "X2"),
     single_variable(ex_lm, "X2"),
     single_variable(ex_tr, "X2")) +
  ggtitle("Responses for X2. Truth: y ~ sin(10 * x2)")


plot(single_variable(ex_rf, "X3"),
     single_variable(ex_svm, "X3"),
     single_variable(ex_lm, "X3"),
     single_variable(ex_tr, "X3")) +
  ggtitle("Responses for X3. Truth: y ~ x3^6")


plot(single_variable(ex_rf, "X4"),
     single_variable(ex_svm, "X4"),
     single_variable(ex_lm, "X4"),
     single_variable(ex_tr, "X4")) +
  ggtitle("Responses for X4. Truth: y ~ (2 * x4 - 1)")


plot(single_variable(ex_rf, "X5"),
     single_variable(ex_svm, "X5"),
     single_variable(ex_lm, "X5"),
     single_variable(ex_tr, "X5")) +
  ggtitle("Responses for X5. Truth: y ~ |2 * x5 - 1|")

Installation problem on Windows 10 machine

Hi,

I tried to install both the CRAN and the GitHub version of DALEX, but I keep getting the following error

Error : .onLoad failed in loadNamespace() for 'sf', details:
  call: get(genname, envir = envir)
  error: object 'group_map' not found

The error seem to appear when installing the factorMerger library.
Any help appreciated to debug.

modeloriented / dalex Goto Github PK

dalex's Introduction

moDel Agnostic Language for Exploration and eXplanation

Overview

Installation

Learn more

Resources

R package

Python package

Talks about DALEX

Citation

Why

Acknowledgments

dalex's People

Contributors

Stargazers

Watchers

Forkers

dalex's Issues

Recommend Projects

Recommend Topics

Recommend Org