
csinva / imodels

1.3K stars · 23 watchers · 113 forks · 164.97 MB

Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).

Home Page: https://csinva.io/imodels

License: MIT License

Languages: Jupyter Notebook 79.13%, Python 20.87%
Topics: interpretability, machine-learning, data-science, artificial-intelligence, ml, ai, statistics, scikit-learn, python, optimal-classification-tree

imodels's Introduction

Hi there 👋 I'm Chandan, a Senior Researcher at Microsoft Research working on interpretable machine learning.
Homepage / Twitter / Google Scholar / LinkedIn 

🌳 Interpretable models / dataset explanations

Interpretable and accurate predictive modeling, sklearn-compatible (JOSS 2021). Contains FIGS (arXiv 2022) and HSTree (ICML 2022)

Interpretability for text. Contains Aug-imodels (Nature Communications 2023), iPrompt (ICLR workshop 2023), SASC (arXiv 2023), and Tree-Prompt (EMNLP 2023)

adaptive-wavelets Adaptive, interpretable wavelets (NeurIPS 2021)

🤖 General-purpose AI packages and cheatsheets

Notes and resources on AI

Utilities for trustworthy data-science (JOSS 2021)

🧠 Interpreting neural networks

deep-explanation-penalization Penalizing neural-network explanations (ICML 2020)

hierarchical-dnn-interpretations Hierarchical interpretations for neural network predictions (ICLR 2019)

transformation-importance Feature importance for transformations (ICLR Workshop 2020)

📊 Data-science problems

covid19-severity-prediction Extensive COVID-19 data + forecasting for counties and hospitals (HDSR 2021)

clinical-rule-vetting General pipeline for deriving clinical decision rules

iai-clinical-decision-rule Clinical decision rules for predicting intra-abdominal injury (PLOS Digital Health 2022)

molecular-partner-prediction Predicting successful CME events using only clathrin markers

Various aspects of deep learning and machine learning

gan-vae-pretrained-pytorch Pretrained GANs + VAEs + classifiers for MNIST/CIFAR in pytorch

gpt2-paper-title-generator Generating paper titles with GPT-2

disentangled-attribution-curves Attribution curves for interpreting tree ensembles (arXiv 2019)

matching-with-gans Matching in GAN latent space for better bias benchmarking (CVPR workshop 2021)

data-viz-utils Functions for easily making publication-quality figures with matplotlib

mdl-complexity Revisiting complexity and the bias-variance tradeoff (TOPML workshop 2021)

Projects advised

pasta Post-hoc Attention Steering for LLMs (ICLR 2024), led by Qingru Zhang

meta-tree Learning a Decision Tree Algorithm with Transformers (arXiv 2024), led by Yufan Zhuang

explanation-consistency-finetuning Consistent Natural-Language Explanations (arXiv 2024), led by Yanda Chen

Open-source contributions

Major: autogluon , big-bench , nl-augmenter

Minor: conference-acceptance-rates , iterative-random-forest , interpretable-ml-book , awesome-interpretable-machine-learning , awesome-machine-learning-interpretability , awesome-llm-interpretability , executable-books , deep-fMRI-dataset

Mini-projects

hummingbird-tracking, imodels-experiments, cookiecutter-ml-research, nano-descriptions, news-title-bias, java-mini-games, imodels-data, news-balancer, arxiv-copier, dnn-experiments, max-activation-interpretation-pytorch, acronym-generator, hpa-interp, sensible-local-interpretations, global-sports-analysis, mouse-brain-decoding, ...

imodels's People

Contributors

aagarwal1996 · bachsh · csinva · davidefiocco · gialmisi · jckkvs · keyan3 · mattheweplandkh · mcschmitz · mepland · omerronen · sms1097 · teakfi · thewchan · tiffanymtang · yanshuotan


imodels's Issues

GreedyRuleListClassifier has wildly varying performance and sometimes crashes

When running a number of experiments with different splits of a given dataset, I see that GreedyRuleListClassifier's accuracy varies wildly, and sometimes the code (see the for loop below) crashes.

So, for example, running 10 experiments like this with different random splits of the same set:

import sklearn.datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from imodels import GreedyRuleListClassifier

X, Y = sklearn.datasets.load_breast_cancer(as_frame=True, return_X_y=True)

model = GreedyRuleListClassifier(max_depth=10)

for i in range(10):
  try:
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
    model.fit(X_train, y_train, feature_names=X_train.columns)
    y_pred = model.predict(X_test)
    score = accuracy_score(y_test.values, y_pred)
    print('Accuracy:\n', score)
  except KeyError as e:
    print("Failed with KeyError")

will give output along the lines of:

Accuracy: 0.6081871345029239
Failed with KeyError
Accuracy: 0.4619883040935672
Accuracy: 0.45614035087719296
Accuracy: 0.2222222222222222
Failed with KeyError
Failed with KeyError
Failed with KeyError
Accuracy: 0.18128654970760233
Failed with KeyError

Is this intended behavior? While my test dataset is smallish, the variation in accuracy is still surprising to me, and so is the KeyError being thrown. I'm using scikit-learn==1.0.2 and imodels==1.3.6, and can edit the issue here to add more details.

Incidentally, the same behaviour was observed in https://datascience.stackexchange.com/a/116283/50519, noticed by @jonnor.

Thanks!

Parallelize HSTree application to fitted forest models

I looked at the inner code of HSTree: for forest models, it simply loops through the trees and applies the reg_param update to each one. This works, but uses only 1 CPU core. When training these forest models, sklearn uses multithreading/multiprocessing to speed things up considerably. Have the authors considered adding parallel support to HSTree for forest models?
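A sketch of what parallel support could look like with joblib (hypothetical: shrink_tree below is a placeholder standing in for the per-tree reg_param update that lives inside imodels, not the library's actual function):

from joblib import Parallel, delayed

def shrink_tree(tree, reg_param):
    # placeholder for the per-tree reg_param update that HSTree currently
    # applies inside its serial for-loop
    return tree

def shrink_forest(forest, reg_param, n_jobs=-1):
    # estimators_ is the standard sklearn attribute holding a forest's fitted
    # trees; applying the update tree-by-tree is embarrassingly parallel
    forest.estimators_ = Parallel(n_jobs=n_jobs)(
        delayed(shrink_tree)(tree, reg_param) for tree in forest.estimators_
    )
    return forest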

GOSDT unable to print tree

I am using the OptimalTreeClassifier model from GOSDT, as shown in the repository example itself, but I am unable to print the tree: it throws "AttributeError: 'OptimalTreeClassifier' object has no attribute 'classes_'", as shown in the attached screenshot. The error is thrown internally:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_33/809247974.py in <module>
----> 1 test_classification_binary()

/tmp/ipykernel_33/280002779.py in test_classification_binary()
     23     # test acc
     24     acc_train = np.mean(preds == new_y)
---> 25     print(type(m), m, 'final acc', acc_train)
     26     assert acc_train > 0.8, 'acc greater than 0.8'

/opt/conda/lib/python3.7/site-packages/imodels/tree/cart_wrapper.py in __str__(self)
     58         return 'GreedyTree:\n' + export_text(self, feature_names=self.feature_names, show_weights=True)
     59     else:
---> 60         return 'GreedyTree:\n' + export_text(self, show_weights=True)
     61

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70             FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74

/opt/conda/lib/python3.7/site-packages/sklearn/tree/export.py in export_text(decision_tree, feature_names, max_depth, spacing, decimals, show_weights)
    872     tree_ = decision_tree.tree_
    873     if is_classifier(decision_tree):
--> 874         class_names = decision_tree.classes_
    875     right_child_fmt = "{} {} <= {}\n"
    876     left_child_fmt = "{} {} > {}\n"

AttributeError: 'OptimalTreeClassifier' object has no attribute 'classes_'

RuleFitClassifier(tree_generator = GradientBoostingClassifier()) not working as per documentation

Hi,

When using RuleFitClassifier(tree_generator = GradientBoostingClassifier()) with a GradientBoostingClassifier() object fitted and optimized separately via the scikit-learn API, fitting the RuleFitClassifier returns the following error:

ValueError: n_estimators=1 must be larger or equal to estimators_.shape[0]=100 when warm_start==True

When inspecting what's inside the RuleFitClassifier after fitting the model, the GradientBoostingClassifier() has been completely modified to parameters different from those optimized before fitting the RuleFitClassifier(), i.e., GradientBoostingClassifier(max_leaf_nodes=4, n_estimators=1, random_state=0, warm_start=True). I'm not sure why these parameters (of the GradientBoostingClassifier()) are changed inside the RuleFitClassifier() object.
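For reference, a minimal sketch of the sequence that triggers this (hypothetical data; any binary classification set should do):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from imodels import RuleFitClassifier

X, y = load_breast_cancer(return_X_y=True)

gb = GradientBoostingClassifier(n_estimators=100)
gb.fit(X, y)  # fitted and tuned separately, as described above

model = RuleFitClassifier(tree_generator=gb)
model.fit(X, y)  # raises the ValueError above; gb's params are also overwritten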

If RuleFitClassifier(tree_generator = None), everything works well.

As per documentation:

tree_generator : Optional: this object will be used as provided to generate the rules.
This will override almost all the other properties above. Must be GradientBoostingRegressor(), GradientBoostingClassifier(), or RandomForestRegressor()

  • Which properties of RuleFitClassifier() are overridden when tree_generator=GradientBoostingClassifier()?
  • Why does this behavior occur?

The closest solution I found is in Issue #34; however, the behavior is still not clear.

Any help will be highly appreciated.

Many thanks!

BayesianRuleListClassifier Type Error with Categorical Features

I'm getting the following error when I try to use a string variable in my dataset:


TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 brl.fit(X_train, y_train, undiscretized_features=['agag_id'])

~/opt/anaconda3/lib/python3.8/site-packages/imodels/rule_list/bayesian_rule_list/bayesian_rule_list.py in fit(self, X, y, feature_labels, undiscretized_features, verbose)
119 raise Exception("Only binary classification is supported at this time!")
120
--> 121 itemsets, self.discretizer = extract_fpgrowth(X, y,
122 feature_labels=feature_labels,
123 minsupport=self.minsupport,

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/extract.py in extract_fpgrowth(X, y, feature_labels, minsupport, maxcardinality, undiscretized_features, verbose)
31
32 discretizer = BRLDiscretizer(X, y, feature_labels=feature_labels, verbose=verbose)
---> 33 X = discretizer.discretize_mixed_data(X, y, undiscretized_features)
34 X_df_onehot = discretizer.onehot_df
35

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in discretize_mixed_data(self, X, y, undiscretized_features)
286 "Warning: non-categorical data found. Trying to discretize. (Please convert categorical values to "
287 "strings, and/or specify the argument 'undiscretized_features', to avoid this.)")
--> 288 X = self.discretize(X, y)
289
290 self.discretized_X = X

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in discretize(self, X, y)
297 print("Discretizing ", self.discretized_features, "...")
298 D = pd.DataFrame(np.hstack((X, np.array(y).reshape((len(y), 1)))), columns=list(self.feature_labels) + ["y"])
--> 299 self.discretizer = MDLP_Discretizer(dataset=D, class_label="y", features=self.discretized_features)
300
301 cat_data = pd.DataFrame(np.zeros_like(X))

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in __init__(self, dataset, class_label, out_path_data, out_path_bins, features)
59 self._cuts = {f: [] for f in self._features}
60 # get cuts for all features
---> 61 self.all_features_accepted_cutpoints()
62 # discretize self._data
63 self.apply_cutpoints(out_data_path=out_path_data, out_bins_path=out_path_bins)

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in all_features_accepted_cutpoints(self)
218 '''
219 for attr in self._features:
--> 220 self.single_feature_accepted_cutpoints(feature=attr)
221 return
222

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in single_feature_accepted_cutpoints(self, feature, partition_index)
190 return
191 # determine whether to cut and where
--> 192 cut_candidate = self.best_cut_point(data=data_partition, feature=feature)
193 if cut_candidate == None:
194 return

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in best_cut_point(self, data, feature)
160 :return: value of cut point with highest information gain (if many, picks first). None if no candidates
161 '''
--> 162 candidates = self.boundaries_in_partition(data=data, feature=feature)
163 # candidates = self.feature_boundary_points(data=data, feature=feature)
164 if not candidates:

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in boundaries_in_partition(self, data, feature)
151 '''
152 range_min, range_max = (data[feature].min(), data[feature].max())
--> 153 return set([x for x in self._boundaries[feature] if (x > range_min) and (x < range_max)])
154
155 def best_cut_point(self, data, feature):

~/opt/anaconda3/lib/python3.8/site-packages/imodels/util/discretization/mdlp.py in <listcomp>(.0)
151 '''
152 range_min, range_max = (data[feature].min(), data[feature].max())
--> 153 return set([x for x in self._boundaries[feature] if (x > range_min) and (x < range_max)])
154
155 def best_cut_point(self, data, feature):

TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'

FIGSRegressor is not recognised by sklearn as a regressor

Hi guys,

I am using FIGSRegressor in combination with VotingRegressor and StackingRegressor but I keep getting the following error whenever I run the fit function.

ValueError: The estimator FIGSRegressor should be a regressor.

Please check this example. Is there a workaround or am I missing something?

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingRegressor,VotingRegressor
from sklearn.linear_model import LinearRegression
from imodels import FIGSRegressor

np.random.seed(123)

# generate X and y
n, p = 500, 10
X_sim = np.random.randn(n, p)
y_sim = 1 * X_sim[:, 0] + 2 * X_sim[:, 1] - 1 * X_sim[:, 2] + np.random.randn(n)

base_models = [('figs', FIGSRegressor()),
               ('random_forest', RandomForestRegressor())]

comb_model = VotingRegressor(estimators=base_models,
                                    n_jobs=10,
                                    verbose=2)
comb_model=comb_model.fit(X_sim, y_sim)

meta_model = LinearRegression()
stacking_model = StackingRegressor(estimators=base_models, 
                                    final_estimator=meta_model, 
                                    passthrough=False, 
                                    cv=5,
                                    verbose=2)

stacking_model=stacking_model.fit(X_sim, y_sim)
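A possible workaround until this is fixed (a sketch, not an official imodels API): sklearn's is_regressor() check looks at the _estimator_type tag, so a thin subclass that declares it may satisfy VotingRegressor and StackingRegressor:

from sklearn.ensemble import RandomForestRegressor
from imodels import FIGSRegressor

class FIGSRegressorTagged(FIGSRegressor):
    # sklearn's is_regressor() checks this class attribute
    _estimator_type = "regressor"

base_models = [('figs', FIGSRegressorTagged()),
               ('random_forest', RandomForestRegressor())]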

Bug in Readme: Hierarchical Shrinkage Doesn't Support HistGradientBoostingRegressor

The readme page claims that hierarchical shrinkage supports any sklearn tree-based model, but in reality it only works with sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier. When used with sklearn.ensemble.HistGradientBoostingRegressor, the _shrink method is nullified because neither of the two conditions is true:
https://github.com/csinva/imodels/blob/master/imodels/tree/hierarchical_shrinkage.py#L125-L127. The same holds for HistGradientBoostingClassifier.

We should either add support for hist boosting trees, or clarify this in the readme.
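For reference, a sketch of a combination that does take the shrinkage path (the estimator_ keyword mirrors its usage elsewhere in these issues; exact signatures may vary by imodels version):

from sklearn.ensemble import GradientBoostingRegressor
from imodels.tree.hierarchical_shrinkage import HSTreeRegressor

# supported: the non-hist gradient boosting estimators
model = HSTreeRegressor(estimator_=GradientBoostingRegressor())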

Implement Dynamic CDI

Implementing a Dynamic CDI class based on FIGS.

TODOs:

  • Implement a sklearn compatible class named D-FIGS in a new file imodels/tree/dynamic_figs.py
  • Write a test using the PECARN IAI dataset

More details:

  • The D-FIGS class should inherit from the FIGS class and take an additional dictionary at initialization, corresponding to the feature phases.
    When applying the fit or predict methods, the class should verify that the matrix $X$ is compatible with the feature tiers. For example, phase 2 features can be available (not NA) only if all phase 1 features are available (we may refine this logic later).
  • D-FIGS should infer the phase from the matrix.
  • The tests should be written in a new file named imodels/tests/dynamic_figs_test.py, using pytest (see package documentation or you can use the figs test as reference)
  • Before you start writing code, please write down a short description detailing how you are going to implement the dynamic fitting algorithm (a rough skeleton is sketched below). Specifically: how does the model infer the current phase of the patient? How do you store the different models for different phases and ensure these are compatible with one another?

@aagarwal1996
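A rough skeleton of the proposed class (a sketch only; names such as phases and _infer_phase are placeholders, not an agreed design):

import numpy as np
from imodels import FIGSRegressor

class DFIGSRegressor(FIGSRegressor):
    """Sketch of D-FIGS: FIGS plus a dict mapping phase -> feature indices."""

    def __init__(self, phases=None):
        super().__init__()
        self.phases = phases  # e.g. {1: [0, 1, 2], 2: [3, 4]}

    def _infer_phase(self, x_row):
        # phase k is active only if all features of phases 1..k are available (not NA)
        phase = 0
        for k in sorted(self.phases):
            if np.isnan(x_row[self.phases[k]]).any():
                break
            phase = k
        return phase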

Rulefit with categories + multi-column problems

Hi!

Introductory

While exploring the possibilities of rulefit (via imodels), some questions came up; I would be happy to discuss them, get hints, etc.

First I will describe the dataframe and code, then show the results, and then ask the actual questions about them.


Input data

A data.csv has been generated in which the bulk of the cases (oil-supply processes) last 4-6 hours (case_duration ~20,000), and there are abnormal cases that last 1.5 days (case_duration > 100,000).

It is necessary to find out: what is the reason for this anomaly?

Here are the conditions that affected the high duration (see the attached image; this can be seen even by manually inspecting the table).

In addition, an event log is provided as input (only the columns that had at least a minimal impact are included; this was computed separately earlier), there is a breakdown condition, and there is a clear picture of the desired output.


CODE: Sending an eventlog to rulefit

Here is a Jupyter notebook with the code (please change the format to .ipynb), and here, again, is data.csv.

Here is the result (see the attached image). As you can see, it somewhat fails to meet expectations.

Questions:

  1. The first thing that catches the eye are thresholds like 0.5 and inequality signs pointing in both directions. Why do they appear in the output? After one-hot encoding, the algorithm sees only 0 and 1, pure categories. It can manage without them, by the way; an example of a rule from a simple event log is shown in the attached image (everything is correct there, nothing to find fault with).

     Is there any way to tell the algorithm not to generate these numbers for category columns?

  2. While the true conditions number from 5 to 9, the algorithm returns only 3-4 conditions, and with an incorrect answer and a large coefficient...

  3. Based on the points above, are there any ideas on how to configure the algorithm so that it returns the correct set of rules?


P.S.:

  1. About the documentation: plain arrays are used for most examples - why? All the cases from my personal practice are based on tables (pandas), because they are more convenient...
  2. Regarding the message ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 0.000e+00, tolerance: 0.000e+00 - two points:
     first, why is this message returned n times, clogging up the console, rather than once?
     second, which specific parameters of the supplied RuleFitRegressor should be increased?

Issues importing BoostedRulesClassifier

Hi
I updated my imodels through pip3 install --upgrade imodels to get the new BoostedRulesClassifier. When I run the notebook I get this:
cannot import name 'BoostedRulesClassifier' from 'imodels' (/Users/user/opt/anaconda3/lib/python3.7/site-packages/imodels/__init__.py)

pkg_resources.get_distribution('imodels').version
'0.2.5'

Any clue?

Sample Weight Support?

Hello - thanks all for the very interesting looking package. The hierarchical shrinkage wrapper seems especially interesting/novel. I'm interested in whether it would be possible to add sample weight support to this package? For background, sample weights are a fairly typical part of many scikit-learn packages (e.g., RandomForestRegressor or HistGradientBoostingRegressor, etc...), and are passed via the fit call, e.g., model.fit(X_train, y_train, sample_weight = w_train).

The purpose of sample weights is to increase the weighting of rows/observations based on some external criteria, typically based around how the training data was gathered, e.g., if your data has different sensors of varying sensitivity, you may increase the sample weighting of certain sensors. Or alternatively if your data is aggregated in some form, then you can increase the weights based on the aggregation (e.g., weekly data with a weight of 7, daily data with a weight of 1, etc...).

In terms of implementation, it's typically as simple as multiplying the loss for each row by the sample weights, to increase the model's sensitivity to large weightings, although I'm not sure if the novel hierarchical shrinkage capabilities of this package would present complications.
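For reference, the standard sklearn call pattern being requested (a runnable sketch with synthetic data; the weights mirror the weekly=7 / daily=1 example above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(size=100)
# e.g. aggregated data: weekly rows weighted 7, daily rows weighted 1
w = np.where(X[:, 1] > 0, 7.0, 1.0)

model = RandomForestRegressor(random_state=0)
model.fit(X, y, sample_weight=w)  # the fit signature being requested for HSTree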

Thanks again for the very interesting looking package. I look forward to testing and using it.

how to visualize rules for *Rule* methods that have no .visualize() method

Hi!

There are methods such as:
SkopeRulesClassifier
BoostedRulesClassifier
BayesianRuleSetClassifier
OptimalRuleListClassifier
BayesianRuleListClassifier
GreedyRuleListClassifier
FIGSClassifier
FIGSRegressor
etc

These all return a list of rules in some form, but don't have a convenient .visualize() method like RuleFitClassifier and RuleFitRegressor do.

How can I get the resulting list of rules as a DataFrame for them?

I was trying to get a clear output by running this example:

https://csinva.io/imodels/rule_set/skope_rules.html

and on my own dataset:

  1. Dataset: data.csv
  2. Code: scope_rules_1_myself.txt (please change the *.txt extension to *.ipynb)

Unfortunately, the result is a bit unreadable (see the attached image).
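One possible stopgap (a sketch: many imodels rule-based estimators expose a fitted rules_ attribute, but its element type varies by model, so this may need adapting):

import pandas as pd

def rules_to_df(fitted_model):
    # stringify whatever rule objects the estimator stores in rules_
    return pd.DataFrame({'rule': [str(r) for r in fitted_model.rules_]})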

SkopeRules output

Hello
Thanks for consolidating the implementations into one nice package. I was running SkopeRules on the diabetes dataset and saw that the results are
('Insulin > 142.0 and Age > 26.5', (0.8732394366197183, 0.7005649717514124, 1))
('Insulin <= 187.5 and Insulin > 121.0 and Age > 24.5', (0.8862208393458393, 0.6502540183068366, 3))
('Insulin <= 169.75 and Insulin > 168.75', (1.0, 0.5337078651685393, 1))
('Insulin > 121.0 and BMI > 30.300000190734863 and Age <= 27.5', (0.5128579777907656, 0.17596669877528553, 2))
('Glucose <= 167.5 and Insulin > 169.75', (0.38333333333333336, 0.12921348314606743, 1))
('Insulin <= 169.75 and Insulin > 143.0 and Age <= 26.5', (0.6923076923076923, 0.10465116279069768, 1))

Looking at the code, I see that the first two elements after the rule are precision and recall; what is the third integer?
Thanks
Uday

Deprecation warning coming from setuptools

While importing imodels in python 3.10.7 I receive the following deprecation warning:

<frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead

After some searching I believe it is coming from the setuptools package in setup.py, but I am not sure how to fix it.

BayesianRuleListClassifier error when dataset has both discrete and continuous features

Hello. 👋 I have no issues with the BRL classifier when using datasets with all numeric or all categorical features, but when I use a dataset with both, I get the following error. The categorical feature in this dataset has already been one-hot encoded, and I'm passing those columns to the "undiscretized_features" parameter, but it looks like it's being encoded again anyway?


KeyError                                  Traceback (most recent call last)
<ipython-input-13-a701282e6a2c> in <module>
      1 cls = BayesianRuleListClassifier()
----> 2 cls.fit(X.values, Y, feature_names = X.columns, undiscretized_features = ["X1_N", "X1_Y"])

C:\ProgramData\Anaconda3\lib\site-packages\imodels\rule_list\bayesian_rule_list\bayesian_rule_list.py in fit(self, X, y, feature_names, undiscretized_features, verbose)
    204         rule_strs = itemsets_to_rules(self.final_itemsets)
    205         self.rules_without_feature_names_ = [Rule(r) for r in rule_strs]
--> 206         self.rules_ = [
    207             replace_feature_name(rule, self.feature_dict_) for rule in self.rules_without_feature_names_
    208         ]

C:\ProgramData\Anaconda3\lib\site-packages\imodels\rule_list\bayesian_rule_list\bayesian_rule_list.py in <listcomp>(.0)
    205         self.rules_without_feature_names_ = [Rule(r) for r in rule_strs]
    206         self.rules_ = [
--> 207             replace_feature_name(rule, self.feature_dict_) for rule in self.rules_without_feature_names_
    208         ]
    209 

C:\ProgramData\Anaconda3\lib\site-packages\imodels\util\rule.py in replace_feature_name(rule, replace_dict)
     74     replaced_agg_dict = {}
     75     for feature, symbol in rule_replaced.agg_dict:
---> 76         replaced_agg_dict[(replace_dict[feature], symbol)] = rule_replaced.agg_dict[(feature, symbol)]
     77     rule_replaced.agg_dict = replaced_agg_dict
     78     return rule_replaced

KeyError: 'X_0_0.0'

HSTree Multiclass Classification Support

Does HSTree support multiclass classification problems with RandomForest / ExtraTrees as the estimator?

From my initial tests it appears buggy. Calling predict_proba with the final model results in lots of NaN predictions, along with warnings during training such as:

/Users/neerick/workspace/virtual/autogluon/lib/python3.8/site-packages/imodels/tree/hierarchical_shrinkage.py:87: RuntimeWarning: invalid value encountered in double_scalars
  val = tree.value[i][0, 1] / (tree.value[i][0, 0] + tree.value[i][0, 1])  # binary classification

If helpful I can try to create a reproducible example.

Here is an example result comparing with the sklearn default RF (_og_) on the accuracy metric. Because HSTree returns many NaN predictions, the scores are very low.

One observation: the scores get worse the more trees there are in the HSTree forests. I'd guess the likelihood of returning a NaN result increases with the number of trees.

                       model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0       RandomForest_og_n300    0.711651   0.723618        0.985573       0.050956  0.519926                 0.985573                0.050956           0.519926            1       True          1
1       RandomForest_og_n100    0.710154   0.748744        0.453769       0.019050  0.170951                 0.453769                0.019050           0.170951            1       True          2
2        WeightedEnsemble_L2    0.710154   0.748744        0.464755       0.019376  0.295161                 0.010986                0.000326           0.124210            2       True         36
3        RandomForest_og_n40    0.700636   0.698492        0.193009       0.010738  0.088012                 0.193009                0.010738           0.088012            1       True          3
4        RandomForest_og_n20    0.692039   0.698492        0.103616       0.007549  0.057396                 0.103616                0.007549           0.057396            1       True          4
5        RandomForest_og_n10    0.674165   0.688442        0.075296       0.006166  0.041720                 0.075296                0.006166           0.041720            1       True          5
6     RandomForest_hs=10_n10    0.521949   0.537688        0.070260       0.005246  0.082384                 0.070260                0.005246           0.082384            1       True         15
7     RandomForest_hs=50_n10    0.520839   0.517588        0.075151       0.004875  0.071219                 0.075151                0.004875           0.071219            1       True         20
8    RandomForest_hs=0.1_n10    0.520796   0.537688        0.074070       0.005233  0.093299                 0.074070                0.005233           0.093299            1       True         35
9      RandomForest_hs=1_n10    0.520692   0.542714        0.077687       0.005690  0.075061                 0.077687                0.005690           0.075061            1       True         10
10   RandomForest_hs=100_n10    0.519246   0.517588        0.075059       0.006019  0.082536                 0.075059                0.006019           0.082536            1       True         25
11   RandomForest_hs=500_n10    0.488877   0.517588        0.072145       0.005125  0.072223                 0.072145                0.005125           0.072223            1       True         30
12     RandomForest_hs=1_n20    0.485125   0.472362        0.113002       0.006484  0.123639                 0.113002                0.006484           0.123639            1       True          9
13   RandomForest_hs=0.1_n20    0.485005   0.472362        0.111342       0.005953  0.146246                 0.111342                0.005953           0.146246            1       True         34
14    RandomForest_hs=10_n20    0.484833   0.482412        0.104076       0.006577  0.131909                 0.104076                0.006577           0.131909            1       True         14
15    RandomForest_hs=50_n20    0.482896   0.482412        0.115057       0.006263  0.130512                 0.115057                0.006263           0.130512            1       True         19
16   RandomForest_hs=100_n20    0.480840   0.482412        0.108625       0.006045  0.135224                 0.108625                0.006045           0.135224            1       True         24
17   RandomForest_hs=500_n20    0.458035   0.467337        0.108658       0.006302  0.123907                 0.108658                0.006302           0.123907            1       True         29
18     RandomForest_hs=1_n40    0.451434   0.467337        0.185129       0.010619  0.210639                 0.185129                0.010619           0.210639            1       True          8
19   RandomForest_hs=0.1_n40    0.451382   0.467337        0.170597       0.009024  0.244322                 0.170597                0.009024           0.244322            1       True         33
20    RandomForest_hs=10_n40    0.451322   0.467337        0.173382       0.009955  0.210795                 0.173382                0.009955           0.210795            1       True         13
21    RandomForest_hs=50_n40    0.450350   0.467337        0.170041       0.008673  0.236081                 0.170041                0.008673           0.236081            1       True         18
22   RandomForest_hs=100_n40    0.449119   0.467337        0.169396       0.010918  0.226784                 0.169396                0.010918           0.226784            1       True         23
23   RandomForest_hs=500_n40    0.435832   0.472362        0.162881       0.009256  0.202447                 0.162881                0.009256           0.202447            1       True         28
24    RandomForest_hs=1_n100    0.420419   0.452261        0.442328       0.017688  0.480776                 0.442328                0.017688           0.480776            1       True          7
25  RandomForest_hs=0.1_n100    0.420411   0.452261        0.354523       0.018247  0.548557                 0.354523                0.018247           0.548557            1       True         32
26   RandomForest_hs=10_n100    0.419981   0.452261        0.355097       0.017487  0.469547                 0.355097                0.017487           0.469547            1       True         12
27   RandomForest_hs=50_n100    0.419034   0.447236        0.344341       0.021125  0.465810                 0.344341                0.021125           0.465810            1       True         17
28  RandomForest_hs=100_n100    0.418672   0.447236        0.372041       0.018402  0.477048                 0.372041                0.018402           0.477048            1       True         22
29  RandomForest_hs=500_n100    0.415256   0.457286        0.338696       0.017128  0.492786                 0.338696                0.017128           0.492786            1       True         27
30  RandomForest_hs=0.1_n300    0.381049   0.391960        0.967061       0.045552  1.533075                 0.967061                0.045552           1.533075            1       True         31
31   RandomForest_hs=10_n300    0.381049   0.391960        1.109062       0.054005  1.442369                 1.109062                0.054005           1.442369            1       True         11
32    RandomForest_hs=1_n300    0.381040   0.391960        1.677277       0.055421  2.346773                 1.677277                0.055421           2.346773            1       True          6
33   RandomForest_hs=50_n300    0.380945   0.391960        0.889030       0.053650  1.320377                 0.889030                0.053650           1.320377            1       True         16
34  RandomForest_hs=100_n300    0.380885   0.391960        1.031198       0.045266  1.254918                 1.031198                0.045266           1.254918            1       True         21
35  RandomForest_hs=500_n300    0.380816   0.391960        0.948715       0.050209  1.266396                 0.948715                0.050209           1.266396            1       True         26
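A possible minimal reproduction (a sketch on synthetic 3-class data; behavior may depend on library versions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imodels import HSTreeClassifierCV

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=0)
model = HSTreeClassifierCV(RandomForestClassifier(n_estimators=20))
model.fit(X, y)
proba = model.predict_proba(X)
print('any NaN:', np.isnan(proba).any())  # True would reproduce the report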

Unable to replicate results on `diabetes` data from paper

Hi,

I was trying to replicate some of the Random Forest results from the paper, specifically Figure 3(D) for the diabetes dataset, but I am unable to see the gap in AUC presented in the paper. It's probably me doing something silly :) - I'd appreciate some help!

To simplify identifying a good max_depth for a Random Forest object, I'm using the class below, which allows me to use scikit-learn's GridSearchCV:

import utils as data_utils
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.base import BaseEstimator
from imodels import HSTreeClassifierCV
from matplotlib import pyplot as plt
import seaborn as sns; sns.set()

class HSRF(BaseEstimator):
    def __init__(self, reg_param_space, num_trees, max_depth):
        self.reg_param_space = reg_param_space
        self.num_trees = num_trees
        self.max_depth = max_depth
        self.HSTCV = None
        self.classes_ = None

    def fit(self, X, y):
        base_clf = RandomForestClassifier(n_estimators=self.num_trees, max_depth=self.max_depth)
        clf = HSTreeClassifierCV(base_clf, reg_param_list=self.reg_param_space)
        clf.fit(X, y)
        self.HSTCV = clf
        # this is needed for scikit's scorer to work
        self.classes_ = clf.estimator_.classes_
        return self  # sklearn estimators should return self from fit

    def predict(self, X):
        return self.HSTCV.predict(X)

    def predict_proba(self, X):
        return self.HSTCV.predict_proba(X)

And here's my code - the X and y values passed in are from the diabetes dataset:

def run_mwe(X, y):
    reg_param_space = [0.1, 1.0, 10.0, 25.0, 50.0, 100.0]  # these are from the paper, Section 4.2
    num_trees_space = np.arange(2, 21, 2)
    max_depth_space = np.arange(1, 12, 2)

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7)
    num_folds = 3
    df = pd.DataFrame(columns=['method', 'score', 'num_trees', 'cv_max_depth'])

    # RF experiments first
    for nt in num_trees_space:
        base_clf = RandomForestClassifier(class_weight='balanced', n_estimators=nt)
        clf = GridSearchCV(base_clf, cv=num_folds, param_grid={'max_depth': max_depth_space}, refit=True,
                           scoring='roc_auc')
        clf.fit(X_train, y_train)
        pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
        score = roc_auc_score(y_test, pr)
        df = df.append({'method': 'RF', 'score': score, 'num_trees': nt, 'cv_max_depth': clf.best_params_['max_depth']},
                       ignore_index=True)

    # HS Tree experiments next
    for nt in num_trees_space:
        clf = GridSearchCV(HSRF(reg_param_space=reg_param_space, num_trees=nt, max_depth=1),
                            param_grid={'max_depth': max_depth_space}, cv=num_folds, scoring='roc_auc', verbose=4,
                            refit=True)
        clf.fit(X_train, y_train)
        pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
        score = roc_auc_score(y_test, pr)
        df = df.append(
            {'method': 'HSRF', 'score': score, 'num_trees': nt, 'cv_max_depth': clf.best_params_['max_depth']},
            ignore_index=True)

    return df

When I plot score against num_trees from df, I see something like the attached image.

Issue with feature_names in GreedyRuleListClassifier

When I pass feature_names=X.columns, only the first feature appears by name in the rule list; the others appear as "feat i". I'm unable to fix this and would appreciate your support.

Here is the output snippet:
Selected features: Index(['Processor(P99)_Q', 'Opto(F99)_Q', 'Logic(L99)_Am', 'Qualcom',
'Toshiba', 'ABB', 'Whirlpool', 'Honeywell'],
dtype='object')
mean 0.6 (30 pts)
if Whirlpool >= 153 then 1.0 (16 pts)
mean 0.143 (14 pts)
if feat 1 >= 16882885 then 1.0 (2 pts)
mean 0.0 (12 pts)

BoostedRulesClassifier sometimes throws an exception

Hi,

When I use the BoostedRulesClassifier, it sometimes throws an exception as follows:

This BoostedRulesClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

I find that the exception results from the implementation of the class RuleSet:
def _eval_weighted_rule_sum(self, X) -> np.ndarray:
    check_is_fitted(self, ['rules_without_feature_names_', 'n_features_', 'feature_placeholders'])

    X = check_array(X)

    if X.shape[1] != self.n_features_:
        raise ValueError("X.shape[1] = %d should be equal to %d, the number of features at training time."
                         " Please reshape your data."
                         % (X.shape[1], self.n_features_))

    df = pd.DataFrame(X, columns=self.feature_placeholders)
    selected_rules = self.rules_without_feature_names_

    scores = np.zeros(X.shape[0])
    for r in selected_rules:
        features_r_uses = list(map(lambda x: x[0], r.agg_dict.keys()))
        scores[df[features_r_uses].query(str(r)).index.values] += r.args[0]

    return scores

Specifically, when check_is_fitted(self, ['rules_without_feature_names_', 'n_features_', 'feature_placeholders']) runs, it finds that self.rules_without_feature_names_ does not exist, so the above exception is thrown.

Reviewing my code and data set further, I find that my training set makes it easy to train a classifier, so the training error of the estimator is close to zero; this triggers a bug in the fit function of the class BoostedRulesClassifier:
for _ in range(self.n_estimators):
    # Fit a classifier with the specific weights
    clf = self.estimator()
    clf.fit(X, y, sample_weight=w)  # uses w as the sampling weight!
    preds = clf.predict(X)
    self.estimator_mean_prediction_.append(np.mean(preds))  # just for printing

    # Indicator function
    miss = preds != y

    # Equivalent with 1/-1 to update weights
    miss2 = np.ones(miss.size)
    miss2[~miss] = -1

    # Error
    err_m = np.dot(w, miss) / sum(w)

    if err_m < 1e-3:
        return self

    # Alpha
    alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))

    # New weights
    w = np.multiply(w, np.exp([float(x) * alpha_m for x in miss2]))

    self.estimators_.append(deepcopy(clf))
    self.estimator_weights_.append(alpha_m)
    self.estimator_errors_.append(err_m)

rules = []
Because err_m is zero, it directly returns self without executing the subsequent statements; in that case, self.rules_without_feature_names_ does not exist.

My current solution to this bug is to modify the following code fragment in the fit function of the class BoostedRulesClassifier:
# Error
err_m = np.dot(w, miss) / sum(w)

# modification ###########################
if err_m < 1e-3:
    # return self
    w = np.ones(miss.size) / len(y)
    self.estimators_.append(deepcopy(clf))
    self.estimator_weights_.append(float("inf"))
    self.estimator_errors_.append(err_m)
    break
##########################################

# Alpha
alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))
I'm not sure whether it may introduce new defects, but it indeed solves the exception.

[feature request] need a `verbose: int` param for each model

I have a training dataset of around 1.5m records. I was trying to get FIGSRegressor to fit it, and it's been running more than 2hrs without any indication about its progress.

It'd be great to have a verbose: int param in the constructor to report what's happening within the fitting process, based on the level passed to it.

E.g.

ensemble.RandomForestRegressor(n_jobs=-1, random_state=rand_state, verbose=1)
ensemble.BaggingRegressor(n_jobs=-1, random_state=rand_state, verbose=1)
xgb.XGBRegressor(verbosity=1, booster='gbtree', n_jobs=-1, random_state=rand_state)
lgb.LGBMRegressor(num_leaves=2047, random_state=rand_state, force_col_wise=True, verbose=1)

Thanks.

'BoostedRulesClassifier' object has no attribute 'complexity_'

After imodels was updated to 1.3.8, we get the error message 'BoostedRulesClassifier' object has no attribute 'complexity_'. Was it removed or renamed? It is generally better to keep public APIs/attributes unchanged during minor releases; is there any plan to add it back?

`FIGSRegressor.fit()` does not seem to utilize all CPU cores.

I have a training dataset of 1473711 records.

After throwing it to FIGSRegressor.fit(), it's been running for almost 3hrs without evidence of stopping.

Looking at the processes, I see there are 4 running in parallel:

259883 Sl     0:00 /usr/bin/python -c from joblib.externals.loky.backend.resource_tracker import main; main(60, False)
259885 S      1:47 /usr/bin/python -m joblib.externals.loky.backend.popen_loky_posix --process-name LokyProcess-1 --pipe 73
259886 S      2:18 /usr/bin/python -m joblib.externals.loky.backend.popen_loky_posix --process-name LokyProcess-2 --pipe 74
259887 S      2:01 /usr/bin/python -m joblib.externals.loky.backend.popen_loky_posix --process-name LokyProcess-3 --pipe 75
259888 S      1:48 /usr/bin/python -m joblib.externals.loky.backend.popen_loky_posix --process-name LokyProcess-4 --pipe 77

However, in top I see only 1 of the 4 CPUs being utilized (see the attached image).

Any chance this is a bug or something expected?

Thanks.

FIGS print and plot return different trees

In imodels_demo.ipynb the Tree #0 returned by printing the fitted model:

Glucose concentration test <= 99.500 (Tree #0 root)
	Val: 0.068 (leaf)
	Glucose concentration test <= 168.500 (split)
		#Pregnant <= 6.500 (split)
			Body mass index <= 30.850 (split)
				Val: 0.065 (leaf)
				Blood pressure(mmHg) <= 67.000 (split)
					Val: 0.705 (leaf)
					Val: 0.303 (leaf)
			Val: 0.639 (leaf)
		Blood pressure(mmHg) <= 93.000 (split)
			Val: 0.860 (leaf)
			Val: -0.009 (leaf)

and the plotted version (attached figure) do not agree!

Based on _tree_to_str_with_data, which agrees with the simpler _tree_to_str actually being called here - see below, the first line printed after a split is the left / true branch, while the second line after the split is the right / false branch.

Reading the printed version, after the first "Glucose concentration test <= 99.500" split, there should be a leaf with value 0.068 for <= 99.5, and then the "Glucose concentration test <= 168.500" split for > 99.5, but this is not the structure of the plotted tree. Also, note that both of the "Blood pressure(mmHg)" splits should end in leaves, while in the plot one of them leads to a "#Pregnant" split.

The output of print(figs.print_tree(X_train, y_train)) is shown below for reference:

Glucose concentration test <= 99.500 65/192 (33.85%)
	ΔRisk = 0.07 4/59 (6.78%)
	Glucose concentration test <= 168.500 61/133 (45.86%)
		#Pregnant <= 6.500 44/112 (39.29%)
			Body mass index <= 30.850 21/76 (27.63%)
				ΔRisk = 0.06 2/31 (6.45%)
				Blood pressure(mmHg) <= 67.000 19/45 (42.22%)
					ΔRisk = 0.71 10/14 (71.43%)
					ΔRisk = 0.30 9/31 (29.03%)
			ΔRisk = 0.64 23/36 (63.89%)
		Blood pressure(mmHg) <= 93.000 17/21 (80.95%)
			ΔRisk = 0.86 17/19 (89.47%)
			ΔRisk = -0.01 0/2 (0.0%)

Also see my stripped down notebook demonstrating the issue here.

No module named 'irf'

For the IRFClassifier, there seems to be a missing module (irf):

from irf.ensemble import wrf
ModuleNotFoundError: No module named 'irf'

When I look in the source files, this module does not exist.

Thanks!

how to limit the number of rules and the number of conditions per rule, but get the best performance under these constraints

Can you help clarify how to set parameters for SBRL:

  1. to limit the number of rules
  2. to limit the number of conditions in rules

while getting the best performance under these constraints, like what max_card=2 does in https://github.com/fingoldin/pycorels:

C = CorelsClassifier(max_card=2, c=0.0, verbosity=["loud", "samples"])

Which of these parameters should I use, and how?
model = RuleListClassifier(max_iter=10000, class1label="diabetes", verbose=False)
'''
Parameters
----------
listlengthprior : int, optional (default=3)
    Prior hyperparameter for expected list length (excluding null rule)

listwidthprior : int, optional (default=1)
    Prior hyperparameter for expected list width (excluding null rule)
    
maxcardinality : int, optional (default=2)
    Maximum cardinality of an itemset
    
minsupport : int, optional (default=10)
    Minimum support (%) of an itemset

alpha : array_like, shape = [n_classes]
    prior hyperparameter for multinomial pseudocounts

n_chains : int, optional (default=3)
    Number of MCMC chains for inference

max_iter : int, optional (default=50000)
    Maximum number of iterations
    
class1label: str, optional (default="class 1")
    Label or description of what the positive class (with y=1) means
    
verbose: bool, optional (default=True)
    Verbose output
    
random_state: int
    Random seed

'''
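Given the docstring above, a sketch of nudging both quantities down; note that listlengthprior is a prior (a soft preference), while maxcardinality is a hard cap on conditions per itemset/rule:

from imodels import BayesianRuleListClassifier

model = BayesianRuleListClassifier(
    listlengthprior=2,   # prior on expected list length -> favors shorter lists
    maxcardinality=2,    # at most 2 conditions per itemset/rule
    max_iter=10000,
    class1label="diabetes",
)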

Speeding up HS with LOOCV

Hello, thanks again for the great library!

I'm interested in applying HSTree to RandomForest and ExtraTrees models.

According to the documentation, I can specify a random forest object in the estimator_ argument, however this raises an error when I try to fit:

        from sklearn.ensemble import RandomForestClassifier
        model = RandomForestClassifier()
        import imodels
        from imodels.tree.hierarchical_shrinkage import HSTreeClassifier
        model = HSTreeClassifier(estimator_=model)
        model = model.fit(X, y)
  File "/Users/neerick/workspace/code/autogluon/tabular/src/autogluon/tabular/models/rf/rf_model.py", line 232, in _fit
    model = model.fit(X, y, sample_weight=sample_weight)
  File "/Users/neerick/workspace/virtual/autogluon/lib/python3.8/site-packages/imodels/tree/hierarchical_shrinkage.py", line 64, in fit
    self.complexity_ = compute_tree_complexity(self.estimator_.tree_)
AttributeError: 'RandomForestClassifier' object has no attribute 'tree_'

https://github.com/csinva/imodels/blob/master/imodels/tree/hierarchical_shrinkage.py

I don't see any tutorial / documentation for creating a random forest or extra trees model via HSTree, but the paper mentions that this is possible and gets good results. I was wondering if the maintainers could point me to an example or tutorial on this.
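For what it's worth, the CV wrapper appears to accept a forest directly in the diabetes-replication issue above; a sketch of that route (untested here):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imodels import HSTreeClassifierCV

# wrap the whole forest with the CV variant instead of HSTreeClassifier
X, y = make_classification(random_state=0)
clf = HSTreeClassifierCV(RandomForestClassifier(n_estimators=50))
clf.fit(X, y)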

Thanks!

Feature Importance

Hi 👋
is there a way to get the feature importance from the RuleFit algorithm through your implementation? 🤔
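If it helps, the fitted model appears to expose per-rule importances via get_rules() (see the max_rules issue further down, whose output includes an 'importance' column); a sketch:

from sklearn.datasets import load_diabetes
from imodels import RuleFitRegressor

X, y = load_diabetes(return_X_y=True)
model = RuleFitRegressor()
model.fit(X, y)

rules = model.get_rules()  # columns include 'rule', 'coef', 'support', 'importance'
print(rules[rules.coef != 0].sort_values('importance', ascending=False).head())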

cannot import FIGSRegressorCV

Hi guys,

This is an amazing library! Thank you for your hard work.

I am testing the FIGSRegressor algorithm after reading your paper but it seems that I cannot load the CV implementation. I have tried both FIGSRegressorCV and FIGSCV but I keep getting ImportError: cannot import name 'FIGSCV' from 'imodels'.

Thank you

Two exactly identical rules from RuleFitClassifier

Hello~

When I use the RuleFitClassifier, it returns two exactly identical rules but with different coefficients. Did the internal structure fail to aggregate the rules? I have tried using RuleFit directly, and it doesn't seem to have the same problem~

Part of my result is shown in the attached image.

RuleFitClassifier not working with simple example using iris data

The following code snippet results in an error:

from sklearn.datasets import load_iris
from imodels import RuleFitClassifier

iris = load_iris()
X, Y = iris.data, iris.target
rulefit = RuleFitClassifier()
rulefit.fit(X, Y)
print(rulefit)

The error reads:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_208411/3401153452.py in <cell line: 9>()
      7 rulefit = RuleFitClassifier()
      8 rulefit.fit(X, Y)
----> 9 print(rulefit)

~/.cache/pypoetry/virtualenvs/xlemoo-6BFI3yUJ-py3.8/lib/python3.8/site-packages/imodels/rule_set/rule_fit.py in __str__(self)
    247         s += '> \tPredictions are made by summing the coefficients of each rule\n'
    248         s += '> ------------------------------\n'
--> 249         return s + self.visualize().to_string(index=False) + '\n'
    250 
    251     def _extract_rules(self, X, y) -> List[Rule]:

~/.cache/pypoetry/virtualenvs/xlemoo-6BFI3yUJ-py3.8/lib/python3.8/site-packages/imodels/rule_set/rule_fit.py in visualize(self, decimals)
    237 
    238     def visualize(self, decimals=2):
--> 239         rules = self._get_rules()
    240         rules = rules[rules.coef != 0].sort_values("support", ascending=False)
    241         pd.set_option('display.max_colwidth', None)

~/.cache/pypoetry/virtualenvs/xlemoo-6BFI3yUJ-py3.8/lib/python3.8/site-packages/imodels/rule_set/rule_fit.py in _get_rules(self, exclude_zero_coef, subregion)
    208         for i in range(0, n_features):
    209             if self.lin_standardise:
--> 210                 coef = self.coef[i] * self.friedscale.scale_multipliers[i]
    211             else:
    212                 coef = self.coef[i]

IndexError: index 4 is out of bounds for axis 0 with size 4

I tried to look into this issue myself, but I am not familiar enough with the method to make any definitive claims. However, this line of code seems fishy. Why not just use the actual number of features stored in self.n_features? Could be a source of the indexing error.

Add support for `dtreeviz` visualizations

Add any required translation code to allow imodels trees to be plotted with dtreeviz. This basically boils down to successfully generating a ShadowDecTree object from an imodels tree.

We can reuse the existing ShadowSKDTree constructor by converting imodels trees into sklearn objects, then calling:

sk_dtree = ShadowSKDTree(tree_classifier, X, y, features, target, [0, 1])

Alternatively, we can make an imodels specific implementation of ShadowDecTree, similar to the sklearn implementation here, but that may be more work than necessary.
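A self-contained version of that call on a plain sklearn tree (a sketch; the import path and constructor signature follow dtreeviz 1.x and may differ across versions):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from dtreeviz.models.sklearn_decision_trees import ShadowSKDTree

iris = load_iris()
X, y = iris.data, iris.target
tree_classifier = DecisionTreeClassifier(max_depth=3).fit(X, y)

# an imodels tree converted into an sklearn tree could be passed the same way
sk_dtree = ShadowSKDTree(tree_classifier, X, y, iris.feature_names,
                         'species', [0, 1, 2])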

how to set maximum number of rules and maximum length of rules for rulefit ?

how to set maximum number of rules and maximum length of rules for rulefit ?

When I set max_rules = 4, I get this:
rules
rule ... importance
0 crim ... 1.437129
7 dis ... 2.151996
12 lstat ... 2.533759
11 black ... 0.683094
1 zn ... 0.307754
9 tax ... 1.825235
8 rad ... 2.150141
10 ptratio ... 1.735849
6 age ... 0.489965
5 rm ... 1.107437
4 nox ... 1.251279
3 chas ... 0.213609
2 indus ... 0.032803
13 rm <= 6.805000066757202 ... 1.701190
17 dis > 1.3727499842643738 & rm <= 6.82150006294... ... 0.221855
22 lstat <= 9.890000343322754 & dis > 1.372749984... ... 0.030449
19 rm <= 6.821500062942505 & dis <= 2.00444996356... ... 0.342680
18 rm > 6.821500062942505 & lstat <= 4.7200000286... ... 2.445310
21 rm > 6.821500062942505 & lstat <= 4.7200000286... ... 0.836880
15 dis <= 1.3727499842643738 & rm <= 6.8215000629... ... 2.641944

[20 rows x 5 columns]

rules.columns
Index(['rule', 'type', 'coef', 'support', 'importance'], dtype='object')

from imodels import RuleFit
import pandas as pd
import numpy as np

# load some data
#boston_data = pd.read_csv("../data/boston.csv", index_col=0)
boston_data = pd.read_csv("boston.csv", index_col=0)
y = boston_data.medv.values
X = boston_data.drop("medv", axis=1)
features = X.columns
X = X.values

# fit a rulefit model
rf = RuleFit(max_rules=4)
'''
Parameters
----------
tree_size: Number of terminal nodes in generated trees. If exp_rand_tree_size=True,
    this will be the mean number of terminal nodes.
sample_fract: fraction of randomly chosen training observations used to produce each tree.
    FP 2004 (Sec. 2)
max_rules: approximate total number of rules generated for fitting. Note that actual
    number of rules will usually be lower than this due to duplicates.
memory_par: scale multiplier (shrinkage factor) applied to each new tree when
    sequentially induced. FP 2004 (Sec. 2)
rfmode: 'regress' for regression or 'classify' for binary classification.
lin_standardise: If True, the linear terms will be standardised as per Friedman Sec 3.2
    by multiplying the winsorised variable by 0.4/stdev.
lin_trim_quantile: If lin_standardise is True, this quantile will be used to trim linear
    terms before standardisation.
exp_rand_tree_size: If True, each boosted tree will have a different maximum number of
    terminal nodes based on an exponential distribution about tree_size.
    (Friedman Sec 3.3)
model_type: 'r': rules only; 'l': linear terms only; 'rl': both rules and linear terms
random_state: Integer to initialise random objects and provide repeatability.
tree_generator: Optional: this object will be used as provided to generate the rules.
    This will override almost all the other properties above.
    Must be GradientBoostingRegressor or GradientBoostingClassifier, optional (default=None)
'''
rf.fit(X, y, feature_names=features)

# calculate mse on the training data
preds = rf.predict(X)
print(f'train mse: {np.mean(np.square(preds-y)):0.2f}')

rules = rf.get_rules()
rules = rules[rules.coef != 0].sort_values("support", ascending=False)

# 'rule' is how the feature is constructed
# 'coef' is its weight in the final linear model
# 'support' is how many points it applies to
#rules[['rule', 'coef', 'support']].head().style.background_gradient(cmap='viridis')
rules[['rule', 'coef', 'support']].head()

SLIMRegressor error

Running SLIMRegressor on the regression dataset https://www.kaggle.com/mirichoi0218/insurance (with categorical values handled), the error I get is:


---------------------------------------------------------------------------
SolverError                               Traceback (most recent call last)
<ipython-input> in <module>
      1 from imodels import SLIMRegressor
      2 rf = SLIMRegressor()
----> 3 rf.fit(X, y)

~/opt/anaconda3/lib/python3.7/site-packages/imodels/algebraic/slim.py in fit(self, X, y, lambda_reg, sample_weight)
     49
     50     # solve the problem using an appropriate solver
---> 51     prob.solve()
     52     self.model.coef_ = w.value.astype(np.int)
     53     self.model.intercept_ = 0

~/opt/anaconda3/lib/python3.7/site-packages/cvxpy/problems/problem.py in solve(self, *args, **kwargs)
    394         else:
    395             solve_func = Problem._solve
--> 396         return solve_func(self, *args, **kwargs)
    397
    398     @classmethod

~/opt/anaconda3/lib/python3.7/site-packages/cvxpy/problems/problem.py in _solve(self, solver, warm_start, verbose, gp, qcp, requires_grad, enforce_dpp, **kwargs)
    749
    750         data, solving_chain, inverse_data = self.get_problem_data(
--> 751             solver, gp, enforce_dpp)
    752         solution = solving_chain.solve_via_data(
    753             self, data, warm_start, verbose, kwargs)

~/opt/anaconda3/lib/python3.7/site-packages/cvxpy/problems/problem.py in get_problem_data(self, solver, gp, enforce_dpp)
    498         self._cache.invalidate()
    499         solving_chain = self._construct_chain(
--> 500             solver=solver, gp=gp, enforce_dpp=enforce_dpp)
    501         self._cache.key = key
    502         self._cache.solving_chain = solving_chain

~/opt/anaconda3/lib/python3.7/site-packages/cvxpy/problems/problem.py in _construct_chain(self, solver, gp, enforce_dpp)
    655             A solving chain
    656         """
--> 657         candidate_solvers = self._find_candidate_solvers(solver=solver, gp=gp)
    658         return construct_solving_chain(self, candidate_solvers, gp=gp,
    659                                        enforce_dpp=enforce_dpp)

~/opt/anaconda3/lib/python3.7/site-packages/cvxpy/problems/problem.py in _find_candidate_solvers(self, solver, gp)
    614             in incorrect solutions and is not recommended.
    615             """
--> 616             raise error.SolverError(msg)
    617         candidates['qp_solvers'] = [
    618             s for s in candidates['qp_solvers']

SolverError:

                You need a mixed-integer solver for this model. Refer to the documentation
                    https://www.cvxpy.org/tutorial/advanced/index.html#mixed-integer-programs
                for discussion on this topic.

                Quick fix 1: if you install the python package CVXOPT (pip install cvxopt),
                then CVXPY can use the open-source mixed-integer solver `GLPK`.

                Quick fix 2: you can explicitly specify solver='ECOS_BB'. This may result
                in incorrect solutions and is not recommended.

tree_generator can not support RuleFitClassifier

Hello~ I find that when I use RuleFitClassifier with the tree_generator parameter, there are some problems:

Your documentation mentions that the value must be GradientBoostingRegressor or GradientBoostingClassifier.

But when I try it, the errors tell me to use RandomForest and BoostingRegressor.

I want to confirm: does tree_generator really not support classifiers? Or am I using it incorrectly? Can you give an example of how to do so? Thanks!!

My errors are shown in the attached image.

[JOSS] Review

Hi @vissarion

I reviewed the manuscript as well as the repository.

This work provides a very important direction in interpretable machine learning, supporting interpretable models without performance loss. Moreover, it is implemented following the scikit-learn API style, so it can easily be used by any practitioner.

My one issue, #43, has already been resolved, and the project complies with the conventions of open-source projects as well as the JOSS guidelines.

For this reason, I recommend acceptance of this work.

Best,
Jungtaek.

BRL input handling

Sometimes fails when feature_names is not passed...

Sometimes fails when only one input column is passed

SlimRegressor error

When I run SlimRegressor on regression data:

from imodels import SLIMRegressor
model = SLIMRegressor()
for lambda_reg in [0, 1e-2, 5e-2, 1e-1, 1, 2]:
    model.fit(X_train, y_train, lambda_reg)
    mse = np.mean(np.square(y_train - model.predict(X_train)))
    print(f'lambda: {lambda_reg}\tmse: {mse: 0.2f}\tweights: {model.model.coef_}')

I get

SolverError: Either candidate conic solvers (['GLPK_MI']) do not support the cones output by the problem (SOC, NonNeg), or there are not enough constraints in the problem.

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      2 model = SLIMRegressor()
      3 for lambda_reg in [0, 1e-2, 5e-2, 1e-1, 1, 2]:
----> 4     model.fit(X_train, y_train, lambda_reg)
      5     mse = np.mean(np.square(y_train - model.predict(X_train)))
      6     print(f'lambda: {lambda_reg}\tmse: {mse: 0.2f}\tweights: {model.model.coef_}')

~/opt/anaconda3/lib/python3.7/site-packages/imodels/algebraic/slim.py in fit(self, X, y, lambda_reg, sample_weight)
     58     except:
     59         m = Lasso(alpha=lambda_reg)
---> 60         m.fit(X, y, sample_weight=sample_weight)
     61     self.model.coef_ = np.round(m.coef_).astype(np.int)
     62     self.model.intercept_ = m.intercept_

TypeError: fit() got an unexpected keyword argument 'sample_weight'

Missing import in README

In the README's example use of the library, the model used is HSTreeRegressorCV, but the import line reads

from imodels import get_clean_dataset, HSTreeClassifierCV # import any model here

while the model is then used in

model = HSTreeRegressorCV(max_leaf_nodes=4) # initialize a tree model and specify only 4 leaf nodes

That should be fixed in #143.
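i.e., the corrected example would read:

from imodels import get_clean_dataset, HSTreeRegressorCV  # import any model here
model = HSTreeRegressorCV(max_leaf_nodes=4)  # initialize a tree model and specify only 4 leaf nodes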

[JOSS] Guide for installation

Hi authors,

I am reviewing your submission and imodels repository.

I am looking into your software, but I cannot find a guide for installing your software.

If you support PyPI installation or installation with setuptools, please add some description about these for the future users of your software.

Best,
Jungtaek.
