bcg-x-official / facet
Human-explainable AI.
Home Page: https://bcg-x-official.github.io/facet
License: Apache License 2.0
Describe the bug
The current UnivariateProbabilitySimulator() appears to have two issues:
To Reproduce
To reproduce these errors, please go to the branch https://github.com/BCG-Gamma/facet/tree/docs/notebook_updates
and, within the sphinx > source > tutorial folder, run the notebook https://github.com/BCG-Gamma/facet/blob/docs/notebook_updates/sphinx/source/tutorial/Prediabetes_classification_with_Facet.ipynb
Expected behavior
In both cases I expect to be able to get a complete figure with simulated trend and CIs for a feature displayed correctly.
Screenshots
First error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-60-ea5a7119572e> in <module>
2 simulator = UnivariateProbabilitySimulator(crossfit=ranker.best_model_crossfit, n_jobs=-1)
3 partitioner = ContinuousRangePartitioner()
----> 4 univariate_simulation = simulator.simulate_feature(name=sim_feature, partitioner=partitioner)
C:\Projects\facet\facet\src\facet\simulation\_simulation.py in simulate_feature(self, name, partitioner)
182 raise NotImplementedError("multi-output simulations are not supported")
183
--> 184 simulation_values = partitioner.fit(sample.features.loc[:, name]).partitions()
185 simulation_results = self._aggregate_simulation_results(
186 results_per_split=self._simulate_feature_with_values(
C:\Projects\facet\facet\src\facet\simulation\partition\_partition.py in fit(self, values, lower_bound, upper_bound, **fit_params)
202 # calculate the step count based on the maximum number of partitions,
203 # rounded to the next-largest rounded value ending in 1, 2, or 5
--> 204 self._step = step = self._step_size(lower_bound, upper_bound)
205
206 # calculate centre values of the first and last partition;
C:\Projects\facet\facet\src\facet\simulation\partition\_partition.py in _step_size(self, lower_bound, upper_bound)
334 def _step_size(self, lower_bound: float, upper_bound: float) -> float:
335 return RangePartitioner._ceil_step(
--> 336 (upper_bound - lower_bound) / (self.max_partitions - 1)
337 )
338
C:\Projects\facet\facet\src\facet\simulation\partition\_partition.py in _ceil_step(step)
294 raise ValueError("arg step must be positive")
295
--> 296 return min(10 ** math.ceil(math.log10(step * m)) / m for m in [1, 2, 5])
297
298 @staticmethod
C:\Projects\facet\facet\src\facet\simulation\partition\_partition.py in <genexpr>(.0)
294 raise ValueError("arg step must be positive")
295
--> 296 return min(10 ** math.ceil(math.log10(step * m)) / m for m in [1, 2, 5])
297
298 @staticmethod
ValueError: cannot convert float NaN to integer
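The last frame suggests why the failure surfaces so late. What follows is a minimal sketch, assuming the guard at line 294 is a plain `step <= 0` comparison (the traceback shows only the `raise`, not its condition). NaN compares false against everything, so it would slip past such a check and fail only inside `math.ceil`. The function name `ceil_step` and the explicit NaN check are illustrative and are not FACET's actual fix.

```python
import math


def ceil_step(step: float) -> float:
    """Round step up to the nearest value of the form 1, 2 or 5 times a power of 10."""
    # Explicit NaN guard (added for illustration): `nan <= 0` is False,
    # so a plain positivity check would let NaN through to math.ceil below.
    if math.isnan(step):
        raise ValueError("arg step must not be NaN; check the bounds passed to fit()")
    if step <= 0:
        raise ValueError("arg step must be positive")
    # Same expression as in the traceback above.
    return min(10 ** math.ceil(math.log10(step * m)) / m for m in [1, 2, 5])


print(ceil_step(0.3))
print(ceil_step(7))

# Without the guard, NaN only fails once it reaches math.ceil:
try:
    math.ceil(math.log10(float("nan")))
except ValueError as error:
    print(error)
```

If this reading is right, the step becomes NaN because one of the bounds derived from the feature values is NaN, so the feature column passed to the partitioner is worth checking for missing values.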
Second error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-58-22d13aac8dd5> in <module>
----> 1 SimulationDrawer().draw(data=univariate_simulation, title=sim_feature)
C:\Projects\facet\facet\src\facet\simulation\viz\_draw.py in draw(self, data, title)
73 if title is None:
74 title = f"Simulation: {data.feature}"
---> 75 super().draw(data=data, title=title)
76
77 @classmethod
C:\Projects\facet\pytools\src\pytools\viz\_viz.py in draw(self, data, title)
104 # noinspection PyProtectedMember
105 style._drawing_start(title)
--> 106 self._draw(data)
107 # noinspection PyProtectedMember
108 style._drawing_finalize()
C:\Projects\facet\facet\src\facet\simulation\viz\_draw.py in _draw(self, data)
96 partitions=simulation_series.partitions,
97 frequencies=simulation_series.frequencies,
---> 98 is_categorical_feature=data.partitioner.is_categorical,
99 )
100
C:\Projects\facet\facet\src\facet\simulation\viz\_style.py in draw_uplift(self, feature, target, values_label, values_median, values_min, values_max, values_baseline, percentile_lower, percentile_upper, partitions, frequencies, is_categorical_feature)
178
179 # add a horizontal line at y=0
--> 180 ax.axhline(y=values_baseline, linewidth=0.5)
181
182 # remove the top and right spines
C:\Anaconda3\envs\facet-develop\lib\site-packages\matplotlib\axes\_axes.py in axhline(self, y, xmin, xmax, **kwargs)
860 self._process_unit_info(ydata=y, kwargs=kwargs)
861 yy = self.convert_yunits(y)
--> 862 scaley = (yy < ymin) or (yy > ymax)
863
864 trans = self.get_yaxis_transform(which='grid')
TypeError: '>' not supported between instances of 'float' and 'method'
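The final frame compares a `float` with a `method`, which is typically what happens when a bound method is passed where its return value was intended. Below is a minimal sketch of that failure mode, using a hypothetical `SimulationResult` class with a `baseline()` method. These names are illustrative and not taken from FACET; whether `values_baseline` is in fact an uncalled method would need to be confirmed in `_draw.py`. The exact wording of the error message can also differ from the traceback above, depending on the operand types involved.

```python
class SimulationResult:
    """Hypothetical stand-in for an object that exposes its baseline via a method."""

    def __init__(self, baseline_value: float) -> None:
        self._baseline_value = baseline_value

    def baseline(self) -> float:
        return self._baseline_value


def is_outside(y: float, ymin: float, ymax: float) -> bool:
    # Mirrors the comparison in matplotlib's axhline shown in the traceback.
    return (y < ymin) or (y > ymax)


result = SimulationResult(0.25)

# Method called: two floats are compared, as intended.
print(is_outside(result.baseline(), 0.0, 1.0))

# Method object passed by mistake: the comparison raises a TypeError.
try:
    is_outside(result.baseline, 0.0, 1.0)
except TypeError as error:
    print(error)
```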
Desktop (please complete the following information):
Is your feature request related to a problem? Please describe.
Simulation confidence intervals are computed through bootstrapping. If too few bootstrap splits are used then the confidence intervals are not reliable.
Describe the solution you'd like
In the UnivariateSimulator classes, emit a warning if the number of bootstrap splits falling above or below the confidence interval is less than 25.
For example, given 1000 bootstrap splits and a confidence interval of (2.5%,97.5%) we will have 25 splits below the 2.5% threshold and 25 splits above the 97.5% threshold so no warning will be emitted.
Using fewer splits will generate a warning, recommending to tighten the confidence interval or to increase the number of splits.
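The proposed check can be sketched as follows. This only illustrates the arithmetic described above; the function name `check_tail_splits`, its parameters, and the warning text are assumptions, not FACET's API.

```python
import math
import warnings

MIN_SPLITS_PER_TAIL = 25  # threshold proposed in this request


def check_tail_splits(n_splits: int, lower_pct: float, upper_pct: float) -> int:
    """Return the number of splits expected in the smaller tail; warn if below 25."""
    below = n_splits * lower_pct / 100.0
    above = n_splits * (100.0 - upper_pct) / 100.0
    # Small epsilon guards against floating-point error just under an integer.
    smallest_tail = math.floor(min(below, above) + 1e-9)
    if smallest_tail < MIN_SPLITS_PER_TAIL:
        warnings.warn(
            f"only {smallest_tail} bootstrap splits fall outside the confidence "
            f"interval ({lower_pct}%, {upper_pct}%); tighten the interval or "
            f"increase the number of splits",
            UserWarning,
        )
    return smallest_tail


print(check_tail_splits(1000, 2.5, 97.5))  # 25 splits per tail, no warning
print(check_tail_splits(200, 2.5, 97.5))   # 5 splits per tail, emits a warning
```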
Describe the bug
Running pytest on the test suite currently produces 25 warnings on develop.
To Reproduce
See here for the latest run via Azure Pipelines (or run pytest locally).
Expected behavior
No warnings, if possible.
Describe the bug
README.rst mentions details of the dataset being used, but then imports a CSV that users might not have. Given that the dataset is from sklearn, the code block could load it directly to make it more usable.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Automated dataload directly from sklearn, with no read_csv step
Screenshots
(https://user-images.githubusercontent.com/91214098/134666712-7c22ecca-a556-42d1-9270-10b5e54c5eb5.png)
Is your feature request related to a problem? Please describe.
Hello,
First, congratulations on the library; these are useful concepts.
I would like to know whether there are plans to support other ML libraries such as TensorFlow / PyTorch. I don't see any limitations.
Is your feature request related to a problem? Please describe.
I've encountered an issue when trying to use scikit-learn models with the facet library, specifically with the LearnerInspector class. The library currently only supports models of type SupervisedLearnerPipelineDF or SupervisedLearnerDF.
Describe the solution you'd like
I'd appreciate it if the facet library could extend support to include scikit-learn models. The API could potentially look like this:
from facet.inspection import LearnerInspector
inspector = LearnerInspector(
    model=scikit_learn_model_instance,
    n_jobs=-3
).fit(sample_data)
In this example, scikit_learn_model_instance could be any model instance from the scikit-learn library (for example, RandomForestRegressor).
Describe alternatives you've considered
One alternative is to manually adapt scikit-learn models to a format compatible with the facet library. However, native support would streamline the process, especially for those heavily using scikit-learn.
Additional context
Extending support for scikit-learn models would enhance the utility of the facet library, especially for data scientists and ML engineers who frequently use scikit-learn.
I am also interested in the SAGE value, which measures the contribution to global prediction accuracy, compared to SHAP, which measures the importance of local predictions.
We believe that if SRI decomposition can contribute to prediction accuracy, it will be an important advance in terms of model interpretability and feature selection.
On the other hand, I understand that it is not easy to support SAGE, which is a different concept, but is it likely to be realized?
https://pypi.org/project/sage-importance/
https://iancovert.com/blog/understanding-shap-sage/
Is your feature request related to a problem? Please describe.
When sample weights are applied to a classifier, the up-weighting of one class is reflected in higher predicted probabilities than were observed in the unweighted data.
Describe the solution you'd like
There are two aspects to an ideal solution:
ClassifierPipelineDF class, where the default is set to true but can be turned off if desired. This could help ensure that, even with weights applied during learning, the probabilities shown in the simulation outputs align reasonably well with those observed in the data.
Describe alternatives you've considered
None.
Additional context
None.
Hello,
I installed gamma-facet==1.0.1, together with shap==0.38.1 (both latest versions).
When calling
from facet.inspection import LearnerInspector
I get the error
ModuleNotFoundError: No module named 'shap.explainers.explainer'
at line 13 of python3.8/site-packages/facet/inspection/_explainer.py.
Are you able to provide a fix for this?
Thanks!
Hi,
Thanks for sharing the package, it is very interesting. I'm trying to understand how synergy works in more detail, but I haven't found a satisfactory answer so far.
1. If synergy (as well as redundancy) relies on the SHAP interaction values, which are symmetric, how do you make it asymmetric ? How would you describe the big steps to compute this metric ?
Intuitively, I better understand what a symmetric interaction is - in a certain way, it quantifies what are the additional contributions on the output when the features are together - but giving an asymmetric definition is harder. In the examples, you mention terms such as "the feature is autonomous" or "a feature gets you much closer to a prediction than another" but it is a bit vague and/or I don't see how this is related to feature interaction, but rather to feature importance / correlation.
2. From a practical standpoint, redundancy is a bit easier to understand and can lead to some feature selection (basically if features share the same info, you could think of removing one), but what are the implications of synergy ?
If for example a feature pair has high synergy in one direction and low in the other, what should I conclude ? Should I do some feature selection with it ?
3. Also, if you have some details about the math behind it (a paper or a description), it would be great ! I looked into the code but it is a bit hard to understand it from there
Thanks a lot !
Is your feature request related to a problem? Please describe.
Hey, I saw that the UnivariateProbabilitySimulator is scheduled for a future release. It would be beneficial for the NHO tutorials to have it available.
Describe the solution you'd like
We would like to use a Univariate simulation for change in average predicted probability (CAPP) based on a classification model.
Describe the bug
The inspector fails to fit for a GradientBoostingClassifier and throws an AssertionError: 1 outputs named ['0', '1']. This is the only classifier for which I have observed this behaviour.
To Reproduce
Using a notebook within the facet-develop env the following code will reproduce the error:
# imports
import pandas as pd
import numpy as np
from facet import Sample
from facet.inspection import LearnerInspector
from facet.selection import LearnerGrid, LearnerRanker
from facet.validation import BootstrapCV
from sklearndf.pipeline import ClassifierPipelineDF
from sklearndf.classification import GradientBoostingClassifierDF
from sklearn.datasets import make_classification
# simulate some data
X, y = make_classification(n_samples=200, n_features=5, n_informative=5, n_redundant=0, random_state=42)
y_df = pd.DataFrame(y, columns=['target'])
X_df = pd.DataFrame(X, columns=['f1', 'f2', 'f3', 'f4', 'f5'])
sim_df = pd.concat([X_df, y_df], axis=1)
# create sample object
sample_df = Sample(observations=sim_df, target='target')
# create grid
grid = [
    LearnerGrid(
        pipeline=ClassifierPipelineDF(classifier=GradientBoostingClassifierDF(random_state=42)),
        learner_parameters={}
    )
]
# fit the learner ranker
ranker = LearnerRanker(
    grids=grid,
    cv=BootstrapCV(n_splits=5, random_state=42),
    n_jobs=-1,
    verbose=2,
    scoring="roc_auc",
).fit(sample=sample_df)
# fit the inspector
LearnerInspector(n_jobs=-1).fit(crossfit=ranker.best_model_crossfit)
You should then see the error:
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\externals\loky\process_executor.py", line 418, in _process_worker
r = call_item()
File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\externals\loky\process_executor.py", line 272, in __call__
return self.fn(*self.args, **self.kwargs)
File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\_parallel_backends.py", line 567, in __call__
return self.func(*args, **kwargs)
File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]
File "c:\projects\nho_facet\powerco_nho\facet\src\facet\inspection\_shap.py", line 523, in _shap_for_split
), f"{len(shap_interaction_tensors)} outputs named {multi_output_names}"
AssertionError: 1 outputs named ['0', '1']
"""
The above exception was the direct cause of the following exception:
AssertionError Traceback (most recent call last)
<ipython-input-3-7b5a54a21400> in <module>
18
19 # fit the inspector
---> 20 LearnerInspector(n_jobs=-1).fit(crossfit=ranker.best_model_crossfit)
c:\projects\nho_facet\powerco_nho\facet\src\facet\inspection\_inspection.py in fit(self, crossfit, **fit_params)
244 shap_decomposer = ShapValueDecomposer()
245
--> 246 shap_calculator.fit(crossfit=crossfit)
247 shap_decomposer.fit(shap_calculator=shap_calculator)
248
c:\projects\nho_facet\powerco_nho\facet\src\facet\inspection\_shap.py in fit(self, crossfit, **fit_params)
129 # calculate shap values and re-order the observation index to match the
130 # sequence in the original training sample
--> 131 shap_all_splits_df: pd.DataFrame = self._shap_all_splits(crossfit=crossfit)
132
133 assert shap_all_splits_df.index.nlevels > 1
c:\projects\nho_facet\powerco_nho\facet\src\facet\inspection\_shap.py in _shap_all_splits(self, crossfit)
226 else (
227 training_sample.subsample(iloc=oob_split)
--> 228 for _, oob_split in crossfit.splits()
229 )
230 ),
C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
932
933 with self._backend.retrieval_context():
--> 934 self.retrieve()
935 # Make sure that we get a last message telling us we are done
936 elapsed_time = time.time() - self._start_time
C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\parallel.py in retrieve(self)
831 try:
832 if getattr(self._backend, 'supports_timeout', False):
--> 833 self._output.extend(job.get(timeout=self.timeout))
834 else:
835 self._output.extend(job.get())
C:\Anaconda3\envs\nho_powerco\lib\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
519 AsyncResults.get from multiprocessing."""
520 try:
--> 521 return future.result(timeout=timeout)
522 except LokyTimeoutError:
523 raise TimeoutError()
C:\Anaconda3\envs\nho_powerco\lib\concurrent\futures\_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
C:\Anaconda3\envs\nho_powerco\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
AssertionError: 1 outputs named ['0', '1']
Expected behavior
Expect the inspector to not throw an error and allow subsequent access to redundancy, synergy etc.
Desktop (please complete the following information):
Is your feature request related to a problem? Please describe.
For the feature affinity matrices (redundancy matrix, synergy matrix, association matrix), it's super helpful to have features already in an order that highlights clusters visually. However, in the dendrograms the order differs (due to an additional sorting step with respect to feature importance).
Describe the solution you'd like
Apply the dendrogram order to the matrices, so you can "rediscover" the already identified clusters when switching visualizations.
Describe alternatives you've considered
Could also adjust the dendrogram ordering to match the current matrix ordering, but I think it's helpful to guide the ordering by feature importance, also for the matrices.
Hey!
First of all, this seems like a fantastic library! The visualizations are top notch. I am trying to replicate the example for my YouTube channel (I hope that is allowed), but I have issues importing the libraries. You can find the colab link below:
https://colab.research.google.com/drive/1mYqg8b4VhKvw5d_ExFqE3nge_yr5G_ql?usp=sharing
Here is the code error:
ModuleNotFoundError Traceback (most recent call last)
in ()
4 from sklearndf.pipeline import RegressorPipelineDF
5 from sklearndf.regression import RandomForestRegressorDF
----> 6 from facet.data import Sample
7 from facet.selection import LearnerRanker, LearnerGrid
ModuleNotFoundError: No module named 'facet.data'; 'facet' is not a package
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
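The message `'facet' is not a package` means Python resolved the name `facet` to a plain module rather than to a package that can contain `facet.data`. Two common causes, neither confirmed for this notebook, are a local file named `facet.py` shadowing the library, or a different distribution having been installed under the same import name (another report on this page installs the library as `gamma-facet`). A small standard-library sketch for checking what a name resolves to; `describe_import` is a hypothetical helper:

```python
import importlib.util


def describe_import(name: str) -> str:
    """Report whether a top-level name resolves to a package, a module, or nothing."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not found"
    if spec.submodule_search_locations is None:
        # A plain module has no search path for submodules, hence "is not a package".
        return f"{name}: plain module at {spec.origin} (cannot contain submodules)"
    return f"{name}: package at {list(spec.submodule_search_locations)}"


print(describe_import("facet"))
```

If the output reports a plain module, the printed location shows which file is being picked up instead of the library.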
Describe the bug
I have trouble importing the LearnerInspector. When I try to import it, it throws the following error:
name 'catboost' is not defined
The code I use for this import is:
from facet.inspection import LearnerInspector
Update: the error also appears when I try to import facet.inspection (see full import output down below)
However, all of these import and (seem to) work without an issue:
from facet.data import Sample
from facet.selection import LearnerRanker, LearnerGrid
from facet.validation import BootstrapCV
from facet.data.partition import ContinuousRangePartitioner
from facet.simulation import UnivariateProbabilitySimulator
from facet.simulation.viz import SimulationDrawer
from facet.crossfit import LearnerCrossfit
Desktop (please complete the following information):
Below is the complete output when I try to import facet.inspection:
import facet.inspection
Traceback (most recent call last):
Input In [19] in <cell line: 1>
import facet.inspection
File ~\Anaconda3\lib\site-packages\facet\inspection\__init__.py:8 in
from ._explainer import *
File ~\Anaconda3\lib\site-packages\facet\inspection\_explainer.py:346 in
__tracker.validate()
File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:200 in validate
update_forward_references(obj, globals_=globals_)
File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:328 in update_forward_references
_update(obj)
File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:315 in _update
_update(member, local_ns=local_ns)
File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:319 in _update
_update_annotations(_obj, local_ns)
File ~\Anaconda3\lib\site-packages\pytools\api\_alltracker.py:322 in _update_annotations
annotations = get_type_hints(
File ~\Anaconda3\lib\typing.py:1469 in get_type_hints
value = _eval_type(value, globalns, localns)
File ~\Anaconda3\lib\typing.py:292 in _eval_type
ev_args = tuple(_eval_type(a, globalns, localns, recursive_guard) for a in t.args)
File ~\Anaconda3\lib\typing.py:292 in
ev_args = tuple(_eval_type(a, globalns, localns, recursive_guard) for a in t.args)
File ~\Anaconda3\lib\typing.py:290 in _eval_type
return t._evaluate(globalns, localns, recursive_guard)
File ~\Anaconda3\lib\typing.py:551 in _evaluate
eval(self.forward_code, globalns, localns),
File :1 in
NameError: name 'catboost' is not defined
#374 related.
Thank you. I have modified your code and considered non-linear models such as KernelRidge.
However, KernelRidge is naturally not compatible with TreeExplainerFactory, so I considered using KernelExplainerFactory or ExactExplainerFactory. However, since ExactExplainerFactory is not usable depending on the size of the dataset, I adopted KernelExplainerFactory(shap_interaction=True).
In this case, a RuntimeError occurs.
RuntimeError: SHAP interaction values have not been calculated. Create an inspector with parameter 'shap_interaction=True' to enable calculations involving SHAP interaction values.
Checking your implementation, it seems that KernelExplainerFactory does not compute shap_interaction.
facet/src/facet/explanation/_explanation.py
Line 377 in 66bea15
I have two questions.
1. For non-linear models, is it necessary to use ExactExplainerFactory and perform inspector.fit()? What should I do if the data size is large?
2. The specification that KernelExplainerFactory internally converts shap_interaction=True to False is confusing. Would it be better to throw an error if shap_interaction=True is specified, or change it so that the shap_interaction argument cannot be specified at all?
import pandas as pd
from sklearn.model_selection import RepeatedKFold, GridSearchCV
# some helpful imports from sklearndf
from sklearndf.pipeline import RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF
# relevant FACET imports
from facet.data import Sample
from facet.selection import LearnerSelector, ParameterSpace
from sklearn.datasets import load_diabetes
X,y = load_diabetes(return_X_y=True)
data = load_diabetes()
X = pd.DataFrame(X)
X.columns = data["feature_names"]
y = pd.DataFrame(y)
y.columns = ["target"]
diabetes_df = pd.concat([X,y], axis=1)
# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="target")
# fit a kernel ridge regressor (a non-tree model)
from sklearn.kernel_ridge import KernelRidge
model = KernelRidge()
model.fit(X,y)
# fit the model inspector
from facet.explanation import KernelExplainerFactory
from facet.inspection import NativeLearnerInspector
inspector = NativeLearnerInspector(
    model=model,
    explainer_factory=KernelExplainerFactory(),
    n_jobs=-3,
    shap_interaction=True
)
inspector.fit(diabetes_sample)
# visualise synergy as a matrix
from pytools.viz.matrix import MatrixDrawer
synergy_matrix = inspector.feature_synergy_matrix()
# visualise redundancy as a matrix
redundancy_matrix = inspector.feature_redundancy_matrix()
# visualise redundancy using a dendrogram
import matplotlib
from pytools.viz.dendrogram import DendrogramDrawer
redundancy = inspector.feature_redundancy_linkage()
I just read the announcement of this project here. It contains the following quote:
BCG GAMMA FACET Helps Human Operators Understand Advanced Machine Learning Models So They Can Make Better and More Ethical Decisions
When I go to this repo (which isn't linked in the blog post, by the way) I find a demo that uses the load_boston dataset to explain how to use the tool. It seems to focus on the LSTAT attribute of this dataset while failing to acknowledge a bigger issue related to ethics, namely the B attribute. According to the docs it refers to:
B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
Given that this tool is marketed with themes of ethics and understanding, and that it is backed by an international consultancy company, this is really not cool. This tool is marketed to help people make more ethical decisions. So why does the guide present a model that uses skin color to determine house prices without mentioning it? For a guide that could be seen as an authoritative source on how to handle "ethical decisions" this is really dubious.
Note that this dataset is up for removal from scikit-learn because of this controversy and it's also something that's been pointed out at many conferences. Here is one talk from me if you're interested.
Please replace the guide with an example that is more fitting or at the very least acknowledges the issues with the variable.
It appears the SHAP summary plot has the 'Feature Value' inverted. Your classification example has 'Waist_to_Height' positively correlated with 'Prediabetes'.
From the SHAP output, you can see that higher values of 'Waist_to_Height' (values in red) have a negative impact on the model output, which is the opposite.
I've also tested this by running a separate RF model outside of FACET to get the SHAP output; the non-FACET SHAP outputs are as expected and not inverted.
Hi,
Thanks for making this tool openly available. Very cool.
While I could get it to run, I do face some issues due to versioning and compatibility that I wanted to report:
(1) When following the conda installation instructions, one will install 2.0rc2. This breaks the Quickstart Tutorial:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[1], line 11
9 # relevant FACET imports
10 from facet.data import Sample
---> 11 from facet.selection import LearnerRanker, LearnerGrid
13 # declaring url with data
14 data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'
ImportError: cannot import name 'LearnerRanker' from 'facet.selection' (C:\Users\admin\.conda\envs\jakob-facet\lib\site-packages\facet\selection\__init__.py)
(2) Ideally, I want to run facet with XGBoost. While the core functionality is in sklearndf 2.1.0, facet pins this to version ~= 1.2.
Unfortunately, the newer sklearndf is not compatible, see the following error running 2.1.0 in facet 1.2.3
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[6], line 54
51 rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
53 # rank your candidate models by performance (default is mean CV score - 2*SD)
---> 54 ranker = LearnerRanker(
55 grids=rnd_forest_grid, cv=rkf_cv, n_jobs=-3
56 ).fit(sample=diabetes_sample)
58 # get summary report
59 ranker.summary_report()
File ~/miniconda3/envs/facet/lib/python3.8/site-packages/facet/selection/_selection.py:400, in LearnerRanker.fit(self, sample, **fit_params)
387 """
388 Rank the candidate learners and their hyper-parameter combinations using
389 crossfits from the given sample.
(...)
396 :return: ``self``
397 """
398 self: LearnerRanker[T_LearnerPipelineDF] # support type hinting in PyCharm
--> 400 ranking: List[LearnerEvaluation[T_LearnerPipelineDF]] = self._rank_learners(
401 sample=sample, **fit_params
402 )
403 ranking.sort(key=lambda le: le.ranking_score, reverse=True)
405 self._ranking = ranking
...
--> 873 obj = super().__new__(cls)
874 else:
875 obj = super().__new__(cls, *args, **kwds)
TypeError: Can't instantiate abstract class _FitScoreQueue with abstract methods aggregate
Is there a workaround to get it to run with XGBoost?
(3) On another note: I did experience a quadratic increase in memory consumption with the number of features. For a workstation with 128 GB RAM this effectively limits me to ~100 features (@ ~ 5000 rows). Is this expected?
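A back-of-the-envelope estimate suggests why memory might grow quadratically. SHAP interaction values form one features-by-features matrix per observation, so a dense array has rows × features² entries. The sketch below assumes dense float64 storage and that one such array is held per cross-validation split (50 splits for the `RepeatedKFold(n_splits=5, n_repeats=10)` setup in the quickstart). Both are assumptions about FACET's internals rather than confirmed behaviour, and the helper name is hypothetical.

```python
def interaction_tensor_gib(
    n_rows: int,
    n_features: int,
    n_splits: int = 1,
    bytes_per_value: int = 8,  # float64
) -> float:
    """Estimate the size in GiB of dense SHAP interaction values."""
    n_values = n_rows * n_features ** 2 * n_splits
    return n_values * bytes_per_value / 2 ** 30


# Doubling the number of features quadruples the estimate.
for n_features in (25, 50, 100, 200):
    size = interaction_tensor_gib(5000, n_features, n_splits=50)
    print(f"{n_features:>4} features: {size:8.1f} GiB")
```

Intermediate copies made during aggregation or parallel execution would add to this figure, so it is best read as a lower bound under the stated assumptions.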
Is your feature request related to a problem? Please describe.
After using the model inspector it would be helpful to have an easy way to access the SHAP values and data for use with plotting methods that are part of the base shap library in python.
Describe the solution you'd like
Add methods to the model inspector to allow users to obtain the SHAP values and associated dataset
Describe alternatives you've considered
None.
Additional context
None.
Is your feature request related to a problem? Please describe.
In order to facilitate the investigation of simulations, it would be great to be able to obtain the outcome distribution for each partition value.
Describe the solution you'd like
The best way would be to either obtain a list of percentiles of the distribution or the full distribution itself
Is your feature request related to a problem? Please describe.
The number of n_splits used in the crossfit impacts the coverage of observations inspected for calculating SHAP values. With low coverage, the number of rows in the consolidated SHAP matrix is less than the number of observations.
Describe the solution you'd like
The ideal solution has a few elements:
Describe alternatives you've considered
None - the above solution is the minimum requirement.
Additional context
As an example using 500 simulated data points, we can see that in the extreme case of using n_splits = 1, we find the SHAP analysis covers 40% of observations:
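The 40% figure is close to what bootstrap sampling would predict, assuming SHAP values are computed on out-of-bag observations only (a traceback in an earlier report on this page iterates over `oob_split`, but this is an inference rather than documented behaviour). Drawing n observations with replacement leaves each observation out of bag with probability (1 - 1/n)^n, roughly 0.37, and the chance of being out of bag at least once over k independent splits is 1 - (1 - p)^k. A minimal sketch; the function name is hypothetical:

```python
def expected_oob_coverage(n_observations: int, n_splits: int) -> float:
    """Expected share of observations that are out of bag in at least one split."""
    # Probability that one observation is never drawn in a single bootstrap sample.
    p_oob = (1.0 - 1.0 / n_observations) ** n_observations
    # Probability that it is out of bag in at least one of n_splits samples.
    return 1.0 - (1.0 - p_oob) ** n_splits


for n_splits in (1, 5, 10, 20):
    print(n_splits, round(expected_oob_coverage(500, n_splits), 3))
```

Under these assumptions, coverage rises quickly with the number of splits, which could inform a minimum recommended value for n_splits.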
Describe the bug
This is not an exhaustive list, but I cannot run:
from facet.inspection import LearnerInspector
from facet.selection import LearnerRanker, LearnerGrid
But this runs:
from facet.data import Sample
Which means gamma-facet is installed.
I copied and pasted this code from the Facet GitHub repo: https://github.com/BCG-Gamma/facet
Describe the bug
When inspecting a binary classifier, the raw_shap_tensor of class 0 does not equal the negated raw_shap_tensor of class 1.
It appears that the absolute difference can reach up to 10^-2.
The bug arises in the function raw_shap_to_df.
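For a binary classifier explained on a probability scale, the two class probabilities sum to 1, so the SHAP values of class 0 would be expected to mirror those of class 1. A minimal sketch of such a symmetry check on plain nested lists follows. The function name and the sample numbers are made up for illustration; the real tensors would come from the inspector.

```python
from typing import Sequence


def max_asymmetry(
    shap_class_0: Sequence[Sequence[float]],
    shap_class_1: Sequence[Sequence[float]],
) -> float:
    """Largest absolute value of (class 0 + class 1) over all observations and features."""
    return max(
        abs(value_0 + value_1)
        for row_0, row_1 in zip(shap_class_0, shap_class_1)
        for value_0, value_1 in zip(row_0, row_1)
    )


# Illustrative values only: rows are observations, columns are features.
class_0 = [[0.10, -0.20], [0.05, 0.00]]
class_1 = [[-0.10, 0.21], [-0.05, 0.00]]

# A result near zero means the tensors mirror each other; here one entry deviates.
print(max_asymmetry(class_0, class_1))
```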
To Reproduce
Steps to reproduce the behavior:
4-Facet-modeling-NewAPI
Expected behavior
A clear and concise description of what you expected to happen.
Desktop (please complete the following information):
Is your feature request related to a problem? Please describe.
When looking at the summary output from the LearnerRanker(), it would be great to have an alternative to print(ranker.summary_report(5)), which prints the top 5 models, for example.
One way to support further user-generated summaries would be to output a Pandas DataFrame. This would allow flexibility for subsequent uses, for example, writing CSV files for reports or creating summary figures of performance. The ability to store a DataFrame would also allow users to combine it with similar DataFrames from future runs when updating models, to see the changes, etc.
Describe the solution you'd like
One option could be to allow exporting the ranker.summary_report(5) to a Pandas DataFrame with something like rank_summary = ranker.summary_report(as_dataframe=True). I would then expect this to produce something along the lines of the following as a Pandas DataFrame.
Rank | Learner | Ranking_score | Mean_score | SD_score | Tuned_parameters | N_folds | Scoring_metric |
---|---|---|---|---|---|---|---|
1 | LGBMClassifierDF | 0.656 | 0.680 | 0.0122 | classifier__n_estimators=400 | 10 | roc_auc |
2 | LGBMClassifierDF | 0.655 | 0.677 | 0.0111 | classifier__n_estimators=500 | 10 | roc_auc |
3 | RandomForestClassifierDF | 0.650 | 0.695 | 0.0224 | classifier__n_estimators=200 | 10 | roc_auc |
4 | RandomForestClassifierDF | 0.647 | 0.696 | 0.0244 | classifier__n_estimators=300 | 10 | roc_auc |
5 | RandomForestClassifierDF | 0.646 | 0.697 | 0.0255 | classifier__n_estimators=400 | 10 | roc_auc |
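The ranking scores in the table are consistent with the default noted in the quickstart code on this page: mean CV score minus two standard deviations. Below is a standard-library sketch that recomputes them from the mean and SD columns and writes the ranked rows as CSV. The field names follow the table above, and nothing here is FACET's API.

```python
import csv
import io

# Mean and SD values copied from the table above.
rows = [
    {"Learner": "RandomForestClassifierDF", "Mean_score": 0.697, "SD_score": 0.0255},
    {"Learner": "LGBMClassifierDF", "Mean_score": 0.680, "SD_score": 0.0122},
    {"Learner": "RandomForestClassifierDF", "Mean_score": 0.696, "SD_score": 0.0244},
    {"Learner": "LGBMClassifierDF", "Mean_score": 0.677, "SD_score": 0.0111},
    {"Learner": "RandomForestClassifierDF", "Mean_score": 0.695, "SD_score": 0.0224},
]


def ranking_score(mean_score: float, sd_score: float) -> float:
    """Penalise unstable models: mean CV score minus two standard deviations."""
    return mean_score - 2 * sd_score


for row in rows:
    row["Ranking_score"] = round(ranking_score(row["Mean_score"], row["SD_score"]), 3)

# Highest ranking score first.
ranked = sorted(rows, key=lambda row: row["Ranking_score"], reverse=True)

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["Rank", "Learner", "Ranking_score", "Mean_score", "SD_score"],
)
writer.writeheader()
for rank, row in enumerate(ranked, start=1):
    writer.writerow({"Rank": rank, **row})

print(buffer.getvalue())
```

A tabular export along these lines is what would make it straightforward to combine reports from several runs.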
Describe alternatives you've considered
Have not considered alternatives.
Describe the bug
I am using a dataset with 130 columns and 1000 rows. The steps below keep running for more than an hour with no results produced.
I also tried with just 20 columns and 500 rows; the behavior is the same.
Step 1:
from facet.inspection import LearnerInspector
inspector = LearnerInspector()
inspector.fit(crossfit=ranker.best_model_crossfit_)
Step 2:
boot_crossfit = LearnerCrossfit(
    pipeline=ranker.best_model_,
    cv=bscv,
    n_jobs=-3,
    verbose=True,
).fit(sample=df_sample)
Desktop (please complete the following information):
I have encountered an issue where the attribute _ensure_fitted, which LearnerRanker calls, is missing.
This also happens when I run the example code:
# standard imports
import pandas as pd
from sklearn.model_selection import RepeatedKFold
# some helpful imports from sklearndf
from sklearndf.pipeline import RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF
# relevant FACET imports
from facet.data import Sample
from facet.selection import LearnerRanker, LearnerGrid
# declaring url with data
data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'
# importing data from url
diabetes_df = pd.read_csv(data_url, delimiter='\t').rename(
    # renaming columns for better readability
    columns={
        'S1': 'TC',  # total serum cholesterol
        'S2': 'LDL',  # low-density lipoproteins
        'S3': 'HDL',  # high-density lipoproteins
        'S4': 'TCH',  # total cholesterol / HDL
        'S5': 'LTG',  # possibly log of serum triglycerides level
        'S6': 'GLU',  # blood sugar level
        'Y': 'Disease_progression'  # measure of progress since 1yr of baseline
    }
)
# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="Disease_progression")
# create a (trivial) pipeline for a random forest regressor
rnd_forest_reg = RegressorPipelineDF(
    regressor=RandomForestRegressorDF(n_estimators=200, random_state=42)
)
# define grid of models which are "competing" against each other
rnd_forest_grid = [
    LearnerGrid(
        pipeline=rnd_forest_reg,
        learner_parameters={
            "min_samples_leaf": [8, 11, 15],
            "max_depth": [4, 5, 6],
        }
    ),
]
# create repeated k-fold CV iterator
rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
# rank your candidate models by performance (default is mean CV score - 2*SD)
ranker = LearnerRanker(
    grids=rnd_forest_grid, cv=rkf_cv, n_jobs=-3
).fit(sample=diabetes_sample)
# get summary report
ranker.summary_report()
It is the last line, ranker.summary_report(), that produces the error:
AttributeError: 'LearnerRanker' object has no attribute '_ensure_fitted'
Indeed, when I check the presence of the attribute it yields 'False':
print(hasattr(LearnerRanker, '_ensure_fitted'))
If I check the presence of ensure_fitted instead of _ensure_fitted, it yields 'True':
print(hasattr(LearnerRanker, 'ensure_fitted'))
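The mechanism behind this error can be reproduced in isolation. The class below is a hypothetical stand-in, not FACET code: it defines a public ensure_fitted method while another method still calls the private name _ensure_fitted, which is one way the two hasattr results above and the AttributeError could arise together.

```python
class ToyRanker:
    """Hypothetical stand-in for the reported situation; not FACET code."""

    def ensure_fitted(self) -> None:
        # Public method that exists on the class.
        pass

    def summary_report(self) -> str:
        # Calls a private name that the class does not define.
        self._ensure_fitted()
        return "report"


has_private = hasattr(ToyRanker, "_ensure_fitted")
has_public = hasattr(ToyRanker, "ensure_fitted")
print(has_private, has_public)

try:
    ToyRanker().summary_report()
    message = ""
except AttributeError as err:
    message = str(err)
    print(message)
```

Whether the real cause is a renamed method, for example through mismatched package versions, is not confirmed by the report.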
Is your feature request related to a problem? Please describe.
It would be great to have a simulator for absolute target values instead of making the uplift computation for regressors
Describe the solution you'd like
Being able to call a UnivariateTargetSimulator instead of a UnivariateUpliftSimulator
When inspecting a binary classifier, the raw_shap_tensor of class 0 does not equal -raw_shap_tensor of class 1.
The absolute difference appears to reach up to 10^-2.
The bug arises in the function _raw_shap_to_df.
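The check described here can be sketched with NumPy. The helper name max_antisymmetry_gap and the arrays are made up for illustration; with real data, the two inputs would be the raw SHAP tensors of class 0 and class 1.

```python
import numpy as np


def max_antisymmetry_gap(shap_class_0, shap_class_1) -> float:
    """Largest absolute value of shap_class_0 + shap_class_1.

    If class 0 were exactly the negative of class 1, this would be 0.
    """
    total = np.asarray(shap_class_0, dtype=float) + np.asarray(shap_class_1, dtype=float)
    return float(np.max(np.abs(total)))


# Made-up values: one entry deviates from perfect antisymmetry by 0.01.
shap_0 = [[0.10, -0.20], [0.05, 0.00]]
shap_1 = [[-0.10, 0.21], [-0.05, 0.00]]

gap = max_antisymmetry_gap(shap_0, shap_1)
print(f"max |shap_0 + shap_1| = {gap:.4f}")
```

A gap on the order of 10^-2, as reported, is well above floating-point noise, so it is unlikely to be a pure rounding effect.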
Describe the bug
The Sphinx docs do not build because the pytools script make_base is missing if pytools is not in the same parent directory.
To Reproduce
Steps to reproduce the behavior:
environment.yml
In the sphinx directory, run python make.py html
Traceback (most recent call last):
  File "make.py", line 30, in <module>
    make()
  File "make.py", line 24, in make
    from make_base import make
ModuleNotFoundError: No module named 'make_base'
Expected behavior
Docs should build.
Desktop (please complete the following information):
Additional context
Likely this is just a note in contributing.md to make sure that the remotes are all present before trying to build the docs.