scikit-learn-contrib / skope-rules

machine learning with logical rules in Python

Home Page: http://skope-rules.readthedocs.io

License: Other

Batchfile 2.61% Shell 4.34% Jupyter Notebook 60.55% Python 32.50%

skope-rules's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization for gathering high-quality, scikit-learn-compatible projects. It also provides a template for establishing new scikit-learn-compatible projects.

Vision

With the explosion in the number of machine learning papers, it becomes increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own problems. The goal of scikit-learn-contrib is to provide easy-to-install, easy-to-use, high-quality machine learning software. With scikit-learn-contrib, users can install a project with pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.
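A minimal sketch of that workflow, using skope-rules (this repository) as the example project; the feature names below are illustrative:

from sklearn.datasets import load_iris
from skrules import SkopeRules

X, y = load_iris(return_X_y=True)
clf = SkopeRules(feature_names=['f0', 'f1', 'f2', 'f3'])
clf.fit(X, y == 0)      # the usual scikit-learn fit...
pred = clf.predict(X)   # ...and predict methods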

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

DenMune: A simple-but-efficient density-based clustering algorithm that can find clusters of arbitrary size, shape and density in two dimensions. Higher-dimensional data are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by Mohamed Abbas.

lightning: Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

skglm: Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

py-earth: A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

imbalanced-learn: Python module to perform under-sampling and over-sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

polylearn: Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

forest-confidence-interval: Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

hdbscan: A high-performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

category_encoders: A library of sklearn-compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner.

boruta_py: Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola.

sklearn-pandas: Pandas integration with sklearn.

Maintained by Israel Saeta Pérez.

skope-rules: Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

stability-selection: A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens.

metric-learn: Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.

skope-rules's People

Contributors

8bit-pixies, agramfort, arokem, bryandeng, datajms, fabianp, floriangardin, kjacks21, lawwu, mechcoder, mrahim, ngoix, phylliade, stat17-hb, timstaley, tomdlt, vighneshbirodkar


skope-rules's Issues

how to work with categorical data

Does it work with categorical data? Do you have an example?
Or should categorical data be one-hot encoded first, so that it can then be treated as continuous?
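One common approach, not specific to skope-rules, is exactly what the question suggests: one-hot encode the categorical columns before fitting. A minimal sketch (the column names below are illustrative):

import pandas as pd
from skrules import SkopeRules

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                   'size': [1.0, 2.5, 0.7, 3.1]})
y = [1, 0, 1, 0]

# One-hot encode categoricals; the resulting 0/1 columns can be
# treated as continuous by the underlying decision trees.
X = pd.get_dummies(df, columns=['color'])
clf = SkopeRules(feature_names=list(X.columns))
clf.fit(X.values, y)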

documentation

Documentation:

rules_ : dict of tuples (rule, precision, recall, nb).

Does anyone know what nb is in this context? It appears to be an integer. Is it the number of positive cases identified by the rule?

issue in mask indexing

Hi, thank you for sharing this great package.

However, I think I might have found a mistake in the mask indexing.

mask = ~samples

samples is a NumPy array of integer indices, and applying ~ to it computes the bitwise NOT, -(value+1), for each element rather than a boolean complement.

For example:

samples = np.array([1, 2, 3, 4])
~samples
# array([-2, -3, -4, -5])

Please check this issue.

Thanks!
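For reference, building a boolean out-of-bag mask from integer in-bag indices would look something like the following sketch (variable names are illustrative):

import numpy as np

n_samples = 6
samples = np.array([1, 2, 3, 4])   # in-bag indices

mask = np.ones(n_samples, dtype=bool)
mask[samples] = False              # True only for out-of-bag rows
# mask -> [ True, False, False, False, False, True]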

ImportError: cannot import name 'Iterable' from 'collections'

Python 3.10
skope-rules==1.0.1

Error

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Input In [1], in <module>
     15 from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
     16 from interpret.glassbox import ExplainableBoostingClassifier
---> 17 from skrules import SkopeRules

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/skrules/__init__.py:1, in <module>
----> 1 from .skope_rules import SkopeRules
      2 from .rule import Rule, replace_feature_name
      4 __all__ = ['SkopeRules', 'Rule']

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/skrules/skope_rules.py:2, in <module>
      1 import numpy as np
----> 2 from collections import Counter, Iterable
      3 import pandas
      4 import numbers

ImportError: cannot import name 'Iterable' from 'collections' (/t/pyenv/versions/3.10.2/lib/python3.10/collections/__init__.py)

Increase variety of rules with a parameter grid

Hello,

Currently, the arguments of the SkopeRules object are propagated to all decision trees in its bagging classifier. This means that all the trees share the same parameters (except for max_depth, where a list of depths can be passed); the only differences are the samples they fit on.

It would probably make sense to also accept a grid of parameters: an optional dict argument, grid_parameters, on the SkopeRules object, storing possible values for each tree parameter. When specified, each tree would be built with a random combination of these parameters.

This might help produce more diversified rules for a reasonable number of estimators.
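A sketch of what drawing one such random combination per tree could look like (grid_parameters is hypothetical, not an existing SkopeRules argument):

import random

# Hypothetical per-tree parameter grid illustrating the proposal.
grid_parameters = {
    'max_depth': [2, 3, 4],
    'max_features': [0.5, 0.8, 1.0],
    'min_samples_split': [2, 10, 50],
}
tree_params = {k: random.choice(v) for k, v in grid_parameters.items()}
print(tree_params)  # e.g. {'max_depth': 3, 'max_features': 0.8, ...}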

The terms of the rules could be re-ordered in a better way, at almost no computational cost.

Currently, rule terms are ordered alphabetically by variable name.
Extra interpretability could be added by ordering the terms by decreasing "variable importance".

Reference: A Survey Of Methods For Explaining Black Box Models, R. Guidotti et al., Feb. 2018
Extract of the paper: "The interpretation of rules and decision trees is different with respect to different aspects. Decision trees are widely adopted for their graphical representation, while rules have a textual representation. The main difference is that textual representation does not provide immediately information about the more relevant attributes of a rule. On the other hand, the hierarchical position of the features in a tree gives this kind of clue.
Attributes’ relative importance could be added to rules by means of positional information. Specifically, rule conditions are shown by following the order in which the rule extraction algorithm added them to the rule. Even though the representation of rules causes some difficulties in understanding the whole model, it enables the study of single rules representing partial parts of the whole knowledge (“local patterns”) which are composable."

SyntaxError: Python keyword not valid identifier in numexpr query

When I add feature names to the SkopeRules model, I encounter this error.

Some of the feature names are:

data__blocked_bugs_number
data__ever_affected=False
data__ever_affected=True
data__has_crash_signature=False
data__has_crash_signature=True
data__has_github_url=False
data__has_github_url=True
data__has_str=irrelevant
data__has_str=no

Traceback (most recent call last):
  File "run.py", line 55, in <module>
    model.train()
  File "C:\Users\Saurabh Daalia\Desktop\bugbug\bugbug\model.py", line 101, in train
    self.skope_clf.fit(X_train, y_train)
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\skrules\skope_rules.py", line 350, in fit
    for r in set(rules_from_tree)]
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\skrules\skope_rules.py", line 350, in <listcomp>
    for r in set(rules_from_tree)]
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\skrules\skope_rules.py", line 600, in _eval_rule_perf
    detected_index = list(X.query(rule).index)
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3088, in query
    res = self.eval(expr, **kwargs)
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3203, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\pandas\core\computation\eval.py", line 294, in eval
    truediv=truediv)
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\pandas\core\computation\expr.py", line 749, in __init__
    self.terms = self.parse()
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\pandas\core\computation\expr.py", line 766, in parse
    return self._visitor.visit(self.expr)
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\pandas\core\computation\expr.py", line 327, in visit
    raise e
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\site-packages\pandas\core\computation\expr.py", line 321, in visit
    node = ast.fix_missing_locations(ast.parse(clean))
  File "C:\Users\Saurabh Daalia\Anaconda3\lib\ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
SyntaxError: Python keyword not valid identifier in numexpr query

sklearn.externals.six is Deprecated in Sklearn 0.23

Since sklearn.externals.six is deprecated as of version 0.23 (https://github.com/scikit-learn/scikit-learn/pull/12916/files), a fresh install of skope-rules with the latest version of sklearn will yield the following error:

ImportError: cannot import name 'six'

when running from skrules import SkopeRules, on line 12 of skope_rules.py.

Would it make sense to rely on the official version of six as the documentation suggests? If this is something the community is interested in, I'd be happy to open a PR.

The usage of six in this project seems to be extremely minimal, so the PR should be straightforward, assuming I am not overlooking any complexities.
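A hedged sketch of the proposed change: prefer the standalone six package and fall back to the vendored copy on older scikit-learn versions:

try:
    import six  # standalone package, the proposed dependency
except ImportError:
    from sklearn.externals import six  # vendored copy, sklearn < 0.23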

Not compatible with sklearn v1?

Minimal example:

>>> import sklearn
>>> sklearn.__version__
1.0.1
>>> import skrules
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-3-195b491d5645> in <module>
----> 1 import skrules

~/.virtualenvs/risk-modeling/lib/python3.9/site-packages/skrules/__init__.py in <module>
----> 1 from .skope_rules import SkopeRules
      2 from .rule import Rule, replace_feature_name
      3 
      4 __all__ = ['SkopeRules', 'Rule']

~/.virtualenvs/risk-modeling/lib/python3.9/site-packages/skrules/skope_rules.py in <module>
     10 from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
     11 from sklearn.ensemble import BaggingClassifier, BaggingRegressor
---> 12 from sklearn.externals import six
     13 from sklearn.tree import _tree
     14 

ImportError: cannot import name 'six' from 'sklearn.externals' (/home/mwintner/.virtualenvs/risk-modeling/lib/python3.9/site-packages/sklearn/externals/__init__.py)

According to some Stack Overflow sources like this one, six is no longer available in sklearn.externals beyond sklearn v0.23.

scoring methods name vs. sklearn API

As mentioned in #3, we have 3 different scoring methods: decision_function, rules_vote, score_top_rules.

In sklearn API, we have decision_function and score_samples, which are related through decision_function = score_samples + offset, offset being defined in a way that predict = decision_function > 0.

Maybe we should add a class parameter to choose one of these 3 functions at initialization, and return the chosen function in a method called score_samples (from which we can define decision_function)?

The 3 existing scoring functions could be renamed (e.g. _score_mean, _score_vote, _score_max) and kept private.

Any other suggestions?
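An illustrative-only sketch of the proposal; the scoring parameter, offset_ attribute and private _score_* methods shown here do not exist in SkopeRules today:

class RuleScorerMixin:
    def score_samples(self, X):
        # Dispatch to one of the three (renamed, private) scorers.
        scorer = {'mean': self._score_mean,
                  'vote': self._score_vote,
                  'max': self._score_max}[self.scoring]
        return scorer(X)

    def decision_function(self, X):
        # Related to score_samples through a constant offset, as in the sklearn API.
        return self.score_samples(X) - self.offset_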

Skope Rules should accept any kind of feature name

SkopeRules uses the pandas.eval method for evaluating semantic rules. This leads to errors when feature names contain special characters (e.g. parentheses, commas, '=', '-').
For example:

from sklearn.datasets import load_iris
from skrules import SkopeRules
dataset = load_iris()

X, y, features_names = dataset.data, dataset.target, dataset.feature_names
y = (y == 0)  # Predicting the first species vs. the rest
clf = SkopeRules(max_depth_duplication=2,
                 n_estimators=30,
                 precision_min=0.3,
                 recall_min=0.1,
                 feature_names=features_names)
clf.fit(X, y)

will lead to the following error:

Traceback (most recent call last):
  File "main.py", line 20, in <module>
    clf.fit(X, y)
  File "/usr/local/lib/python3.6/site-packages/skrules/skope_rules.py", line 350, in fit
    for r in set(rules_from_tree)]
  File "/usr/local/lib/python3.6/site-packages/skrules/skope_rules.py", line 350, in <listcomp>
    for r in set(rules_from_tree)]
  File "/usr/local/lib/python3.6/site-packages/skrules/skope_rules.py", line 600, in _eval_rule_perf
    detected_index = list(X.query(rule).index)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 2297, in query
    res = self.eval(expr, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 2366, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/computation/eval.py", line 290, in eval
    truediv=truediv)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 732, in __init__
    self.terms = self.parse()
  File "/usr/local/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 749, in parse
    return self._visitor.visit(self.expr)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/computation/expr.py", line 310, in visit
    node = ast.fix_missing_locations(ast.parse(clean))
  File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    petal length (cm )<=2.5999999046325684

Skope Rules should accept any kind of feature name. That means transforming feature names for computation and mapping them back at the end.
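Until then, a workaround (a hedged sketch, not part of the library) is to fit on sanitized names and map them back in the extracted rules:

import re
from sklearn.datasets import load_iris
from skrules import SkopeRules

dataset = load_iris()
X, y = dataset.data, dataset.target == 0

# Replace anything that is not a valid identifier character.
safe_names = [re.sub(r'\W+', '_', n).strip('_') for n in dataset.feature_names]

clf = SkopeRules(precision_min=0.3, recall_min=0.1,
                 feature_names=safe_names)
clf.fit(X, y)

# Map sanitized names back for display.
back = dict(zip(safe_names, dataset.feature_names))
for rule, perf in clf.rules_[:3]:
    for safe, original in back.items():
        rule = rule.replace(safe, original)
    print(rule, perf)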

what is nb?

https://skope-rules.readthedocs.io/en/latest/skope_rules.html

For example

skope_rules_clf.rules_

[('Pclass <= 2.5 and isFemale > 0.5', (0.9527320854603895, 0.5283115637180831, 6))]

rules_ (dict of tuples (rule, precision, recall, nb)): The collection of n_estimators rules used in the predict method. The rules are generated by fitted sub-estimators (decision trees). Each rule satisfies the recall_min and precision_min conditions. The selection is done according to OOB precisions.

estimators_ (list of DecisionTreeClassifier): The collection of fitted sub-estimators used to generate candidate rules.

estimators_samples_ (list of arrays): The subset of drawn samples (i.e., the in-bag samples) for each base estimator.

estimators_features_ (list of arrays): The subset of drawn features for each base estimator.

max_samples_ (integer): The actual number of samples.

n_features_ (integer): The number of features when fit is performed.

classes_ (array, shape (n_classes,)): The class labels.

Release new version to pypi.org?

There are a number of useful commits on the master branch, e.g. #24.

It's been more than 1.5 years since the last release. Would it be possible for you to upload a new package to pypi.org?

remove n_jobs=1 default

Even when n_jobs is not passed, skope-rules still uses joblib as per the logs. This is because n_jobs defaults to 1 within the SkopeRules class:

self.n_jobs = n_jobs

n_jobs=self.n_jobs,

This should default to None and not be passed to BaggingClassifier and BaggingRegressor when None, to avoid triggering joblib. Something like:

extra_kwargs = {}
if self.n_jobs is not None:
    extra_kwargs = {'n_jobs': self.n_jobs}
bagging_clf = BaggingClassifier(..., ..., **extra_kwargs)

If there's an easier way to do this, please let me know.

This will prevent joblib from triggering at all when n_jobs is None. Parallel-processing issues like #18 are much easier to debug when joblib can be enabled/disabled entirely.

Happy to submit a PR for this!

Error while loading credit data

I was trying to read the credit data from the example page. Pandas cannot read the Excel sheet because of the incorrect keyword argument sheetname, which should be renamed to sheet_name.

Adding the log trace for the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-d1105f71ff0d> in <module>
      1 # Importing data
----> 2 dataset = load_credit_data()
      3 X = dataset.data
      4 y = dataset.target
      5 # Shuffling data, preparing target and variables

/opt/conda/lib/python3.7/site-packages/skrules/datasets/credit_data.py in load_credit_data()
     37 
     38     data = pd.read_excel(join(sk_data_dir, archive.filename),
---> 39                          sheetname='Data', header=1)
     40 
     41     dataset = Bunch(

/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297 
    298         return wrapper

TypeError: read_excel() got an unexpected keyword argument 'sheetname'
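The fix is a one-keyword rename in credit_data.py; a sketch of the corrected call (sk_data_dir and archive come from the surrounding function, as in the traceback above):

import pandas as pd
from os.path import join

# sheetname= was removed from pandas; sheet_name= is the current keyword.
data = pd.read_excel(join(sk_data_dir, archive.filename),
                     sheet_name='Data', header=1)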

TerminatedWorkerError

I keep running into a TerminatedWorkerError when running clf.fit with skope-rules. I seem to have ample memory, so I'm unsure what's going on. Any potential ideas?

Traceback (most recent call last):
  File "experiment.py", line 171, in <module>
    result = process(topic)
  File "experiment.py", line 95, in process
    clf.fit(features, training_data_labels)
  File "/home/ubuntu/.local/share/virtualenvs/taxonomy-analysis2-BU9HWu51/lib/python3.7/site-packages/skrules/skope_rules.py", line 312, in fit
    clf.fit(X, y)
  File "/home/ubuntu/.local/share/virtualenvs/taxonomy-analysis2-BU9HWu51/lib/python3.7/site-packages/sklearn/ensemble/bagging.py", line 244, in fit
    return self._fit(X, y, self.max_samples, sample_weight=sample_weight)
  File "/home/ubuntu/.local/share/virtualenvs/taxonomy-analysis2-BU9HWu51/lib/python3.7/site-packages/sklearn/ensemble/bagging.py", line 378, in _fit
    for i in range(n_jobs))
  File "/home/ubuntu/.local/share/virtualenvs/taxonomy-analysis2-BU9HWu51/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 930, in __call__
    self.retrieve()
  File "/home/ubuntu/.local/share/virtualenvs/taxonomy-analysis2-BU9HWu51/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ubuntu/.local/share/virtualenvs/taxonomy-analysis2-BU9HWu51/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
sklearn.externals.joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}

Update/Alias for imports for latest sklearn version and python >3.10

As per #57 and also https://stackoverflow.com/questions/72032032/importerror-cannot-import-name-iterable-from-collections-in-python

The import paths for collections.Iterable (Python 3.10+) and sklearn.externals.six (recent scikit-learn) have changed.

Currently, to make skope-rules play nicely with the updated packages, I had to alias the paths:

import numpy as np
import collections.abc
import six
import sklearn
collections.Iterable = collections.abc.Iterable
sklearn.externals.six = six
from skrules import SkopeRules

Will there be an update for the paths/checks? Perhaps something like:

try:
    from collections.abc import Iterable
except ImportError:
    from collections import Iterable

or another conditional that checks the versions.

I can submit a PR for it if you are up for it.

The oob score

I think the OOB score computed in the fit function is wrong.

The authors get the OOB sample indices via "mask = ~samples" and then apply X[mask, :] to get the OOB samples.
Actually, I tested this and found that samples and X[mask, :] share many elements, and that the training set and the masked set have the same length. For example, if we have 100 samples in total and 80 are used to train the model, the number of OOB samples should be 100 - 80 = 20 (ignoring replacement).

I also looked at the OOB sampling implementation in RandomForest and found the following code (variable names made consistent here):

random_instance = check_random_state(random_state)
sample_indices = random_instance.randint(0, n_samples, n_samples)  # indices of training samples
sample_counts = np.bincount(sample_indices, minlength=n_samples)
unsampled_mask = sample_counts == 0
indices_range = np.arange(n_samples)
unsampled_indices = indices_range[unsampled_mask]  # indices of OOB samples

Then unsampled_indices contains the true OOB sample indices.

Questions about how to use and interpret rules?

  1. Can SkopeRules be used for multiclass classification, or only binary classification?

  2. How do I interpret the outputted decision rules? Do the top-k rules in the example notebook correspond to the rules that best classify the test data, ordered in descending order by precision? If I want to classify new test data, do I consider the top-1 rule, the majority vote from the top-k rules, or some other approach?

  3. If I want to understand the underlying method and how rules are computed, is Predictive Learning via Rule Ensembles by Friedman and Popescu the closest work?

.py~ included in tests

Just noting that a temporary file, test_common.py~, is in the repo.
.gitignore may need updating.

Possibility to round values displayed in rules

It would be good to add an argument to round feature values in .rules_; otherwise rules usually contain floats with very high precision, which is not convenient when working with a rule as a string.
Alternatively, the format could be changed to sublists of feature_name/operator/value to make this information easier to use in tables or graphs.
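As a stop-gap, the rounding can be done on the rule strings themselves; a hedged sketch using a regex (not part of the library):

import re

def round_rule(rule, digits=3):
    # Round every float literal in a rule string to `digits` significant digits.
    return re.sub(r'\d+\.\d+',
                  lambda m: f'{float(m.group()):.{digits}g}', rule)

print(round_rule('petal_length <= 2.5999999046325684'))
# -> petal_length <= 2.6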

Performance optimization

Pandas querying is very slow and can easily be replaced with traditional indexing.
Here is the code that causes the bottleneck:

def _eval_rule_perf(self, rule, X, y):
    detected_index = list(X.query(rule).index)

Profiling results:

1141.451 _eval_rule_perf  skrules/skope_rules.py:614
         └─ 1140.967 query  pandas/core/frame.py:3316

An example of an improved version:

tmp = X
for part_rule in rule.split('and '):
    name = part_rule.strip().split()[0]
    if '>' in part_rule:
        tmp = tmp[tmp[name] == 1]
    else:
        tmp = tmp[tmp[name] != 1]

Note: this code handles only the binary case; it should be generalized.

Profiling results

 8.658 <listcomp>  skrules/skope_rules.py:357
         └─ 8.609 _eval_rule_perf  skrules/skope_rules.py:614
            └─ 6.739 __getitem__  pandas/core/frame.py:2987

Classification vs regression

Hi
I think this package looks fantastic. I am wondering, however, whether there are any plans to implement SkopeRules for regression.

I've made a start on adding regression, and I had to make a lot of changes, improvising as I went through the code. I had to come up with measures comparable to precision and recall: the precision-like measure is based on the expected reduction in standard deviation; the recall-like measure is based on the z-score of the prediction versus the population of y. At the end, scores are combined via softmax-weighted rules. At the moment I still get a lot of NaNs in the predictions because there are not enough rules, and the overall MSE is still much worse than a linear-regression baseline.

I've also added comments and a test for regression. This is WIP, but I am happy for anyone to jump in.

Thanks!

No rules are generated using single-feature dataset

Hey there,

I wanted to use this package to derive very basic rules from one feature and one label.
It works well with more than one feature, but with only one feature the output is empty.
I also tried with the iris dataset and the code looks like this:

from sklearn.datasets import load_iris
from skrules import SkopeRules

dataset = load_iris()
feature_names = ['sepal_length']
clf = SkopeRules(max_depth_duplication=2,
                 n_estimators=30,
                 precision_min=0.3,
                 recall_min=0.1,
                 feature_names=feature_names)

for idx, species in enumerate(dataset.target_names):
    X, y = dataset.data, dataset.target
    clf.fit(X[:, 0].reshape(-1, 1), y == idx)
    rules = clf.rules_[0:3]
    print("Rules for iris", species)
    for rule in rules:
        print(rule)
    print()
    print(20*'=')
    print()

why is numpy scipy required during installation in setup.py

Can we remove the following from setup.py? (Happy to submit a PR.)

try:
    import numpy
except ImportError:
    print('numpy is required during installation')
    sys.exit(1)

try:
    import scipy
except ImportError:
    print('scipy is required during installation')
    sys.exit(1)

pipenv uses python setup.py egg_info to parse the dependency tree, and this command fails on the first run when numpy and scipy are not yet installed. Fortunately, it does a second pass over failed dependencies and is then able to install skope-rules.

Why is specifying numpy and scipy as dependencies not sufficient? Is there a reason they are "required during installation"?
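A hedged sketch of the suggested simplification: drop the import checks and declare the packages as ordinary dependencies (assuming nothing in the build itself needs numpy headers):

# setup.py (sketch): declare runtime dependencies instead of
# hard-failing when numpy/scipy are not yet installed.
from setuptools import setup, find_packages

setup(
    name='skope-rules',
    packages=find_packages(),
    install_requires=['numpy', 'scipy', 'scikit-learn', 'pandas'],
)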
