parrt / random-forest-importances

Code to compute permutation and drop-column importances in Python scikit-learn models

License: MIT License

Jupyter Notebook 65.83% CSS 0.06% HTML 33.09% Python 1.03%

random-forest-importances's Introduction

Feature importances for scikit-learn machine learning models

By Terence Parr and Kerem Turgutlu. See Explained.ai for more stuff.

The default scikit-learn Random Forest feature importance strategy is the mean decrease in impurity (or gini importance) mechanism, which is unreliable. To get reliable results, use permutation importance, provided in the rfpimp package in the src dir. Install with:

pip install rfpimp

We include permutation and drop-column importance measures that work with any sklearn model. Yes, rfpimp is an increasingly-ill-suited name, but we still like it.

Description

See Beware Default Random Forest Importances for a deeper discussion of the issues surrounding feature importances in random forests (authored by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard).

The mean-decrease-in-impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within random forests. The problem is that this mechanism, while fast, does not always give an accurate picture of importance. Strobl et al. pointed out in “Bias in random forest variable importance measures: Illustrations, sources and a solution” that “the variable importance measures of Breiman's original random forest method ... are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.”

A more reliable method is permutation importance, which measures the importance of a feature as follows. Record a baseline accuracy (classifier) or R2 score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the random forest. Permute the column values of a single predictor feature, then pass the same samples back through the random forest and recompute the accuracy or R2. The importance of that feature is the drop in overall accuracy or R2 caused by permuting the column, that is, the baseline score minus the permuted score. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.
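The steps above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the rfpimp implementation; the function name `permutation_importance_sketch` and the made-up data are assumptions introduced for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def permutation_importance_sketch(model, X_valid, y_valid, rng):
    """Return baseline R^2 minus permuted R^2 for each column of X_valid."""
    baseline = r2_score(y_valid, model.predict(X_valid))
    drops = []
    for j in range(X_valid.shape[1]):
        X_perm = X_valid.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between column j and y
        drops.append(baseline - r2_score(y_valid, model.predict(X_perm)))
    return np.array(drops)

# Synthetic data (an assumption for illustration): column 2 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=600)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[:400], y[:400])
imp = permutation_importance_sketch(model, X[400:], y[400:], rng)
print(imp)
```

A value near zero (or slightly negative) means that shuffling the column did not hurt the validation score, which is what one would expect for a noise column.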

Sample code

See the notebooks directory for things like Collinear features and Plotting feature importances.

Here's some sample Python code that uses the rfpimp package contained in the src directory. The data can be found in rent.csv, which is a subset of the data from Kaggle's Two Sigma Connect: Rental Listing Inquiries competition.

from rfpimp import *
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

df_orig = pd.read_csv("/Users/parrt/github/random-forest-importances/notebooks/data/rent.csv")

df = df_orig.copy()

# attenuate effect of outliers in price
df['price'] = np.log(df['price'])

df_train, df_test = train_test_split(df, test_size=0.20)

features = ['bathrooms','bedrooms','longitude','latitude',
            'price']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('price',axis=1), df_train['price']
X_test, y_test = df_test.drop('price',axis=1), df_test['price']
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test) # permutation
viz = plot_importances(imp)
viz.view()


df_train, df_test = train_test_split(df_orig, test_size=0.20)
features = ['bathrooms','bedrooms','price','longitude','latitude',
            'interest_level']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('interest_level',axis=1), df_train['interest_level']
X_test, y_test = df_test.drop('interest_level',axis=1), df_test['interest_level']
# Add column of random numbers
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestClassifier(n_estimators=100,
                            min_samples_leaf=5,
                            n_jobs=-1,
                            oob_score=True)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test, n_samples=-1)
viz = plot_importances(imp)
viz.view()

Feature correlation

See Feature collinearity heatmap. We can get the Spearman's correlation matrix:
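As a minimal sketch of what a Spearman correlation matrix looks like, the example below uses pandas directly on a small made-up frame. The column values are synthetic and only borrow names from the rent data; this is not the rfpimp heatmap code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(0)
bedrooms = rng.integers(0, 5, size=200)
df_demo = pd.DataFrame({
    "bedrooms": bedrooms,
    "bathrooms": bedrooms + rng.integers(0, 2, size=200),  # tracks bedrooms closely
    "random": rng.random(200),
})

corr = df_demo.corr(method="spearman")   # rank-based correlation between every pair of columns
print(corr.round(2))
```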

Feature dependencies

The features we use in machine learning are rarely completely independent, which makes interpreting feature importance tricky. We could compute correlation coefficients, but that only identifies linear relationships. A way to at least identify if a feature, x, is dependent on other features is to train a model using x as a dependent variable and all other features as independent variables. Because random forests give us an easy out of bag error estimate, the feature dependence functions rely on random forest models. The R^2 prediction error from the model indicates how easy it is to predict feature x using the other features. The higher the score, the more dependent feature x is.

You can also get a feature dependence matrix / heatmap that returns a non-symmetric data frame where each row is the importance of each var to the row's var used as a model target. Example:
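The dependence idea described above can be sketched as follows: for each feature, fit a random forest that predicts it from the remaining features and read off the out-of-bag R^2. The helper name `feature_dependence_sketch` and the synthetic data are assumptions for illustration, and the package's own functions may differ in detail.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def feature_dependence_sketch(df, n_estimators=50, random_state=0):
    """OOB R^2 of predicting each column from all the other columns."""
    scores = {}
    for col in df.columns:
        rf = RandomForestRegressor(n_estimators=n_estimators, oob_score=True,
                                   random_state=random_state)
        rf.fit(df.drop(columns=col), df[col])
        scores[col] = rf.oob_score_
    return pd.Series(scores).sort_values(ascending=False)

# Synthetic data: b is almost a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=300)
df_demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})
dep = feature_dependence_sketch(df_demo)
print(dep)
```

Fixing `random_state` on each forest, as done here, is one way to make the per-feature scores repeatable across runs.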

random-forest-importances's People

Contributors

chrispaulca, escherba, feribg, gilesstrong, keremturgutlu, marcotama, matheusccouto, mk-bldn, parrt, rohanbhandari, sharpen6, tjpell, yskmt

random-forest-importances's Issues

Add multiprocessing for oob_importances

Wanted to suggest parallelizing the oob importance calculation in order to speed it up, since the importances can be calculated independently for each feature.

In my use case, I saw a >8x reduction in runtime after parallelizing and would be happy to contribute the code needed to implement this.
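One possible shape for such a change, sketched under assumptions: each column's score drop is computed in its own task on a private copy of the data, so tasks never share mutable state. The function names are hypothetical, a plain `predict` callable stands in for the forest, and a thread pool is used only to keep the example self-contained; a process pool may suit CPU-bound Python code better.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def score_drop_for_column(predict, X, y, baseline, j, seed):
    rng = np.random.default_rng(seed)
    X_perm = X.copy()                      # private copy for this task
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    return baseline - r2(y, predict(X_perm))

def parallel_permutation_importances(predict, X, y, max_workers=4):
    baseline = r2(y, predict(X))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_drop_for_column, predict, X, y, baseline, j, j)
                   for j in range(X.shape[1])]
        return np.array([f.result() for f in futures])

# Toy linear "model" (an assumption): column 1 has zero weight
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w = np.array([2.0, 0.0, -1.0])
y = X @ w
imp = parallel_permutation_importances(lambda A: A @ w, X, y)
print(imp)
```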

Incorrect references to sklearn?

Hello,

I have rfpimp ver 1.3.6 installed as well as sklearn 0.24.1. When I ran a script that used them, I got this error
File "C:...\anaconda3\envs...\lib\site-packages\rfpimp.py", line 16, in
from sklearn.ensemble.forest import _generate_unsampled_indices
ModuleNotFoundError: No module named 'sklearn.ensemble.forest'

I dug into it and found that sklearn.ensemble.forest is, in my version, sklearn.ensemble._forest and _generate_unsampled_indices does reside there.
While it's possible that something is wrong on my end, my guess is that sklearn has changed? I may change rfpimp.py on my own to match sklearn. I hope it doesn't break my computer. Thanks!

Use df.corr('spearman') instead of scipy.stats.spearmanr, to deal with dfs with NaNs

I'm dealing with a dataset of answers to a questionnaire. Each person receives 8 questions out of 12 possible ones, so about 33% of the values are NaN.

To pass the quiz, you need to get 5 right. I'm trying to figure out the importance of each one of those questions to reduce the number of questions and make it shorter, while still having relevant information.

I tried to use the plot_corr_heatmap function, but it gives errors because the input DataFrame can't contain NaN values. For this step of calculating the correlation between variables, I don't think it's right to fillna them with another value, as that would distort the data in a significant way (even though I did it to train a Random Forest Classifier on it).

I saw that plot_corr_heatmap internally calls scipy.stats.spearmanr, which doesn't accept NaNs.

But pandas DataFrames have a df.corr() method that can take 'spearman' as a parameter, and accept NaNs, so that would make plot_corr_heatmap usable in cases like this

What do you think, @parrt? Should I submit a PR?
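For reference, here is a small sketch of the behaviour the proposal relies on: `DataFrame.corr` computes each pairwise correlation from the rows where both columns are present. The questionnaire values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up answers: NaN marks a question the person did not receive
answers = pd.DataFrame({
    "q1": [1, 0, 1, np.nan, 1, 0, np.nan, 1],
    "q2": [1, 0, np.nan, 1, 1, 0, 0, np.nan],
    "q3": [np.nan, 1, 0, 0, np.nan, 1, 1, 0],
})

# min_periods sets how many complete pairs a cell needs before it is reported
corr = answers.corr(method="spearman", min_periods=3)
print(corr)
```

With very sparse data, some pairs may share too few rows to be meaningful, so a `min_periods` threshold is worth choosing deliberately.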

score() takes from 2 to 3 positional arguments but 4 were given: dropcol_importance

Hello,

I am trying to find important features from trained CatBoost model using dropcol_importance. I have already tried dropcol_importance with SVC and XGBoost and it is working fine but with CatBoost model, I am facing an error that I am not able to debug.

My code is as follows:

def model_train(clf, X_train,X_test,y_train,y_test,i):
    clf.fit(X_train, y_train)#
    model_name = 'fold_' + str(i) + '_catboost.sav'
    pickle.dump(clf,open(model_name,'wb'))
    scores = score_computation(clf,X_test,y_test,X_train,y_train)
    return scores

weights = {0:1, 1:19.22}
clf = CatBoostClassifier(depth=10,
                         learning_rate=0.1,
                         l2_leaf_reg = 2,
                         n_estimators = 40,
                         loss_function='Logloss',
                         thread_count = 15,
                         verbose=False,
                         task_type="GPU",
                         devices='0:1',
                         class_weights = weights)
print('************************************')
print('Now starting the training of Fold-',i,' ....')
X_train, X_test = pd.DataFrame(X[train_index]), pd.DataFrame(X[test_index])
y_train, y_test = Y[train_index], Y[test_index]
%time scores = model_train(clf,X_train,X_test,y_train,y_test,i)
print('Model training is complete. Now getting best features...')
%time imp = dropcol_importances(clf,X_train,y_train,X_test,y_test)

The error message is: score() takes from 2 to 3 positional arguments but 4 were given. Here is the traceback.

<timed exec> in <module>

~/miniconda3/lib/python3.7/site-packages/rfpimp.py in dropcol_importances(model, X_train, y_train, X_valid, y_valid, metric, sample_weights)
     330         baseline = metric(model_, X_valid, y_valid, sample_weights)
     331     else:
--> 332         baseline = model_.score(X_valid, y_valid, sample_weights)
    333     imp = []
    334     for col in X_train.columns:

TypeError: score() takes from 2 to 3 positional arguments but 4 were given

I also tried dropcol_importances(clf,X_train,y_train), but no luck. The same dropcol_importances script works with the other models but not with CatBoost. Please suggest how to fix this. Thanks.
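The traceback shows that when no metric is given, `score` is called with `sample_weights` as a third positional argument, while a supplied metric is called as `metric(model_, X_valid, y_valid, sample_weights)`. A possible workaround, sketched under that assumption, is a metric adapter that forwards only `X` and `y`. `TwoArgScoreModel` is a hypothetical stand-in for an estimator whose `score` accepts only `(X, y)`; none of this has been verified against CatBoost.

```python
def score_without_weights(model, X_valid, y_valid, sample_weights=None):
    """Metric adapter: accepts the four-argument call but forwards only X and y."""
    return model.score(X_valid, y_valid)

class TwoArgScoreModel:
    """Hypothetical estimator whose score() takes only (X, y)."""
    def score(self, X, y=None):
        correct = sum(1 for xi, yi in zip(X, y) if xi == yi)
        return correct / len(y)

model = TwoArgScoreModel()
X_valid, y_valid = [0, 1, 1, 0], [0, 1, 0, 0]

try:
    model.score(X_valid, y_valid, None)       # mirrors the failing call in the traceback
except TypeError as err:
    print("direct call fails:", err)

result = score_without_weights(model, X_valid, y_valid, None)
print(result)
```

If the installed rfpimp matches the signature in the traceback, passing `metric=score_without_weights` to `dropcol_importances` should take the other branch and avoid the failing call.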

Why is this package using R2 as the criterion to evaluate error rate for regression?

There is a classic R package named "randomForest" based on Leo Breiman and Adele Cutler's Fortran code. It uses mean squared error as the criterion to evaluate variable importance before and after permutation. I have read the Python code for this package and found that the criterion is R2, which differs from the R package. Is the R2 criterion better? Or did I misunderstand the Python code? Thanks a lot in advance for answering my questions!
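A short numerical check may help here. With R2 defined as 1 - MSE / Var(y) on a fixed evaluation set, the drop in R2 after permutation equals the rise in MSE divided by Var(y), so the two criteria differ only by a constant scale on that set. The predictions below are synthetic stand-ins, and this says nothing about any extra normalization the R package may apply.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)
pred_base = y + rng.normal(scale=0.2, size=200)   # stand-in for baseline predictions
pred_perm = y + rng.normal(scale=0.8, size=200)   # stand-in for predictions after permuting a column

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    return 1.0 - mse(y_true, y_pred) / np.var(y_true)

drop_in_r2 = r2(y, pred_base) - r2(y, pred_perm)
rise_in_mse = mse(y, pred_perm) - mse(y, pred_base)
print(np.isclose(drop_in_r2, rise_in_mse / np.var(y)))
```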

Feature importance is zero!!!

I am using a dataset to compute feature importance using permutation. I have checked the results against an R implementation, where I get non-zero variable importances. What could be the reason? Here is my code:

from rfpimp import *
from sklearn.ensemble.forest import _generate_unsampled_indices

# TODO: add arg for subsample size to compute oob score

def oob_classifier_accuracy(rf, X_train, y_train):
    X = X_train.values
    y = y_train.values

    n_samples = len(X)
    n_classes = len(np.unique(y))
    predictions = np.zeros((n_samples, n_classes))
    for tree in rf.estimators_:
        unsampled_indices = _generate_unsampled_indices(tree.random_state, n_samples)
        tree_preds = tree.predict_proba(X[unsampled_indices, :])
        predictions[unsampled_indices] += tree_preds

    predicted_class_indexes = np.argmax(predictions, axis=1)
    predicted_classes = [rf.classes_[i] for i in predicted_class_indexes]

    oob_score = np.mean(y == predicted_classes)
    return oob_score

def permutation_importances(rf, X_train, y_train, metric):
    """
    Return importances from pre-fit rf; metric is function
    that measures accuracy or R^2 or similar. This function
    works for regressors and classifiers.
    """
    baseline = metric(rf, X_train, y_train)
    imp = []
    for col in X_train.columns:
        save = X_train[col].copy()
        X_train[col] = np.random.permutation(X_train[col])
        m = metric(rf, X_train, y_train)
        X_train[col] = save
        imp.append(baseline - m)
    return np.array(imp)
rf = clone(base_rf)
rf.fit(X_train, y_train)
oob = oob_classifier_accuracy(rf, X_train, y_train)
print("oob accuracy",oob)

imp = permutation_importances(rf, X_train, y_train,
                              oob_classifier_accuracy)
imp

Gives an output of:

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

I also computed oob_classifier_accuracy() after permuting all the variables, and the reported accuracy doesn't change at all. The event rate in the data is rather low, around 5%.

ERROR: Error checking for conflicts.

I had this happen in a notebook while trying to pip install.

Not sure what's causing it?

ERROR: Error checking for conflicts.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3021, in _dep_map
    return self.__dep_map
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 2815, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3012, in _parsed_pkg_info
    return self._pkg_info
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 2815, in __getattr__
    raise AttributeError(attr)
AttributeError: _pkg_info

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/commands/install.py", line 568, in _warn_about_conflicts
    package_set, _dep_info = check_install_conflicts(to_install)
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/check.py", line 114, in check_install_conflicts
    package_set, _ = create_package_set_from_installed()
  File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/check.py", line 53, in create_package_set_from_installed
    package_set[name] = PackageDetails(dist.version, dist.requires())
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 2736, in requires
    dm = self._dep_map
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3023, in _dep_map
    self.__dep_map = self._compute_dependencies()
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3032, in _compute_dependencies
    for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3014, in _parsed_pkg_info
    metadata = self.get_metadata(self.PKG_INFO)
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1420, in get_metadata
    value = self._get(path)
  File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1616, in _get
    with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/dist-packages/~-rtifi-2020.6.20.dist-info/METADATA'

ImportError: cannot import name plot_corr_heatmap

Using the library, I get the following error:

ImportError: cannot import name plot_corr_heatmap

This function does not exist in my rfpimp library. Am I using the older version or am missing something? I followed the instructions in here.

What's the meaning of <0 values?

Just tested the code from index page, some of my features have negative values, does it mean reverse-related to target feature or something else? Thank you.

'deep' is an invalid keyword argument for this function

I am trying to get the feature importance of my random forest model, but I keep getting the following error:

'deep' is an invalid keyword argument for this function

Below is the entire error output:

TypeError Traceback (most recent call last)
in ()
1 #getting imortance for features using permutation importance
2
----> 3 perm_imp_rfpimp_rf10 = permutation_importances(rf_10, train_features_x, train_labels_y, rdt10)
4 perm_imp_rfpimp_rf100 = permutation_importances(rf_100, train_features_x, train_labels_y, rdt100)
5 perm_imp_rfpimp_rf1000 = permutation_importances(rf_1000, train_features_x, train_labels_y, rdt1000)

~/anaconda2/envs/py36/lib/python3.6/site-packages/rfpimp.py in permutation_importances(rf, X_train, y_train, metric, n_samples)
286
287 def permutation_importances(rf, X_train, y_train, metric, n_samples=5000):
--> 288 imp = permutation_importances_raw(rf, X_train, y_train, metric, n_samples)
289 I = pd.DataFrame(data={'Feature':X_train.columns, 'Importance':imp})
290 I = I.set_index('Feature')

~/anaconda2/envs/py36/lib/python3.6/site-packages/rfpimp.py in permutation_importances_raw(rf, X_train, y_train, metric, n_samples)
403
404 baseline = metric(rf, X_sample, y_sample)
--> 405 X_train = X_sample.copy(deep=False,axes=True) # shallow copy
406 y_train = y_sample
407 imp = []

TypeError: 'deep' is an invalid keyword argument for this function

My inputs involve providing a function based metric as below:

def rdt10(rf_10,train_features_x, train_labels_y):
    return r2_score(train_labels_y, rf_10.predict(train_features_x))

def rdt100(rf_100,train_features_x, train_labels_y):
    return r2_score(train_labels_y, rf_100.predict(train_features_x))

def rdt1000(rf_1000,train_features_x, train_labels_y):
    return r2_score(train_labels_y, rf_1000.predict(train_features_x))

and then calling it in the permutation importance function below (this is what gives the error output from above):

perm_imp_rfpimp_rf10 = permutation_importances(rf_10, train_features_x, train_labels_y, rdt10)
perm_imp_rfpimp_rf100 = permutation_importances(rf_100, train_features_x, train_labels_y, rdt100)
perm_imp_rfpimp_rf1000 = permutation_importances(rf_1000, train_features_x, train_labels_y, rdt1000)

rf_10, rf_100, rf_1000 are my random forest models using 10, 100, and 1000 estimators.

Please help me figure out how to address this error:
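One plausible cause, not confirmed in this thread, is that `train_features_x` is a NumPy array rather than a pandas DataFrame: `ndarray.copy` does not accept a `deep` keyword, while `DataFrame.copy` does. The sketch below reproduces that difference on made-up data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

X_array = np.arange(6.0).reshape(3, 2)

array_copy_failed = False
try:
    X_array.copy(deep=False)              # ndarray.copy has no 'deep' parameter
except TypeError as err:
    array_copy_failed = True
    print("ndarray:", err)

X_frame = pd.DataFrame(X_array, columns=["f0", "f1"])   # hypothetical column names
shallow = X_frame.copy(deep=False)        # DataFrame.copy accepts 'deep'
print(list(shallow.columns))
```

If that is the situation, wrapping the features in a DataFrame before calling the importance function may be enough.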

Feature correlation p-values and correction methods

Wanted to get the conversation open on feature correlation, right now it just does a naive spearmanr, with no insight into the resulting p-values. Would be great to do a few things, listed below in order of importance:

  1. Introduce p-values and maybe apply the appropriate cutoffs
  2. Introduce permutation based correlation, starting off with lagged correlations for example (context is time series analysis)
  3. Introduce a probability correction method for 1 and/or 2 such as bonferroni, to account for the number of correlation estimates we're doing between features and between number of lags if we end up implementing #2.

Happy to get the conversation going and see where we end up. Right now the feature correlation estimation is not quite stable in the context of very noisy time series data.
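To make items 1 and 3 concrete, here is a rough sketch using the p-values that `scipy.stats.spearmanr` already returns, with a Bonferroni cutoff over the distinct feature pairs. The data are synthetic and the significance level of 0.05 is an assumption for the example.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic data: columns 0 and 1 are related, column 2 is independent
rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
data = np.column_stack([a, a + rng.normal(scale=0.5, size=n), rng.normal(size=n)])

rho, pval = spearmanr(data)              # with three or more columns, both are matrices
iu = np.triu_indices_from(rho, k=1)      # visit each feature pair once
m = len(iu[0])                           # number of tests performed
alpha = 0.05
significant = pval[iu] < alpha / m       # Bonferroni: compare each p-value to alpha / m

for i, j, keep in zip(iu[0], iu[1], significant):
    print(f"features {i},{j}: rho={rho[i, j]:+.2f} significant={keep}")
```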

Questions Regarding Alternative Feature Importance

sparse matrix as input

Hi. First of all this work is amazing. Thank you for this contribution and for explaining the issues with RF default importance so clearly.

I am working with a large dataset (term-document matrix) and am using a sparse matrix (scipy csr) as my input for the model. I wanted to know if you have any suggestions about how to implement the dropcol_importances functionality without changing to a pandas df (too large). The 'rows' of the csr are my features.

Currently thinking I will need to convert to a COO, remove all the rows with feature (x), then convert back to sparse matrix for each iteration.

I'm currently searching for efficient ways to do this and thought I would ask you about it.
Thank you again for all your contributions.
Regards,
Summer
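One option that avoids a dense DataFrame, sketched here under the assumption that features are stored as columns, is to index the sparse matrix with the list of columns to keep. The helper name `drop_column` is hypothetical. If features are rows instead, the same idea applies with `X[keep, :]` on a CSR matrix.

```python
import numpy as np
from scipy import sparse

def drop_column(X_csc, j):
    """Return a new sparse matrix without column j."""
    keep = np.delete(np.arange(X_csc.shape[1]), j)
    return X_csc[:, keep]

X = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 5]]))
X_csc = X.tocsc()                  # column slicing is cheaper in CSC than in CSR
X_drop = drop_column(X_csc, 1)
print(X_drop.toarray())
```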

Full importances function always uses model.score

I was wondering whether a PR that changes that behaviour to use the score from a passed metric function would be better, while still defaulting to model.score if none is set. This is the current behaviour of permutation_importances_raw; however, that function doesn't support grouping of the features. In many cases accuracy or R2 are not the most suitable scores.

No module named 'sklearn.ensemble.forest' in scikit-learn 1.4.1

Calling rfpimp.plot_corr_heatmap() errors out with:

File <redacted>/python3.12/site-packages/rfpimp.py:15
     13 from sklearn.ensemble import RandomForestClassifier
     14 from sklearn.ensemble import RandomForestRegressor
---> 15 from sklearn.ensemble.forest import _generate_unsampled_indices
     16 from sklearn.ensemble import forest
     17 from sklearn.model_selection import cross_val_score

ModuleNotFoundError: No module named 'sklearn.ensemble.forest'
  • rfpimp version is 1.3.2 installed with conda from conda-forge

Dropcol depends on OOB score

It seems that the dropcol importance depends on the oob_score which might not be available or accurate in many cases depending on the problem. Would it make sense to submit a PR that checks for the full model metric vs constrained model metric on a validation set and not OOB?
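A validation-set variant could look roughly like the sketch below: refit a clone of the model without each column and compare validation scores against the full model. The function name `dropcol_importances_valid` and the synthetic data are assumptions for illustration, not the package's code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

def dropcol_importances_valid(model, X_train, y_train, X_valid, y_valid):
    """Drop in validation score when each column is removed and the model is refit."""
    full = clone(model).fit(X_train, y_train)
    baseline = full.score(X_valid, y_valid)
    imp = []
    for j in range(X_train.shape[1]):
        keep = [k for k in range(X_train.shape[1]) if k != j]
        reduced = clone(model).fit(X_train[:, keep], y_train)
        imp.append(baseline - reduced.score(X_valid[:, keep], y_valid))
    return np.array(imp)

# Synthetic data: only column 0 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=600)

model = RandomForestRegressor(n_estimators=30, random_state=0)
imp = dropcol_importances_valid(model, X[:400], y[:400], X[400:], y[400:])
print(imp)
```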

Positional arguments for 'X_Valid' and 'y_Valid' in dropcol_importances()

Hi,

In case of dropcol_importances() I believe that the validation sets should be keyword arguments, such as metric and in case they are not specified the score should be calculated on the training data. This would be in line with what is done for example for permutation_importances().

Also the examples given for dropcol_importances() are no longer working with the current positional arguments.

Please correct me if I'm wrong.

SyntaxError in python2.7

Syntax error in python2.7 (it does work python3). If rfpimp is not supposed to work in 2.7, you might want to consider mentioning it in the README

import rfpimp

File "/Users/diegomazon/anaconda/lib/python2.7/site-packages/rfpimp.py", line 40
self.svgfilename = f"{tmp}/PimpViz_{getpid()}.svg"
^
SyntaxError: invalid syntax

Varying Dependency Value

When I use the "feature_dependence_matrix" function to get the dependency of each independent variable, the values change every time I run the code. Setting random_state only gives me a constant overall dependency, no matter how many times I run the code; the individual dependencies still change.

Is there any way I could obtain fixed individual dependency values every time?

Thanks!

Plotting correlation heatmap with 2 features only causes error

Calling rfpimp.plot_corr_heatmap with data that has only 2 features causes a ValueError: input array must be 2-d.

import pandas as pd
from rfpimp import plot_corr_heatmap
df_all = pd.read_csv("rent.csv")
features = ['bathrooms','bedrooms']
df = df_all[features]
plot_corr_heatmap(df)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-bdbea3cdef5a> in <module>
      4 features = ['bathrooms','bedrooms']
      5 df = df_all[features]
----> 6 plot_corr_heatmap(df)

~/anaconda3/envs/cxxi/lib/python3.7/site-packages/rfpimp.py in plot_corr_heatmap(df, color_threshold, cmap, figsize, value_fontsize, label_fontsize, precision, xrot)
    851     filtered = np.abs(filtered)  # work with abs but display negatives later
    852     mask = np.ones_like(corr)
--> 853     filtered[np.triu_indices_from(mask)] = -9999
    854 
    855     if cmap is None:

~/anaconda3/envs/cxxi/lib/python3.7/site-packages/numpy/lib/twodim_base.py in triu_indices_from(arr, k)
    999     """
   1000     if arr.ndim != 2:
-> 1001         raise ValueError("input array must be 2-d")
   1002     return triu_indices(arr.shape[-2], k=k, m=arr.shape[-1])

ValueError: input array must be 2-d

I figured this is due to a nuance of scipy.stats.spearmanr, which returns a scalar if only 2 variables are passed. Quoting the documentation: "Spearman correlation matrix or correlation coefficient (if only 2 variables are given as parameters)."
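A small guard along these lines might handle the two-column case: when the result is a scalar, expand it into a 2x2 matrix. The helper name `spearman_matrix` is hypothetical and the data are made up.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_matrix(values):
    """Spearman correlation matrix that is always 2-d, even for two columns."""
    rho, _ = spearmanr(values)
    if np.ndim(rho) == 0:                        # two columns: a single coefficient comes back
        rho = np.array([[1.0, rho], [rho, 1.0]])
    return np.asarray(rho)

two_cols = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
corr = spearman_matrix(two_cols)
print(corr)
```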

TypeError: 'numpy.ndarray' object is not callable

Hi, trying to feed in my dfs into the example given in the readme and i get the following error:

TypeError: 'numpy.ndarray' object is not callable

I have verified i am feeding in dfs not np arrays, so unsure what's going on.

Thanks,
Z

ModuleNotFoundError: No module named 'sklearn.ensemble.forest'

Hello, I am trying to import rfpimp however I am met by the error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-133-c95d15dec9fe> in <module>
     24 import matplotlib.patheffects as PathEffects
     25 from pandas.plotting import lag_plot
---> 26 from rfpimp import *
     27 
     28 # Machine Learning libraries

~/opt/anaconda3/lib/python3.8/site-packages/rfpimp.py in <module>
     13 from sklearn.ensemble import RandomForestClassifier
     14 from sklearn.ensemble import RandomForestRegressor
---> 15 from sklearn.ensemble.forest import _generate_unsampled_indices
     16 from sklearn.ensemble import forest
     17 from sklearn.model_selection import cross_val_score

ModuleNotFoundError: No module named 'sklearn.ensemble.forest'

It seems that sklearn.ensemble.forest was renamed to sklearn.ensemble._forest (see here)

I'd have to install an older version for sklearn however that would break other dependencies I have. Is there a fix around this? Thanks

An error occurred when the test file was run

I got an error running "permutation-importances-classifier", “forest” seems to be updated to “_forest” in sklearn. I changed "from sklearn.ensemble.forest import _generate_unsampled_indices" to "from sklearn.ensemble._forest import _generate_unsampled_indices" and it worked fine.

In the same code, "unsampled_indices = _generate_unsampled_indices(tree.random_state, n_samples)" raises "TypeError: _generate_unsampled_indices() missing 1 required positional argument: 'n_samples_bootstrap'" when run.
The function of _generate_unsampled_indices is defined as: "def _generate_unsampled_indices(random_state, n_samples, n_samples_bootstrap):".

FutureWarning: The sklearn.ensemble.forest module is deprecated in version 0.22 and will be removed in version 0.24

This import statement

from sklearn.ensemble.forest import _generate_unsampled_indices

is raising the following FutureWarning

 FutureWarning:
The sklearn.ensemble.forest module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.

This could easily be solved by implementing the simple function in rfpimp.
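A local replacement might look like the sketch below: draw the bootstrap sample from the seed, then return the row indices that were never drawn. The function name `unsampled_indices` is hypothetical. Whether its output matches the rows a fitted tree actually saw depends on scikit-learn drawing its bootstrap the same way from `tree.random_state`, which is an assumption that should be checked against the installed version.

```python
import numpy as np

def unsampled_indices(seed, n_samples, n_samples_bootstrap=None):
    """Indices of rows left out of a bootstrap sample drawn from the given seed."""
    if n_samples_bootstrap is None:
        n_samples_bootstrap = n_samples            # assumption: full-size bootstrap by default
    rng = np.random.RandomState(seed)
    sampled = rng.randint(0, n_samples, n_samples_bootstrap)   # draw with replacement
    counts = np.bincount(sampled, minlength=n_samples)
    return np.flatnonzero(counts == 0)             # rows never drawn are out-of-bag

oob = unsampled_indices(0, 100)
print(len(oob))
```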

SyntaxError: invalid syntax

When I import rfpimp, I get the error below:

File "/Users/yan/anaconda/lib/python3.5/site-packages/rfpimp.py", line 518
ax.xaxis.set_major_formatter(FormatStrFormatter(f'%.{xtick_precision}f'))
^
SyntaxError: invalid syntax

ERROR IN "plot_importances"

621 barcounts = np.array([f.count('\n')+1 for f in I.index])
AttributeError: 'int' object has no attribute 'count'
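The failing line calls `.count` on each index label, so one likely trigger (an assumption, since the report does not include the input) is a DataFrame whose column names are integers. The sketch below reproduces the error on a made-up importance frame and shows that string labels avoid it.

```python
import pandas as pd

# Made-up importance frame with integer feature labels
imp = pd.DataFrame({"Importance": [0.4, 0.1]},
                   index=pd.Index([0, 1], name="Feature"))

int_labels_failed = False
try:
    [f.count("\n") + 1 for f in imp.index]       # same expression as the failing line
except AttributeError as err:
    int_labels_failed = True
    print("integer labels:", err)

imp.index = imp.index.astype(str)                # string labels support .count
barcounts = [f.count("\n") + 1 for f in imp.index]
print(barcounts)
```

Renaming the feature columns to strings before computing importances, for example with `X.columns = X.columns.astype(str)`, would have the same effect upstream.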

Error with oob_importances with scikit-learn 0.22.1

oob_importances internally uses _generate_unsampled_indices which is a private function within scikit-learn. In scikit-learn 0.22.1 the function signature of _generate_unsampled_indices has changed from
_generate_unsampled_indices(random_state, n_samples) to
_generate_unsampled_indices(random_state, n_samples, n_samples_bootstrap) .
This signature change can be seen here
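One way to tolerate both signatures is to inspect the helper before calling it. The sketch below uses two stand-in functions instead of the real private helper, and the default of a full-size bootstrap is an assumption.

```python
import inspect

def call_unsampled_indices(fn, random_state, n_samples, n_samples_bootstrap=None):
    """Call fn with two or three arguments, depending on what its signature declares."""
    if "n_samples_bootstrap" in inspect.signature(fn).parameters:
        if n_samples_bootstrap is None:
            n_samples_bootstrap = n_samples      # assumption: default to a full-size bootstrap
        return fn(random_state, n_samples, n_samples_bootstrap)
    return fn(random_state, n_samples)

# Hypothetical stand-ins for the old and new private helpers
def old_style(random_state, n_samples):
    return ("old", n_samples)

def new_style(random_state, n_samples, n_samples_bootstrap):
    return ("new", n_samples, n_samples_bootstrap)

old_result = call_unsampled_indices(old_style, 0, 10)
new_result = call_unsampled_indices(new_style, 0, 10)
print(old_result, new_result)
```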

'RandomForestRegressor' object has no attribute 'estimators_'

When I use the function 'oob_regression_r2_score()', I get "AttributeError: 'RandomForestRegressor' object has no attribute 'estimators_'".

def oob_regression_r2_score(rf, X_train, y_train):
    """
    Compute out-of-bag (OOB) R^2 for a scikit-learn random forest
    regressor. We learned the guts of scikit's RF from the BSD licensed
    code:
    https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/forest.py#L702
    """
    X = X_train.values if isinstance(X_train, pd.DataFrame) else X_train
    y = y_train.values if isinstance(y_train, pd.Series) else y_train

    n_samples = len(X)
    predictions = np.zeros(n_samples)
    n_predictions = np.zeros(n_samples)

    for tree in rf.estimators_:
        unsampled_indices = _get_unsampled_indices(tree, n_samples)
        tree_preds = tree.predict(X[unsampled_indices, :])
        predictions[unsampled_indices] += tree_preds
        n_predictions[unsampled_indices] += 1

    if (n_predictions == 0).any():
        warnings.warn("Too few trees; some variables do not have OOB scores.")
        n_predictions[n_predictions == 0] = 1

    predictions /= n_predictions

    oob_score = r2_score(y, predictions)
    return oob_score

def permutation_importances(rf, X_train, y_train, metric):
    baseline = metric(rf, X_train, y_train)
    imp = []
    for col in X_train.columns:
        save = X_train[col].copy()
        X_train[col] = np.random.permutation(X_train[col])
        m = metric(rf, X_train, y_train)
        X_train[col] = save
        imp.append(baseline - m)
    return np.array(imp)

rf = RandomForestRegressor(n_estimators=100)
imp = permutation_importances(rf, X_train, y_train, oob_regression_r2_score)
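In the snippet above the forest is constructed but never fitted before the metric iterates over `rf.estimators_`. In scikit-learn that attribute only exists after `fit` has been called, as the small check below illustrates on made-up data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up data for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

rf = RandomForestRegressor(n_estimators=10, random_state=0)
fitted_before = hasattr(rf, "estimators_")   # False: the trees are only built by fit()
rf.fit(X_demo, y_demo)
fitted_after = hasattr(rf, "estimators_")    # True: one fitted tree per estimator
print(fitted_before, fitted_after, len(rf.estimators_))
```

Calling `rf.fit(X_train, y_train)` before `permutation_importances(...)` should therefore remove this particular AttributeError.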

TypeError: _generate_unsampled_indices() missing 1 required positional argument: 'n_samples_bootstrap'

Hello there, after I installed the library, I tested the example code:

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
X_train, y_train = ..., ...
rf.fit(X_train, y_train)
imp = oob_importances(rf, X_train, y_train)

And it shows the error as below:

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(x, y)
imp = oob_importances(rf, x, y)

TypeError Traceback (most recent call last)
in
1 rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, oob_score=True)
2 rf.fit(x, y)
----> 3 imp = oob_importances(rf, x, y)

~/anaconda/envs/py36/lib/python3.6/site-packages/rfpimp.py in oob_importances(rf, X_train, y_train, n_samples)
231 """
232 if isinstance(rf, RandomForestClassifier):
--> 233 return permutation_importances(rf, X_train, y_train, oob_classifier_accuracy, n_samples)
234 elif isinstance(rf, RandomForestRegressor):
235 return permutation_importances(rf, X_train, y_train, oob_regression_r2_score, n_samples)

~/anaconda/envs/py36/lib/python3.6/site-packages/rfpimp.py in permutation_importances(rf, X_train, y_train, metric, n_samples)
282
283 def permutation_importances(rf, X_train, y_train, metric, n_samples=5000):
--> 284 imp = permutation_importances_raw(rf, X_train, y_train, metric, n_samples)
285 I = pd.DataFrame(data={'Feature':X_train.columns, 'Importance':imp})
286 I = I.set_index('Feature')

~/anaconda/envs/py36/lib/python3.6/site-packages/rfpimp.py in permutation_importances_raw(rf, X_train, y_train, metric, n_samples)
398 rf.fit(X_sample, y_sample)
399
--> 400 baseline = metric(rf, X_sample, y_sample)
401 X_train = X_sample.copy(deep=False) # shallow copy
402 y_train = y_sample

~/anaconda/envs/py36/lib/python3.6/site-packages/rfpimp.py in oob_classifier_accuracy(rf, X_train, y_train)
427 predictions = np.zeros((n_samples, n_classes))
428 for tree in rf.estimators_:
--> 429 unsampled_indices = _generate_unsampled_indices(tree.random_state, n_samples)
430 tree_preds = tree.predict_proba(X[unsampled_indices, :])
431 predictions[unsampled_indices] += tree_preds

TypeError: _generate_unsampled_indices() missing 1 required positional argument: 'n_samples_bootstrap'

Could you please help on this?

Many thanks

Yan

Overload inbuilt sklearn feature_importance for use with RFECV

I'm currently muddling through learning my way around the pitfalls of various importance measures, and one thing I'm aiming for (along with everyone else) is more automated feature selection. To this end

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html

is very handy but it would be nice to be able to use the definitions (specifically the drop/permute importances) from rfpimp there - what would be the best way to go about this? One way would seem to be to overload the sklearn RF itself or a branch of it, but perhaps more useful and modular would be to fork the RFECV into rfpimp and use it from there.

Thanks!
Z
