smazzanti / mrmr

mRMR (minimum-Redundancy-Maximum-Relevance) for automatic feature selection at scale.

License: MIT License

Python 100.00%

machine-learning feature-selection data-science mlops

mrmr's Introduction

What is mRMR

mRMR, which stands for "minimum Redundancy - Maximum Relevance", is a feature selection algorithm.

Why is it unique

The peculiarity of mRMR is that it is a minimal-optimal feature selection algorithm.
This means it is designed to find the smallest relevant subset of features for a given Machine Learning task.

Selecting the minimum number of useful features is desirable for many reasons:

memory consumption,
time required,
performance,
explainability of results.

This is why a minimal-optimal method such as mRMR is often preferable.

In contrast, the majority of other methods (for instance, Boruta or Positive-Feature-Importance) are classified as all-relevant, since they identify all the features that have some kind of relationship with the target variable.

When to use mRMR

Due to its efficiency, mRMR is ideal for practical ML applications, where it is necessary to perform feature selection frequently and automatically, in a relatively small amount of time.

For instance, in 2019, Uber engineers published a paper describing how they implemented mRMR in their marketing machine learning platform: Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform.

How to install this package

You can install this package in your environment via pip:

pip install mrmr_selection

And then import it in Python through:

import mrmr

How to use this package

This package is designed to do mRMR selection through different tools, depending on your needs and constraints.

Currently, the following tools are supported (others will be added):

Pandas
Polars
Spark
Google BigQuery

The package has a module for each supported tool. Each module has at least these 2 functions:

mrmr_classif, for feature selection when the target variable is categorical (binary or multiclass).
mrmr_regression, for feature selection when the target variable is numeric.

Let's see some examples.

1. Pandas example

You have a Pandas DataFrame (X) and a Series which is your target variable (y). You want to select the best K features to make predictions on y.

# create some pandas data
import pandas as pd
from sklearn.datasets import make_classification
X, y = make_classification(n_samples = 1000, n_features = 50, n_informative = 10, n_redundant = 40)
X = pd.DataFrame(X)
y = pd.Series(y)

# select top 10 features using mRMR
from mrmr import mrmr_classif
selected_features = mrmr_classif(X=X, y=y, K=10)

Note: the output of mrmr_classif is a list containing K selected features. This is a ranking, therefore, if you want to make a further selection, take the first elements of this list.
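To see why the result is a ranking rather than an unordered set, here is a minimal pure-Python sketch of a greedy mRMR loop. It is not this package's implementation: it swaps in absolute Pearson correlation for both relevance and redundancy, and the names `abs_corr` and `greedy_mrmr`, the `floor` value, and the toy columns are all illustrative assumptions.

```python
from math import sqrt


def abs_corr(a, b):
    """Absolute Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / sqrt(var_a * var_b)


def greedy_mrmr(features, target, K, floor=1e-3):
    """Return up to K feature names, in the order they were picked.

    features: dict mapping name -> list of values.
    floor: assumed lower bound on redundancy, to avoid dividing by zero.
    """
    relevance = {name: abs_corr(col, target) for name, col in features.items()}
    # Features with zero relevance are never candidates.
    candidates = [name for name in features if relevance[name] > 0]
    selected = []
    while candidates and len(selected) < K:
        def score(name):
            if not selected:
                return relevance[name]  # first pick: relevance only
            redundancy = sum(
                max(abs_corr(features[name], features[s]), floor) for s in selected
            ) / len(selected)
            return relevance[name] / redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: "a" and "a_noisy" carry almost the same signal,
# while "b" is uncorrelated with both of them.
features = {
    "a": [1, 1, 1, 1, -1, -1, -1, -1],
    "a_noisy": [1.5, 0.5, 1.5, 0.5, -0.5, -1.5, -0.5, -1.5],
    "b": [1, 1, -1, -1, 1, 1, -1, -1],
    "constant": [7] * 8,
}
target = [3, 3, 1, 1, -1, -1, -3, -3]

ranking = greedy_mrmr(features, target, K=3)
print(ranking)  # ['a', 'b', 'a_noisy']
```

Each pick depends only on the picks made before it, so calling the sketch with K=2 returns the first two entries of the K=3 result. It also shows that a later entry ("a_noisy") can have a higher relevance than an earlier one ("b"): the order reflects the selection order, not relevance alone.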

2. Polars example

# create some polars data
import polars
data = [(1.0, 1.0, 1.0, 7.0, 1.5, -2.3), 
        (2.0, None, 2.0, 7.0, 8.5, 6.7), 
        (2.0, None, 3.0, 7.0, -2.3, 4.4),
        (3.0, 4.0, 3.0, 7.0, 0.0, 0.0),
        (4.0, 5.0, 4.0, 7.0, 12.1, -5.2)]
columns = ["target", "some_null", "feature", "constant", "other_feature", "another_feature"]
df_polars = polars.DataFrame(data=data, schema=columns, orient="row")  # each tuple is one row

# select top 2 features using mRMR
import mrmr
selected_features = mrmr.polars.mrmr_regression(df=df_polars, target_column="target", K=2)

3. Spark example

# create some spark data
import pyspark
session = pyspark.sql.SparkSession(pyspark.context.SparkContext())
data = [(1.0, 1.0, 1.0, 7.0, 1.5, -2.3), 
        (2.0, float('NaN'), 2.0, 7.0, 8.5, 6.7), 
        (2.0, float('NaN'), 3.0, 7.0, -2.3, 4.4),
        (3.0, 4.0, 3.0, 7.0, 0.0, 0.0),
        (4.0, 5.0, 4.0, 7.0, 12.1, -5.2)]
columns = ["target", "some_null", "feature", "constant", "other_feature", "another_feature"]
df_spark = session.createDataFrame(data=data, schema=columns)

# select top 2 features using mRMR
import mrmr
selected_features = mrmr.spark.mrmr_regression(df=df_spark, target_column="target", K=2)

4. Google BigQuery example

# initialize BigQuery client
from google.cloud.bigquery import Client
bq_client = Client(credentials=your_credentials)

# select top 20 features using mRMR
import mrmr
selected_features = mrmr.bigquery.mrmr_regression(
    bq_client=bq_client,
    table_id='bigquery-public-data.covid19_open_data.covid19_open_data',
    target_column='new_deceased',
    K=20
)

Reference

For an easy-going introduction to mRMR, read my article on Towards Data Science: “MRMR” Explained Exactly How You Wished Someone Explained to You.

Also, this article describes an example of mRMR used on the world famous MNIST dataset: Feature Selection: How To Throw Away 95% of Your Data and Get 95% Accuracy.

mRMR was introduced in 2003. This is the original paper: Minimum Redundancy Feature Selection From Microarray Gene Expression Data.

Since then, it has been used in many practical applications, due to its simplicity and effectiveness. For instance, in 2019, Uber engineers published a paper describing how they implemented mRMR in their marketing machine learning platform: Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform.


mrmr's Issues

Redundancy Matrix is asymmetric

I would naively assume that the redundancy matrix should be symmetric, since pairwise mutual information (MI) is invariant under permutation of its arguments, i.e. MI(A, B) = MI(B, A) for any two features A and B.
However, I observe that the matrix is not symmetric, and the difference is not just a small numerical factor but quite significant. Am I missing something?

I attached a minimal working example to reproduce the issue (if it is one). Each pixel is the pairwise redundancy between two features. Matrix is obviously asymmetric:

import matplotlib.pyplot as plt
import numpy as np  # needed by check_symmetry() and main() below
import pandas as pd
from mrmr import mrmr_classif
from sklearn.datasets import make_classification


def generate_synthetic_data(n_samples=1_000, n_features=25, n_informative=10, n_redundant=5, seed=42):
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        n_redundant=n_redundant,
        n_repeated=5,
        random_state=seed
    )
    return pd.DataFrame(X, columns=[f'feature_{i}' for i in range(n_features)]), pd.Series(y)


def check_symmetry(matrix, num_threshold=1e-3):
    # assess if redundancy matrix is symmetric
    symmetry_diff = matrix - matrix.T
    non_symmetric_indices = np.sum(np.abs(symmetry_diff) > num_threshold)
    print(f"non-symmetric indices: {100 * non_symmetric_indices / (matrix.shape[0] * matrix.shape[1]):.2f} %")


def main():
    X, y = generate_synthetic_data()
    _, _, redundancy = mrmr_classif(X=X, y=y, K=X.shape[1], return_scores=True)
    redundancy_matrix = np.array(redundancy)

    fig, ax = plt.subplots(1, 1, figsize=(8, 8))
    plt.subplots_adjust(left=.1, right=.98, bottom=.1, top=.98)
    im = ax.imshow(redundancy_matrix)
    plt.colorbar(im, shrink=.8, ax=ax)
    plt.show()
    fig.savefig('test.pdf', bbox_inches='tight')

    check_symmetry(redundancy_matrix)

    # symmetrize
    check_symmetry((redundancy_matrix + redundancy_matrix.T) / 2)


if __name__ == '__main__':
    main()
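The symmetry claim itself can be checked independently of the package. The sketch below computes mutual information for two discrete variables from their joint counts and confirms that swapping the arguments gives the same value. The function name `mutual_information` and the toy data are illustrative, and nothing here says which statistic the package actually uses for redundancy.

```python
from collections import Counter
from math import isclose, log


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length sequences of discrete values."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (n_ab / n) * log(n_ab * n / (count_a[x] * count_b[y]))
        for (x, y), n_ab in count_ab.items()
    )


a = [0, 0, 1, 1, 2, 2, 0, 1]
b = [1, 1, 0, 0, 0, 1, 1, 0]

mi_ab = mutual_information(a, b)
mi_ba = mutual_information(b, a)
print(isclose(mi_ab, mi_ba))  # True
```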

Ability to Select Least Important Features

Thanks for making this great package! I have a small feature request which is to be able to select the least important features instead of the most important features using MRMR. I may be misunderstanding MRMR and it may not be able to be used for this. Would it be as simple as changing argmax to argmin on line 134 of main.py to get the least important features?

Extension to Regression?

Thank you for this package @smazzanti and the accompanying Medium piece. I'm glad to see a MRMR implementation here, as it seems like most of the existing Python implementations are somewhat stale (as opposed to the various Boruta packages, which seem to be in active development).

I was wondering if it would be possible to extend this package to regression problems? It seems like the technique should work generally, but I'm curious if there's some nuance I'm missing.

Will more extensions mentioned in uber paper be supported?

Will RDC for redundancy be supported in the near future?

Method

Which methods were used to assess relevance and redundancy, such as Random Forest, Mutual Information, or others?

Release new version on PyPi

The latest version currently available on PyPi (https://pypi.org/project/mrmr-selection/) has the issue mentioned in #23, which creates installation issues in specific cases.

mrmr_base function running forever due to pandas version issue

Hi @smazzanti,

I was running mrmr_classif on my local machine with a pandas DataFrame and a pandas Series as arguments, but the run never stopped and didn't show any error message.

While stepping through the mrmr_base function in a debugger, I ended up with
RecursionError: maximum recursion depth exceeded while calling a Python object

Which is apparently a known issue in pandas.

Upgrading pandas from 1.1.5 to 1.5.2 fixed the issue (but I also needed to upgrade Python, because the latest pandas versions only support recent Python versions).

Illegal instruction (core dumped)

Parallel Issue

~\anaconda3\lib\site-packages\mrmr\main.py in mrmr_base(K, relevance_func, redundancy_func, relevance_args, redundancy_args, denominator_func, only_same_domain, return_scores, show_progress)
96 """
97
---> 98 relevance = relevance_func(**relevance_args)
99 features = relevance[relevance.fillna(0) > 0].index.to_list()
100 relevance = relevance.loc[features]

~\anaconda3\lib\site-packages\mrmr\pandas.py in f_classif(X, y, n_jobs)
43 def f_classif(X, y, n_jobs):
44 return parallel_df(_f_classif, X, y, n_jobs=n_jobs)
---> 45
46
47 def f_regression(X, y, n_jobs):

~\anaconda3\lib\site-packages\mrmr\pandas.py in parallel_df(func, df, series, n_jobs)
17 delayed(func)(df.iloc[:, col_chunk], series)
18 for col_chunk in col_chunks
---> 19 )
20 return pd.concat(lst)
21

~\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1096
1097 with self._backend.retrieval_context():
-> 1098 self.retrieve()
1099 # Make sure that we get a last message telling us we are done
1100 elapsed_time = time.time() - self._start_time

~\anaconda3\lib\site-packages\joblib\parallel.py in retrieve(self)
973 try:
974 if getattr(self._backend, 'supports_timeout', False):
--> 975 self._output.extend(job.get(timeout=self.timeout))
976 else:
977 self._output.extend(job.get())

~\anaconda3\lib\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
565 AsyncResults.get from multiprocessing."""
566 try:
--> 567 return future.result(timeout=timeout)
568 except CfTimeoutError as e:
569 raise TimeoutError from e

~\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()

~\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

how to use it with a classifier

I used a pre-trained VGG16 model and extracted the features from some of its layers. I then took the features from the GAP layer and concatenated them. How can I now use the mRMR module on them?

How do I get the scores for the selected feature?

Is there any way that I can get the scores for the selected k features?

gridsearch

Dear authors,
I would like to know how I can find the optimal number of features using GridSearchCV.
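Since the README describes the output as a ranking whose first elements form a smaller selection, one possible approach is to rank once with a large K and then score each prefix of that ranking. The sketch below is a plain-Python illustration, not a GridSearchCV integration: `score_prefixes`, `toy_score` and the feature names are hypothetical, and `toy_score` merely stands in for a cross-validated model score.

```python
def score_prefixes(ranking, candidate_ks, score_fn):
    """Score the first k ranked features for each k and return (best_k, scores)."""
    scores = {k: score_fn(ranking[:k]) for k in candidate_ks if 0 < k <= len(ranking)}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical ranking, as returned by a single selection call with a large K.
ranking = ["f3", "f7", "f1", "f9", "f4"]


def toy_score(selected):
    """Stand-in for a model score: f3 and f7 help, every feature costs a small penalty."""
    useful = {"f3", "f7"}
    return len(useful & set(selected)) - 0.1 * len(selected)


best_k, scores = score_prefixes(ranking, candidate_ks=[1, 2, 3, 5], score_fn=toy_score)
print(best_k)  # 2
```

In a real setting the ranking would need to be computed on the training part of each fold only, otherwise information from the validation data leaks into the selection.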

Does this method actually remove redundancy?

First, great library and related blog posts. I was beginning to code this procedure and then stumbled upon your work. Here is my question / concern. I am using data likely akin to Uber's for marketing purposes (a mix of continuous and dummy-coded features, some highly predictive, some irrelevant, with correlation between designed features). If I look at the complete list of features and count the features with an absolute correlation over 0.6, there are many. After the feature selection, I see relatively more correlation. The issue seems to be that the F-statistic can be very large for some correlated features and cannot be dampened enough by the denominator.

Here is an example from your quick starts (with a bit of change)

import pandas as pd  # used below for DataFrame and Series
from mrmr import mrmr_classif
from sklearn.datasets import make_classification

# create some data
X, y = make_classification(n_samples = 1000, n_features = 100, n_informative = 10, n_redundant = 40)
X = pd.DataFrame(X)
y = pd.Series(y)


corr_X = X.corr().abs().clip(0.00001)

threshold_corr = 0.6
pdf_feature_cnt_corr = pd.Series(corr_X.apply(lambda x: sum(x > threshold_corr)-1, axis=1))
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind = 'bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\n Above {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])

# use mrmr classification
selected_features = mrmr_classif(X, y, K = 10)

threshold_corr = 0.6
pdf_feature_cnt_corr = pd.Series(corr_X.loc[selected_features,selected_features].apply(lambda x: sum(x > threshold_corr)-1, axis=1))
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind = 'bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\n Above {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])

It seems to me that we have far fewer features, but the ones left show a strong amount of correlation, in terms of the proportion of the model's candidate features that are correlated.
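The concern about the denominator can be illustrated with plain arithmetic. The sketch assumes a quotient score (relevance divided by mean redundancy), which is how the issue describes it. The numbers are hypothetical and chosen only to show the effect: an F-statistic is unbounded, while an absolute correlation lies between 0 and 1.

```python
# Hypothetical numbers, chosen only to illustrate a quotient-style score.
candidates = {
    # name: (relevance as an F-statistic, mean |correlation| with selected features)
    "strong_but_redundant": (500.0, 0.9),
    "weak_but_novel": (5.0, 0.1),
}

# score = relevance / redundancy
scores = {name: rel / red for name, (rel, red) in candidates.items()}
winner = max(scores, key=scores.get)
print(winner)  # strong_but_redundant
```

Here a ninefold difference in redundancy cannot offset a hundredfold difference in relevance, so the redundant feature still wins the round.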

Confusing licensing

https://github.com/smazzanti/mrmr/blob/main/LICENSE is the GPL-3.0 license, but https://github.com/smazzanti/mrmr/blob/main/setup.py identifies the license as MIT. Can you clarify what license actually applies to this package? Thanks in advance.

Cross-validation with MRMR

I'm thinking of using mrmr with a nested cross validation pipeline, do you have any suggestions on how to combine/aggregate the resulting selected features and ranking for each fold?
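One possible way to aggregate per-fold results, offered as a sketch rather than a recommendation from the package author, is to order features by how often they were selected and then by their average position. The function `aggregate_rankings` and the feature names below are hypothetical.

```python
from collections import defaultdict


def aggregate_rankings(fold_rankings):
    """Combine per-fold rankings into one list.

    Features are ordered by how many folds selected them (descending),
    then by their mean position within those folds (ascending).
    """
    positions = defaultdict(list)
    for ranking in fold_rankings:
        for position, feature in enumerate(ranking):
            positions[feature].append(position)

    def sort_key(feature):
        ranks = positions[feature]
        # The feature name is the last tie-breaker, to keep the result deterministic.
        return (-len(ranks), sum(ranks) / len(ranks), feature)

    return sorted(positions, key=sort_key)


# Hypothetical output of a feature selector on three outer folds (K=3 each).
fold_rankings = [
    ["age", "income", "tenure"],
    ["income", "age", "clicks"],
    ["age", "tenure", "income"],
]
print(aggregate_rankings(fold_rankings))  # ['age', 'income', 'tenure', 'clicks']
```

The selection frequency also gives a rough stability measure: a feature chosen in only one fold is a weaker candidate than one chosen in every fold.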

Minor update to documentation

Hello, I love this package. I've used it for work, and it's been very successful at addressing our use case. I have a small request/recommendation. The docstring here indicates that redundancy should return a pd.Series object. We found that another requirement is that the object should have an index with the feature values. Without it, this chunk will update the redundancy dataframe to np.nan values.

Happy to make the adjustment and PR myself if you agree.

Spearman correlation for redundancy

How can I use the Spearman correlation coefficient to calculate redundancy? Is there any way this could be done using spearmanr from scipy?

Link below:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
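As background for this question: the Spearman coefficient is the Pearson coefficient computed on rank-transformed data, so any Pearson-based computation can be turned into a Spearman one by ranking the inputs first. The pure-Python sketch below shows that relationship. The helper names are illustrative, and how such a function would be plugged into the package's redundancy argument is not shown, because the expected callable signature is not documented on this page.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(a), average_ranks(b))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # monotonic but non-linear in x
print(round(spearman(x, y), 6))  # 1.0
print(pearson(x, y) < 1)         # True
```

Note that rank-transforming the feature columns before running a selector would change the relevance computation as well as the redundancy, which may or may not be what is wanted.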

Update PYPI scikit name reference in setup.py

Pip installing sklearn is being deprecated as noted here. The reference to sklearn in setup.py should be updated to use the name scikit-learn.

Happy to submit a PR, but it looks like repo permissions need to be updated to enable external PRs.

Usability within sklearn pipelines

@smazzanti, thank you for this package and the Medium article explaining MRMR in a very good and clear way. I hope this package keeps growing and improving, and I hope to contribute to it soon.

I have started to try it out myself and I was wondering what the intended way is to use it within a sklearn pipeline. For instance, this is how the F ANOVA example in sklearn's docs looks: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection_pipeline.html#sphx-glr-auto-examples-feature-selection-plot-feature-selection-pipeline-py

Would the idea be that we should wrap the mrmr_classif in a SelectKBest object as well to use it as a step in a sklearn Pipeline? If so, how should we control the K parameter inside the cross validation? I saw that SelectKBest has its own k parameter too: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest

This is just a quick question to understand better the proposal, so forgive me in advance for any mistake or lack of clarity. Hopefully we can keep a healthy discussion.

Wish you all the best!

Return the Features Scores

I think it would be nice to return the features scores. Or have it as optional parameter.

Any requirement on joblib version ?

Hello,

I would like to use mrmr_regression to perform some feature selection but when I run the following line :

feature_select_MRMR = mrmr_regression(X=initial_data_set, y=target, K=10, n_jobs=1)

Where my data and target are in pd.DataFrame and pd.Series, I run into the following error :

AttributeError: Can't pickle local object 'correlation.<locals>._correlation'

For some reason, I have to use an environment that contains old libraries, among which is joblib 0.11. I see that mrmr states no requirement on joblib's version, but I was still wondering whether that is really the case (both numpy and pandas satisfy the version requirements).

If this is not a joblib version issue, what am I doing wrong in the way I use this function? How can I get mrmr_regression to work?

Thanks in advance for your answer,
Have a good day.
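For context on the message itself: Python's standard pickle module serialises functions by reference to an importable name, so a function defined inside another function (a "local object") cannot be pickled. The sketch below reproduces that behaviour with the standard library only. The helper names are illustrative, and whether a different joblib version avoids the error in this case is not established in this thread.

```python
import pickle


def make_local_function():
    def _inner(x):  # defined inside another function, so it is a "local object"
        return x + 1
    return _inner


def can_pickle(obj):
    """Return True if the standard pickle module can serialise obj."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, TypeError, pickle.PicklingError):
        return False


print(can_pickle(len))                    # True: built-ins are pickled by name
print(can_pickle(make_local_function()))  # False: no importable name to refer to
```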

A task has failed to un-serialize.

Running the example code, I'm receiving the following error:

_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/envs/feature_creation_env/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py", line 426, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/feature_creation_env/lib/python3.12/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/feature_creation_env/lib/python3.12/site-packages/mrmr/__init__.py", line 1, in <module>
from . import bigquery
File "/opt/conda/envs/feature_creation_env/lib/python3.12/site-packages/mrmr/bigquery.py", line 3, in <module>
from .main import mrmr_base, groupstats2fstat
File "/opt/conda/envs/feature_creation_env/lib/python3.12/site-packages/mrmr/main.py", line 1, in <module>
import pandas as pd
File "/opt/conda/envs/feature_creation_env/lib/python3.12/site-packages/pandas/__init__.py", line 46, in <module>
from pandas.core.api import (
File "/opt/conda/envs/feature_creation_env/lib/python3.12/site-packages/pandas/core/api.py", line 47, in <module>
from pandas.core.groupby import (
File "/opt/conda/envs/feature_creation_env/lib/python3.12/site-packages/pandas/core/groupby/__init__.py", line 1, in <module>
from pandas.core.groupby.generic import (
File "/opt/conda/envs/feature_creation_env/lib/python3.12/site-packages/pandas/core/groupby/generic.py", line 67, in <module>
from pandas.core.frame import DataFrame
...
--> 754 raise self._result
755 return self._result
756 finally:

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

Publish sdist to pypi for conda

I want to use this package over conda, and therefore try to add it to conda-forge over here. However, conda-forge prefers source distributions rather than wheels. Is it possible to publish those on PyPI as well?

Problem to install library

pip install mrmr
Collecting mrmr
Using cached mrmr-0.9.2.tar.gz (18 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "C:\Users\anton\AppData\Local\Temp\pip-install-zngtmzxk\mrmr_a64358380c864590abc49eb48cf68abc\setup.py", line 9, in
from mrmr import version as mrmr_version
File "C:\Users\anton\AppData\Local\Temp\pip-install-zngtmzxk\mrmr_a64358380c864590abc49eb48cf68abc\lib\mrmr_init.py", line 5, in
from ._discretemrmr import *
File "C:\Users\anton\AppData\Local\Temp\pip-install-zngtmzxk\mrmr_a64358380c864590abc49eb48cf68abc\lib\mrmr_discretemrmr.py", line 31, in
from fakemp import farmout, farmworker
ModuleNotFoundError: No module named 'fakemp'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

redundancy score is all nan

The functions returned all-NaN redundancy scores even though the relevance scores are not NaN.

Support for M1/M2 Macs?

The import fails on Apple silicon Macs. Any help?

confuse mrmr_regression

Hi @smazzanti,
I just found this package through your Medium article. Thank you for developing the mrmr package to handle regression problems; the Medium article is brilliant.
I have a dataset which, after one-hot encoding, has 800 dummy features, and my target variable is continuous.
What confuses me is whether I am allowed to use the mrmr_regression function to select among the dummy features.
I've read the code of the mrmr_regression function: if my features are categorical variables (in my case, binary 0/1), the Pearson correlation is computed between these categorical variables and the continuous target variable to calculate redundancy.
I thought the Pearson correlation could only be used between two continuous variables.
Therefore, I'd like to check whether mrmr_regression can take categorical variables for selection or not.
If it can, do I need to modify the arguments for the redundancy parameter?

Thank you for taking the time to read my issue. I hope you can give me some advice.
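On the statistical part of this question: the Pearson coefficient is well defined when one variable is binary, and in that case it equals the point-biserial correlation. The sketch below checks the identity numerically in plain Python. The function names and the toy data are illustrative, and the sketch says nothing about whether this is the best statistic for one-hot encoded features.

```python
from math import isclose, sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def point_biserial(binary, values):
    """(M1 - M0) / s * sqrt(p * q), with s the population standard deviation of values."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    mean_all = sum(values) / n
    std_all = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    p, q = len(group1) / n, len(group0) / n
    mean1, mean0 = sum(group1) / len(group1), sum(group0) / len(group0)
    return (mean1 - mean0) / std_all * sqrt(p * q)


dummy = [0, 0, 0, 1, 1, 1, 1, 0]                    # a 0/1 dummy feature
target = [1.0, 2.0, 1.5, 3.0, 4.0, 3.5, 5.0, 2.5]   # a continuous target
print(isclose(pearson(dummy, target), point_biserial(dummy, target)))  # True
```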

About regression problem

If I want to work on a regression problem, should I change the line 'from sklearn.datasets import make_classification' to 'from sklearn.datasets import make_regression'?

ModuleNotFoundError: No module named 'polars'

On attempting to import received the following error: ModuleNotFoundError: No module named 'polars'

I believe the polars module needs to be added to setup.py.

Order of selected features

Hi,
it seems that selected features are not ordered by relevance :

selected_features, relevance, redundacy = outputs_from_mrmr
print(relevance.values[selected_features[0:5]])
[77.04191317 7.65999306 2.1369512 71.32355801 44.90570244]

Is this a bug, or is the user supposed to sort the values to get a decreasing curve of feature relevance?
Or maybe I missed something?

Thanks a lot

mRMR Access - License

@smazzanti, thank you for your clearcut explanations in your Medium articles and in your documentation on Github. I'm excited to see this package grow even further in the coming years.

I was wondering if you could add a license to the Github README.md, just for some additional clarification on the permissions that are available with the mRMR library. I would like to use and potentially modify parts of this code.

Let me know as soon as possible. Thanks!

mrmr_classif doesn't return any results

Hi everyone,

I've been using the mrmr_classif method from the mRMR library for a while without any issues. However, recently it started returning no results and doesn't give any error messages. I've tried running it on Google Colab to see if the issue was specific to my local environment, but I encountered the same problem there.

Has anyone else experienced this or have any suggestions on how to resolve it?

Allow "fixed" features

Sometimes a user may wish to require certain features to be present in the feature space (for inference purposes/interpretability, etc.) and determine what other features minimize redundancy and maximize relevance with those features present.

Suggestion: revise mrmr_regression/classif function to take an optional list of k "fixed_features" such that the function returns a list of N features that always includes those k features and N-k other features that minimize redundancy/maximize relevance in the presence of those k fixed features.

AttributeError: module 'polars' has no attribute 'pearson_corr'

Looks like polars made a breaking change at some point and renamed it to .corr (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.corr.html#polars.corr)

Error when executing MRMR - IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)

Hello, I get the error below when executing mrmr. As far as I could find here, this error indicates that the X and y used have different shapes, but mine do not. Is there a way to solve this error?

My data used with the mrmr example (it's a matrix of 0s and 1s):

from mrmr import mrmr_classif
selected_features = mrmr_classif(X=X_dtm, y=y, K=10)

The error when I execute with my data:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\externals\loky\process_executor.py", line 463, in _process_worker
    r = call_item()
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\externals\loky\process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py", line 589, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py", line 589, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\mrmr\pandas.py", line 31, in _f_classif
    return X.apply(lambda col: _f_classif_series(col, y)).fillna(0.0)
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 9423, in apply
    return op.apply().__finalize__(self, method="apply")
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\apply.py", line 678, in apply
    return self.apply_standard()
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\apply.py", line 798, in apply_standard
    results, res_index = self.apply_series_generator()
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\apply.py", line 814, in apply_series_generator
    results[i] = self.f(v)
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\mrmr\pandas.py", line 31, in <lambda>
    return X.apply(lambda col: _f_classif_series(col, y)).fillna(0.0)
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\mrmr\pandas.py", line 29, in _f_classif_series
    return sklearn_f_classif(x[x_not_na].to_frame(), y[x_not_na])[0][0]
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 1029, in __getitem__
    key = check_bool_indexer(self.index, key)
  File "C:\Users\vitor\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexing.py", line 2506, in check_bool_indexer
    raise IndexingError(
pandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
"""

The above exception was the direct cause of the following exception:

IndexingError                             Traceback (most recent call last)
Cell In [54], line 10
      3 # from sklearn.datasets import make_classification
      4 # X, y = make_classification(n_samples = 1000, n_features = 50, n_informative = 10, n_redundant = 40)
      5 # X = pd.DataFrame(X)
      6 # y = pd.Series(y)
      7 
      8 # select top 10 features using mRMR
      9 from mrmr import mrmr_classif
---> 10 selected_features = mrmr_classif(X=X_dtm, y=y, K=10)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\mrmr\pandas.py:171, in mrmr_classif(X, y, K, relevance, redundancy, denominator, cat_features, cat_encoding, only_same_domain, return_scores, n_jobs, show_progress)
    168 relevance_args = {'X': X, 'y': y}
    169 redundancy_args = {'X': X}
--> 171 return mrmr_base(K=K, relevance_func=relevance_func, redundancy_func=redundancy_func,
    172                  relevance_args=relevance_args, redundancy_args=redundancy_args,
    173                  denominator_func=denominator_func, only_same_domain=only_same_domain,
    174                  return_scores=return_scores, show_progress=show_progress)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\mrmr\main.py:98, in mrmr_base(K, relevance_func, redundancy_func, relevance_args, redundancy_args, denominator_func, only_same_domain, return_scores, show_progress)
     44 def mrmr_base(K, relevance_func, redundancy_func,
     45               relevance_args={}, redundancy_args={},
     46               denominator_func=np.mean, only_same_domain=False,
     47               return_scores=False, show_progress=True):
     48     """General function for mRMR algorithm.
     49 
     50     Parameters
   (...)
     95         List of selected features.
     96     """
---> 98     relevance = relevance_func(**relevance_args)
     99     features = relevance[relevance.fillna(0) > 0].index.to_list()
    100     relevance = relevance.loc[features]

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\mrmr\pandas.py:45, in f_classif(X, y, n_jobs)
     44 def f_classif(X, y, n_jobs):
---> 45     return parallel_df(_f_classif, X, y, n_jobs=n_jobs)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\mrmr\pandas.py:17, in parallel_df(func, df, series, n_jobs)
     15 n_jobs = min(cpu_count(), len(df.columns)) if n_jobs == -1 else min(cpu_count(), n_jobs)
     16 col_chunks = np.array_split(range(len(df.columns)), n_jobs)
---> 17 lst = Parallel(n_jobs=n_jobs)(
     18     delayed(func)(df.iloc[:, col_chunk], series)
     19     for col_chunk in col_chunks
     20 )
     21 return pd.concat(lst)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py:1952, in Parallel.__call__(self, iterable)
   1946 # The first item from the output is blank, but it makes the interpreter
   1947 # progress until it enters the Try/Except block of the generator and
   1948 # reach the first `yield` statement. This starts the aynchronous
   1949 # dispatch of the tasks to the workers.
   1950 next(output)
-> 1952 return output if self.return_generator else list(output)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py:1595, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1592     yield
   1594     with self._backend.retrieval_context():
-> 1595         yield from self._retrieve()
   1597 except GeneratorExit:
   1598     # The generator has been garbage collected before being fully
   1599     # consumed. This aborts the remaining tasks if possible and warn
   1600     # the user if necessary.
   1601     self._exception = True

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py:1699, in Parallel._retrieve(self)
   1692 while self._wait_retrieval():
   1693 
   1694     # If the callback thread of a worker has signaled that its task
   1695     # triggered an exception, or if the retrieval loop has raised an
   1696     # exception (e.g. `GeneratorExit`), exit the loop and surface the
   1697     # worker traceback.
   1698     if self._aborting:
-> 1699         self._raise_error_fast()
   1700         break
   1702     # If the next job is not ready for retrieval yet, we just wait for
   1703     # async callbacks to progress.

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py:1734, in Parallel._raise_error_fast(self)
   1730 # If this error job exists, immediatly raise the error by
   1731 # calling get_result. This job might not exists if abort has been
   1732 # called directly or if the generator is gc'ed.
   1733 if error_job is not None:
-> 1734     error_job.get_result(self.timeout)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py:736, in BatchCompletionCallBack.get_result(self, timeout)
    730 backend = self.parallel._backend
    732 if backend.supports_retrieve_callback:
    733     # We assume that the result has already been retrieved by the
    734     # callback thread, and is stored internally. It's just waiting to
    735     # be returned.
--> 736     return self._return_or_raise()
    738 # For other backends, the main thread needs to run the retrieval step.
    739 try:

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\joblib\parallel.py:754, in BatchCompletionCallBack._return_or_raise(self)
    752 try:
    753     if self.status == TASK_ERROR:
--> 754         raise self._result
    755     return self._result
    756 finally:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

multiclass Y

Thank you for this package, @smazzanti. I would like to ask how multi-target selection works: is it handled the same way as a single label? I noticed it mentioned in the introduction, but it doesn't seem to be in the code. Thanks.

Final Score

Is it possible to additionally return the final score used for sorting? I would like to make the number of features actually used dependent on a threshold on this score. Or is there a simple formula so that I can calculate this using redundancy and relevance?
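As a rough illustration of the kind of formula being asked about, here is a minimal sketch of a greedy mRMR loop that records the score of each pick. It assumes the quotient scheme (relevance divided by the mean redundancy with the features already selected), which the `denominator_func=np.mean` default visible in the `mrmr_base` signature in the traceback above suggests but does not confirm. The function name `greedy_mrmr_scores`, the `FLOOR` constant, and the plain-dict inputs are hypothetical and not part of the package.

```python
from statistics import mean

FLOOR = 1e-3  # hypothetical lower bound to avoid dividing by zero


def greedy_mrmr_scores(relevance, redundancy, K):
    """Greedy selection that also records the score of each pick.

    relevance:  dict feature -> relevance with the target (e.g. F-statistic)
    redundancy: dict feature -> dict feature -> absolute correlation
    Assumed score: relevance / mean redundancy with already selected features.
    """
    selected, scores = [], {}
    # Only features with positive relevance are candidates.
    candidates = [f for f, r in relevance.items() if r > 0]

    while candidates and len(selected) < K:

        def score(f):
            if not selected:  # first pick: relevance alone
                return relevance[f]
            denom = mean(redundancy[f][s] for s in selected)
            return relevance[f] / max(denom, FLOOR)

        best = max(candidates, key=score)
        scores[best] = score(best)  # score at the moment of selection
        selected.append(best)
        candidates.remove(best)

    return selected, scores


relevance = {"a": 10.0, "b": 9.0, "c": 4.0}
redundancy = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"a": 0.9, "c": 0.2},
    "c": {"a": 0.1, "b": 0.2},
}
selected, scores = greedy_mrmr_scores(relevance, redundancy, K=2)
print(selected, scores)
```

Note that the score of each pick depends on which features were selected before it, so scores from different steps are not directly comparable; a threshold on them would need to account for that.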

Possible error in docstring mrmr_classif (pandas)

Hello,

I think there is an imprecision in the docstring of the function "mrmr_classif" present in the file pandas.py.
The docstring says:

cat_features: list (optional, default=None)
    List of categorical features. If None, all string columns will be encoded.

but from the code I see that the function that performs the encoding (i.e. encode_df) is called only if cat_features is provided. I do not see any point in the code where string columns are encoded automatically (maybe I am missing something).

All the best,

FunctionTransformer with mrmr_regression & gridsearchCV Issue

Hello,

I attempted to use mrmr_regression in a pipeline with GridSearchCV to optimize the argument K as a hyperparameter and ran into the following issue:

I wrote a function that returns only a dataframe restricted to the selected features.
Then, I used FunctionTransformer to convert this function into a transformer to be used in a pipeline.
After adding an arbitrary sklearn model (sklearn.kernel_ridge.KernelRidge) to the pipeline and trying to use GridSearchCV, I got the following error: 'numpy.ndarray' object has no attribute 'columns'. The same error came up when calling "pipe.fit()" without GridSearchCV.

I think this happens because your function 'parallel_df', called in 'f_regression', uses df.columns, while the pipeline or grid search may be passing it a 2D array. Do you have any suggestions on how to get around this issue?

Thank you very much
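One possible workaround for the error described above is to convert array input back into a DataFrame before calling a selector that relies on column names. The sketch below is only an illustration: `as_dataframe`, `select_top_k`, the generated column names, and the `first_k` stand-in selector are hypothetical. It assumes the selector can be called as `selector(X=df, y=series, K=k)` and returns a list of column names, which is the call shape the README uses for `mrmr_classif`.

```python
import numpy as np
import pandas as pd


def as_dataframe(X, columns=None):
    """Return X as a DataFrame, naming columns when X is a bare array."""
    if isinstance(X, pd.DataFrame):
        return X
    X = np.asarray(X)
    if columns is None:
        columns = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names
    return pd.DataFrame(X, columns=columns)


def select_top_k(X, y, K, selector):
    """Run a column-name-based selector on array or DataFrame input.

    `selector` is any callable of the form selector(X=df, y=series, K=k)
    returning a list of column names (assumed shape, see lead-in).
    """
    df = as_dataframe(X)
    # Align the target with the feature index so boolean masks line up.
    target = pd.Series(np.asarray(y).ravel(), index=df.index)
    selected = selector(X=df, y=target, K=K)
    return df[selected]


# Stand-in selector for demonstration only: keeps the first K columns.
def first_k(X, y, K):
    return list(X.columns[:K])


out = select_top_k(np.arange(12.0).reshape(4, 3), [0, 1, 2, 3], 2, first_k)
print(out.columns.tolist())
```

Inside a pipeline, a helper like this would typically be called from the `fit` method of a custom transformer, since a plain function wrapped by FunctionTransformer does not receive `y` when transforming.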

Pypi Release

Great package! Any plans to release this on PyPI for easier installation?

n_jobs argument with random forest

Version 0.2.4 introduces this bug:
TypeError: random_forest_classif() got an unexpected keyword argument 'n_jobs'

Could you please have a look?
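For readers hitting the same `TypeError`, the general pattern behind it is a caller forwarding a keyword argument to a function whose signature does not accept it. The sketch below shows one generic way to guard such a call with `inspect.signature`. It is not the package's fix; `call_with_supported_kwargs`, `f_relevance`, and `forest_relevance` are hypothetical names used only for illustration.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_any:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **kwargs)


# Hypothetical stand-ins for two relevance functions with different signatures.
def f_relevance(X, y, n_jobs):
    return ("f", n_jobs)


def forest_relevance(X, y):
    return ("rf", None)


print(call_with_supported_kwargs(f_relevance, "X", "y", n_jobs=4))
print(call_with_supported_kwargs(forest_relevance, "X", "y", n_jobs=4))
```

Silently dropping arguments can hide real mistakes, so a guard like this is best limited to optional tuning parameters such as `n_jobs`.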
