dunnkers / fseval

Benchmarking framework for Feature Selection and Feature Ranking algorithms 🚀

Home Page: https://dunnkers.com/fseval

License: MIT License

Python 91.04% JavaScript 7.03% CSS 1.04% TypeScript 0.60% Dockerfile 0.19% Shell 0.09%
benchmarks feature-rankers feature-selection wandb hydra machine-learning python feature-ranking benchmarking benchmarking-framework scikit-learn automl

fseval's Introduction

fseval

Benchmarking framework for Feature Selection and Feature Ranking algorithms 🚀

Demo

Open In Colab

Install

  1. Installation through PyPI ⭐️ RECOMMENDED OPTION

    pip install fseval
  2. Installation from source

    git clone https://github.com/dunnkers/fseval.git
    cd fseval
    pip install -r requirements.txt
    pip install .

You can now use import fseval in your Python code, or use the fseval command in your terminal. For an example, run fseval --help. For more information, see the documentation link below.

Documentation

See the documentation.

About

Built at the University of Groningen and published in The Journal of Open Source Software (JOSS).

The project has early roots in another project: a feature selection algorithm called FeatBoost (see the full citation below).

A. Alsahaf, N. Petkov, V. Shenoy, G. Azzopardi, "A framework for feature selection through boosting", Expert Systems with Applications, Volume 187, 2022, 115895, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2021.115895.

The open source Python code of FeatBoost is available at https://github.com/amjams/FeatBoost.


2023 - Jeroen Overschie

fseval's People

Contributors

amjams, dependabot[bot], dunnkers, geazzo

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

fseval's Issues

Pipeline: compatibility with sklearn

Can we still make the pipeline compatible with sklearn?

  • The Pipeline class now takes Dataset, CrossValidator and friends.
  • -> Can we pass these via cfg instead? See the sketch below.
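
A minimal sketch of what this could look like (hypothetical, not fseval's actual API): accept plain config in the constructor and instantiate Dataset / CrossValidator lazily inside fit(), so that get_params()/set_params() keep working the way sklearn expects.

# hypothetical sketch of an sklearn-compatible Pipeline that takes config instead of objects
from hydra.utils import instantiate
from sklearn.base import BaseEstimator

class Pipeline(BaseEstimator):
    def __init__(self, dataset_cfg=None, cv_cfg=None):
        # storing constructor args unmodified keeps get_params()/set_params() intact
        self.dataset_cfg = dataset_cfg
        self.cv_cfg = cv_cfg

    def fit(self, X, y):
        # build the heavier objects only when fitting
        dataset = instantiate(self.dataset_cfg)
        cv = instantiate(self.cv_cfg)
        # ... run the feature ranking / validation steps using `dataset` and `cv` ...
        return self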

Multiprocessing support - caching

To support caching:

  • Make wandb.run.id, and everything else we need to save pickled files to the filesystem, available early on in the _fit_estimator context. See the caching sketch below.
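
One way the caching itself could look (a sketch; fit_estimator_cached, cache_dir and fold are made-up names, not fseval's API): key pickled estimators by the wandb run id, so a worker can restore a fitted estimator instead of re-fitting it.

import os
import pickle

def fit_estimator_cached(estimator, X, y, cache_dir, run_id, fold):
    """Fit an estimator, or restore it from a pickle cache if one already exists."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, f"{run_id}_fold_{fold}.pickle")

    if os.path.exists(cache_file):
        # cache hit: skip re-fitting
        with open(cache_file, "rb") as handle:
            return pickle.load(handle)

    estimator.fit(X, y)
    with open(cache_file, "wb") as handle:
        pickle.dump(estimator, handle)
    return estimator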

Allow decoupling metric calculation

This way, users can re-run runs and compute new metrics by themselves.

  • ⚠ score["importance/r2_score"].astype(float) was removed: we need to make sure the DataFrame dtypes are always set explicitly, even if all values are null (see the sketch below).
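
For illustration (not fseval code; only the column name is taken from the metric list below), explicitly setting the dtype keeps the column well-typed even when every value is null:

import pandas as pd

score = pd.DataFrame({"importance/r2_score": [None, None]})
print(score.dtypes)  # object: ambiguous when all values are null
score = score.astype({"importance/r2_score": float})
print(score.dtypes)  # float64, even though the column is entirely NaN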

Moving metric calculation to new API:

  • importance/r2_score
  • importance/log_loss
  • support/accuracy
  • ranking/r2_score
  • Upload validation tables
  • Upload ranking tables
  • Upload charts

Allow custom metrics:

  • Subset validator

Incorrect feature importance ground-truths

Bootstrap sampling also appears to shuffle the dataset dimensions, e.g.:

(screenshot: Screen Shot 2021-06-16 at 23 27 51)

.. where, presumably, the relevant dimensions should be the first 4. This might not be the case in practice.

-> Does the Resample bootstrap reshuffle the dataset dimensions, i.e. make different dimensions relevant than the ones defined in the ground truth? See also the quick check after the code below.

def _score_with_feature_importances(self, score, X_importances):
    """Scores this feature ranker with the available dataset ground-truth relevant
    features, which are to be known apriori. Supports three types of feature rankings:
    - a real-valued feature importance vector
    - a boolean-valued feature support vector
    - an integer-valued feature ranking vector."""

    ### Feature importances
    if self.ranker.estimates_feature_importances:
        # predicted feature importances, normalized.
        y_pred = np.asarray(self.ranker.feature_importances_)
        y_pred = y_pred / sum(y_pred)

        # r2 score
        y_true = X_importances
        score["importance.r2_score"] = r2_score(y_true, y_pred)

        # log loss
        y_true = X_importances > 0
        score["importance.log_loss"] = log_loss(y_true, y_pred, labels=[0, 1])

    ### Feature support
    if self.ranker.estimates_feature_support:
        # predicted feature support
        y_pred = np.asarray(self.ranker.feature_support_, dtype=bool)

        # accuracy
        y_true = X_importances > 0
        score["support.accuracy"] = accuracy_score(y_true, y_pred)

    ### Feature ranking
    # grab ranking through either (1) `ranking_` or (2) `feature_importances_`
    ranking = None
    if self.ranker.estimates_feature_ranking:
        ranking = self.ranker.feature_ranking_
    elif self.ranker.estimates_feature_importances:
        ranking = self.ranker.feature_importances_

    # compute ranking r2 score
    if ranking is not None:
        # predicted feature ranking, re-ordered and normalized.
        y_pred = self._scores_to_ranking(ranking)
        y_pred = y_pred / sum(y_pred)

        # convert ground-truth to a ranking as well.
        y_true = self._scores_to_ranking(X_importances)
        y_true = y_true / sum(y_true)

        # in r2 score, only consider **relevant** features, not irrelevant ones. in
        # this way, when `X_importances = [0, 2, 4, 0, 0]` we do not get misleadingly
        # high scores because the ranking also orders the irrelevant features.
        sample_weight = np.ones_like(X_importances)
        sample_weight[X_importances == 0] = 0.0

        # r2 score
        score["ranking.r2_score"] = r2_score(
            y_true, y_pred, sample_weight=sample_weight
        )
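
As a quick check on the question above (a sketch, assuming the Resample bootstrap is based on sklearn.utils.resample): resample only draws samples (rows) with replacement and does not permute columns, so any reshuffling of the feature dimensions would have to happen elsewhere.

import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(5, 4)       # 5 samples, 4 features
X_boot = resample(X, random_state=0)  # bootstrap over samples (rows)

# every bootstrapped row equals an original row, so the feature columns are not permuted
print(all((row == X).all(axis=1).any() for row in X_boot))  # True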

Add diagram in README

In README, add a diagram explaining the pipeline. Where CV is applied, bootstraps, etc.

By default, return results as DataFrames

Return:

  1. The experiment config as a DataFrame (this is not strictly necessary; the user already has access to the config at this point)
  2. The on_table columns - each as one DataFrame

e.g.

@hydra.main(config_path="conf", config_name="my_config")
def main(cfg: PipelineConfig) -> None:
    results: dict = run_pipeline(cfg)

    # append to all these results
    results["feature_importance"].to_csv("my_results.csv")


if __name__ == "__main__":
    main()

tabnet + xor error

[2021-05-08 18:34:31,725][fseval.experiment][INFO] - TabNet feature ranking: [0.25795756, 0.31239626, 0.00081796, 0.077292, 0.05039946, 0.13356205, 0.00071954, 0.00776358, 0.00080948, 0.15828211]
wandb: WARNING feature_importances_ or coef_ attribute not in classifier. Cannot plot feature importances.
Error executing job with overrides: ['ranker=tabnet', 'dataset=xor']
Traceback (most recent call last):
  File "/Users/dunnkers/git/fseval_2.0/fseval/main.py", line 10, in main
    experiment.run()
  File "/Users/dunnkers/git/fseval_2.0/fseval/experiment.py", line 71, in run
    ranker_log["ranker_log_loss"] = log_loss(
  File "/Users/dunnkers/.pyenv/versions/3.9.2/envs/fseval/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/Users/dunnkers/.pyenv/versions/3.9.2/envs/fseval/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 2249, in log_loss
    transformed_labels = lb.transform(y_true)
  File "/Users/dunnkers/.pyenv/versions/3.9.2/envs/fseval/lib/python3.9/site-packages/sklearn/preprocessing/_label.py", line 350, in transform
    return label_binarize(y, classes=self.classes_,
  File "/Users/dunnkers/.pyenv/versions/3.9.2/envs/fseval/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/Users/dunnkers/.pyenv/versions/3.9.2/envs/fseval/lib/python3.9/site-packages/sklearn/preprocessing/_label.py", line 543, in label_binarize
    raise ValueError("%s target data is not supported with label "
ValueError: continuous target data is not supported with label binarization
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
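
The error is raised because log_loss receives a continuous ground-truth vector, and continuous targets cannot be label-binarized. A minimal sketch of the likely fix (variable names are illustrative), mirroring the y_true = X_importances > 0 step used in the scoring code above:

import numpy as np
from sklearn.metrics import log_loss

X_importances = np.array([0.0, 0.5, 0.5, 0.0])  # continuous ground-truth importances
y_pred = np.array([0.26, 0.31, 0.08, 0.05])     # ranker's predicted importances

# binarize the ground truth (relevant vs. irrelevant) before computing log loss
y_true = X_importances > 0
print(log_loss(y_true, y_pred, labels=[0, 1]))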

Revamp

Config

  • Put all config that should actually be built-in inside config.py, e.g. using cs.store (see the sketch below). Remove all yaml files: this is a library.
  • Expose all .py config modules in one file, e.g. StorageConfig, EstimatorConfig. This way, the library can be configured much more easily.
  • Move all .yaml config that should not be built-in to the testing environment.
  • Put everything from my_config.yaml into config.py.
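
A minimal sketch of what registering built-in config through Hydra's ConfigStore could look like (the dataclass fields are placeholders; only the class names StorageConfig and PipelineConfig are taken from this repository):

# config.py (sketch): register structured configs instead of shipping .yaml files
from dataclasses import dataclass, field

from hydra.core.config_store import ConfigStore

@dataclass
class StorageConfig:
    load_dir: str = ""  # placeholder fields, for illustration only
    save_dir: str = ""

@dataclass
class PipelineConfig:
    storage: StorageConfig = field(default_factory=StorageConfig)

cs = ConfigStore.instance()
cs.store(name="base_pipeline_config", node=PipelineConfig)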

Documentation

  • The main example should be 3 steps: (1) a code file, (2) executing a multirun from the command line, and (3) showing a basic comparison plot.
  • Show a pipeline diagram

Incompatibility between local- and wandb storage providers

When using local storage, files are stored in:

multirun/2021-05-31/09-27-35/0

When using wandb, files are stored in the same directory, but wandb will presumably not pick them up. This should be tested. This is the directory they should then be in:

multirun/2021-05-31/09-27-35/0/wandb/run-20210621_022823-2xb0n6rz/files

..OR does it all work, and is it only the script in msc-thesis that does not pick up the 'new' directories?
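
One possible way to reconcile the two (a sketch, not fseval's implementation): resolve the save directory from wandb.run.dir whenever a wandb run is active, so the local and wandb storage providers both write to the directory that wandb syncs.

import wandb

def resolve_save_dir(default_dir: str) -> str:
    """Return the active wandb run's files directory, or the local default."""
    if wandb.run is not None:
        # e.g. multirun/2021-05-31/09-27-35/0/wandb/run-20210621_022823-2xb0n6rz/files
        return wandb.run.dir
    return default_dir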

Metrics run

  1. Re-run k-NN cohort
  2. Re-run cohort-1
  • normalize feature importances
  • validate feature subset
  • compute feature subset stability

Make OpenML columns configurable

-> Currently, the adapter assumes that only the quantitative columns are relevant. But we might want to apply one-hot encoding to transform discrete columns into quantitative ones (see the sketch below).
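
A minimal sketch of what that could look like with the openml Python package (dataset id 31 / credit-g is just an example; this is not the adapter's actual code):

import openml
import pandas as pd

# fetch a dataset and find out which columns are categorical (discrete)
dataset = openml.datasets.get_dataset(31)  # example: credit-g
X, y, categorical_mask, names = dataset.get_data(
    target=dataset.default_target_attribute, dataset_format="dataframe"
)
discrete_columns = [name for name, is_cat in zip(names, categorical_mask) if is_cat]

# one-hot encode the discrete columns so they become quantitative as well
X_encoded = pd.get_dummies(X, columns=discrete_columns)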

OpenML / wandb always need to be both installed

(fseval-readme-example) ➜  fseval-readme-example python somebenchmark.py --help
Traceback (most recent call last):
  File "/Users/dunnkers/git/fseval-readme-example/somebenchmark.py", line 4, in <module>
    from fseval.adapters import OpenML
  File "/Users/dunnkers/git/fseval/fseval/adapters/__init__.py", line 2, in <module>
    from .wandb import Wandb
  File "/Users/dunnkers/git/fseval/fseval/adapters/wandb.py", line 4, in <module>
    import wandb
ModuleNotFoundError: No module named 'wandb'

Because we are always importing both modules:

from .openml import OpenML
from .wandb import Wandb
__all__ = ["OpenML", "Wandb"]
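
One possible fix (a sketch, not necessarily the route fseval should take): import the adapters lazily through a module-level __getattr__ (PEP 562), so wandb is only imported when the Wandb adapter is actually requested.

# fseval/adapters/__init__.py (sketch)
def __getattr__(name):
    # resolve adapters on first attribute access instead of at import time
    if name == "OpenML":
        from .openml import OpenML
        return OpenML
    if name == "Wandb":
        from .wandb import Wandb
        return Wandb
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

__all__ = ["OpenML", "Wandb"]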
