scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications

Home Page: https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 95.16%, Batchfile 2.76%, Makefile 2.08%

enhancement_proposals's Introduction

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies

scikit-learn requires:

  • Python (>= 3.9)
  • NumPy (>= 1.19.5)
  • SciPy (>= 1.6.0)
  • joblib (>= 1.2.0)
  • threadpoolctl (>= 2.0.0)

Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 1.0 and later require Python 3.7 or newer. scikit-learn 1.1 and later require Python 3.8 or newer.

Scikit-learn plotting capabilities (i.e., functions starting with plot_ and classes ending with Display) require Matplotlib (>= 3.3.4), as does running the examples. A few examples additionally require scikit-image >= 0.17.2, a few require pandas >= 1.1.5, and some require seaborn >= 0.9.0 and plotly >= 5.14.0.

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

pip install -U scikit-learn

or conda:

conda install -c conda-forge scikit-learn

The documentation includes more detailed installation instructions.

Changelog

See the changelog for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Source code

You can check the latest sources with the command:

git clone https://github.com/scikit-learn/scikit-learn.git

Contributing

To learn more about making a contribution to scikit-learn, please see our Contributing guide.

Testing

After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 7.1.2 installed):

pytest sklearn

See the web page https://scikit-learn.org/dev/developers/contributing.html#testing-and-improving-test-coverage for more information.

Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.
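
For example, to run the suite with a fixed seed on a Unix-like shell:

SKLEARN_SEED=42 pytest sklearn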

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

The project is currently maintained by a team of volunteers.

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation

Communication

Citation

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn

enhancement_proposals's People

Contributors

adrinjalali, agramfort, amueller, gaelvaroquaux, glemaitre, jjerphan, jnothman, lorentzenchr, nicolashug, tguillemot, thomasjpfan

enhancement_proposals's Issues

SLEP006 (sample props) should handle when metaestimator consumes the same key as its descendant

From scikit-learn/scikit-learn#21284 (comment):

The case of a metaestimator adding support for a prop that is requested by its child is indeed a tricky one. I can't yet see a way to make this generally backwards compatible within the SLEP006 proposal. This makes me sad.

Indeed, a metaestimator supporting the same prop name as one of its children is generally tricky. That is, if the metaestimator supports metadata x and its child requests metadata x, the metaestimator should only work where either:

  • the child's request aliases x to another name without such a clash; or
  • the child's request and the metaestimator's request for x imply being passed the same keyword argument.

In other cases, this must raise an error. This is something, I'm pretty sure, we've not yet covered in SLEP006 (and it's a pretty messy and intricate consequence of having the caller responsible for delivering metadata in accordance with the request).
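
As a rough sketch of the first escape hatch (aliasing), here is what it looks like with the set_fit_request-style API that later revisions of SLEP006 converged on; this requires scikit-learn >= 1.4 with metadata routing explicitly enabled, so treat the exact calls as illustrative rather than as part of this proposal:

import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

sklearn.set_config(enable_metadata_routing=True)

X = np.random.rand(30, 3)
y = np.tile([0, 1], 15)
w = np.ones(30)

# The child aliases sample_weight to a distinct key ("inner_weight" is
# arbitrary), so a router that handles sample_weight itself sees no clash:
clf = LogisticRegression().set_fit_request(sample_weight="inner_weight")
cross_validate(clf, X, y, params={"inner_weight": w})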

Deprecation would be pretty tricky as far as I can tell.

Scorers might need to know about training and testing data

This is not a PR because I haven't written it yet. It's more of a very loose RFC.

I think scorers might need to be able to distinguish between training and test data.
I think there were more cases, but there are two obvious ones:

  • R^2 is currently computed using the test-set mean. That seems really odd, and breaks for LOO.
  • When doing cross-validation, the classes that are present can change, which can impact things like macro-F1 in weird ways, and can also lead to errors in LOO (scikit-learn/scikit-learn#4546).
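
To make the first point concrete, here is a minimal sketch (plain NumPy, nothing scikit-learn-specific) of why the current definition collapses under leave-one-out:

import numpy as np

y_true = np.array([3.0])   # a single LOO test sample
y_pred = np.array([2.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # 0.25
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 0.0: the test-set mean is the sample itself
# R^2 = 1 - ss_res / ss_tot  -> division by zero, so the score is undefined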

I'm not sure if this is a good enough case yet, but I wanted somewhere to take a note ;)

SLEP007: Meta-estimators section expansion

@amueller has the following concern on the meta-estimators section of the SLEP:

Shouldn't we list all meta-estimators that are transformers? What about FeatureUnion and RFECV? I guess maybe we're talking about meta-estimators that are not feature selectors, because those are easy.

I'm personally not sure. I'd kind of prefer a guideline that can be applied to meta-estimators, rather than listing them in the SLEP.

SLEP 007 feature-name generation: adding a constructor argument for verbosity

Basically, right now SLEP 007 suggests adding a constructor parameter to all transformers that are not feature selectors, right?

I see the motivation and I think it's actually hard to come up with a better solution (I don't have one right now), but I'm also concerned with adding feature name specific things to the constructor. It seems orthogonal to the working of the estimator.

I think the alternative that @adrinjalali suggested was having callbacks for formatting feature names (I don't remember the details tbh), but that was pretty complex.

Maybe we could have a method that people could overwrite or something like that? I'm not sure. To me this is the one contentious part of the SLEP.
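
For concreteness, a hypothetical sketch of the two shapes being weighed here; neither the class nor the parameter name is part of the SLEP:

from sklearn.base import BaseEstimator, TransformerMixin

class VerboseFlagTransformer(TransformerMixin, BaseEstimator):
    # Option A: a constructor flag controlling feature-name verbosity,
    # which is what raises the orthogonality concern above
    def __init__(self, verbose_feature_names=True):
        self.verbose_feature_names = verbose_feature_names

class OverridableNamesTransformer(TransformerMixin, BaseEstimator):
    # Option B: a method subclasses can override, keeping feature-name
    # formatting out of the constructor entirely
    def _make_feature_names(self, input_names):
        return ["mytrans__" + name for name in input_names]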

SLEPs in draft missing from index.rst

Several SLEPs have been merged but are not available from the table of contents on Read The Docs. They should either be added, or we should include a complete TOC of all SLEPs.

Augment template

I think we should augment the template to start with a motivation and an example of what the solution will look like to users.

Questions and comments about SLEP006 -- Sample props

I've made a pass over the SLEP. It is overall very clear and the different cases really help. Thanks @jnothman and @adrinjalali for your efforts.

Here's a list of questions and comments that I have after reading it.

we can consider the use of keys that are not limited to strings or valid identifiers (and hence are not limited to using _ as a delimiter).

I don't understand how we are currently limited to using _ as a delimiter. Should this be __? But even then I still don't follow.

TODO: proceed from here. Note that this change implies the need to add
a parameter to unwrap_X, since we will now append an additional column to X.

I don't understand why we don't need to add an additional column to X in this case.

while a GridSearchCV wrapping a Pipeline currently takes parameters with keys like {step_name}__{prop_name}, this explicit routing, and conflict with GridSearchCV routing destinations, implies keys like estimator__{step_name}__{prop_name}.

I'm not sure I completely understand this. A small illustration might help? Also, does "GridSearchCV routing destinations" refer to param_grid instead?
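
An attempt at such an illustration: the first call below is real, pre-routing Pipeline/GridSearchCV behaviour, while the final key is the hypothetical one the quoted sentence implies:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(30, 3)
y = np.tile([0, 1], 15)
w = np.ones(30)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0]})

# Status quo: the fit param names only the pipeline step, and GridSearchCV
# forwards it to its one and only estimator:
grid.fit(X, y, clf__sample_weight=w)

# Under explicit routing, GridSearchCV has several possible destinations
# (its estimator, its scorers), so the key would also have to name the
# child: estimator__clf__sample_weight, as the quoted sentence says.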

All consumers would be required to check that

This sentence seems unfinished?

About solution 4:

Here the meta-estimator provides only what each of its children requests. The meta-estimator would also need to request, on behalf of its children, any prop that descendant consumers require.

I could be wrong, but this doesn't seem to define the "essence" of solution 4, especially compared to solution 3. It seems to me that in solution 3, as well, the meta-estimator only provides what the children need/request:

Each meta-estimator is given a routing specification which it must follow in passing only the required parameters to each of its children

For estimators to be cloned, this request information needs to be cloned with it

I'm not sure I understand this sentence. Seems like the grammar may be wrong?

get_props_request will return a dict

It is said earlier that get_props_request would return a list (and possibly a dict).

One thing that isn't clear to me is the behaviour of solution 4 outside of the use of meta-estimators.

For example, what does this do:

lr = LogisticRegression().set_props_request(['sample_weights'])
lr.fit(X, y, sample_weights=None)  # ?

lr = LogisticRegression().set_props_request([])
lr.fit(X, y, sample_weights=sample_weights)  # ?
  • All examples seem to illustrate how cross_validate is used, but none of them illustrates what a call to fit() may look like. I think this may be useful as well (and might resolve point 8 for me).

**kw syntax will be used to pass props by key.

In the examples of solution 4, **kw is never used. Should it be? Or does this refer to a call to fit?

Other stuff:
  • There are still a bunch of TODOs; we should remove them before calling for a vote.
  • Related: "(likely out of scope) passing sample properties..." The scope of the SLEP should be unambiguous, so maybe we want to remove this bullet point.
  • Not important, but the use of # %% doesn't render in the HTML docs.
  • It would help to have a brief description of the proposed solution at the top, or an "abstract" section, as per our SLEP template.

SLEP 001: why do we need trans_modify?

cc @GaelVaroquaux

Coming back to SLEP 1, I don't see / remember the need for trans_modify.
I'm now not sure why we need this.
The motivation the SLEP gives is:

  • Creating y in a pipeline makes error measurement harder.
  • For some use cases, test time needs to modify the number of samples (for instance, data loading from a file).

I think that makes it much harder and I don't think it's as necessary as the training-time version.

Similarly I'm not sure I understand the motivation for partial_fit_modify.

My main motivation in this would be to distinguish training time and test time, and that only requires a new method that basically replaces fit_transform within a pipeline or other meta-estimator.

Not sure I like fit_modify for that. My thoughts right now would be forward or maybe fit_forward (though that sounds too much like feed-forward; how about feed, lol). modify sounds like an in-place operation to me. D3M uses produce, which is quite generic but might work (probably fit_produce; produce is their version of both predict and transform).
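
Whatever name wins, here is a minimal sketch of the semantics under discussion (the method and class names are hypothetical, not part of any released API): unlike fit_transform, the method may change the number of samples and must therefore return y alongside X:

import numpy as np

class NaNRowDropper:
    """Toy resampler: drops rows containing NaN at training time."""

    def fit_modify(self, X, y):
        # The defining property, whatever the method ends up being called:
        # the returned X and y may have fewer (or more) samples than the input.
        mask = ~np.isnan(X).any(axis=1)
        return X[mask], y[mask]

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 0])
X_out, y_out = NaNRowDropper().fit_modify(X, y)   # two samples remain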
