scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications

Home Page: https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 95.16%, Batchfile 2.76%, Makefile 2.08%

enhancement_proposals's Introduction

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies

scikit-learn requires:

  • Python (>= 3.9)
  • NumPy (>= 1.19.5)
  • SciPy (>= 1.6.0)
  • joblib (>= 1.2.0)
  • threadpoolctl (>= 2.0.0)

Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 1.0 and later require Python 3.7 or newer. scikit-learn 1.1 and later require Python 3.8 or newer.

Scikit-learn plotting capabilities (i.e., functions starting with plot_ and classes ending with Display) require Matplotlib (>= 3.3.4), as does running the examples. A few examples additionally require scikit-image >= 0.17.2, a few require pandas >= 1.1.5, and some require seaborn >= 0.9.0 and plotly >= 5.14.0.

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

pip install -U scikit-learn

or conda:

conda install -c conda-forge scikit-learn

The documentation includes more detailed installation instructions.

Changelog

See the changelog for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Source code

You can check the latest sources with the command:

git clone https://github.com/scikit-learn/scikit-learn.git

Contributing

To learn more about making a contribution to scikit-learn, please see our Contributing guide.

Testing

After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 7.1.2 installed):

pytest sklearn

See the web page https://scikit-learn.org/dev/developers/contributing.html#testing-and-improving-test-coverage for more information.

Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.
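
For example, to run the suite with a fixed seed on a Unix-like shell:

SKLEARN_SEED=42 pytest sklearn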

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

The project is currently maintained by a team of volunteers.

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation

Communication

Citation

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn

enhancement_proposals's People

Contributors

adrinjalali, agramfort, amueller, gaelvaroquaux, glemaitre, jjerphan, jnothman, lorentzenchr, nicolashug, tguillemot, thomasjpfan

enhancement_proposals's Issues

SLEP006 (sample props) should handle when metaestimator consumes the same key as its descendant

From scikit-learn/scikit-learn#21284 (comment):

The case of a metaestimator adding support for a prop that is requested by its child is indeed a tricky one. I can't yet see a way to make this generally backwards compatible within the SLEP006 proposal. This makes me sad.

Indeed, a metaestimator supporting the same prop name as one of its children is generally tricky. That is, if the metaestimator supports metadata x and its child requests metadata x, the metaestimator should only work where either:

  • the child's request aliases x to another name without such a clash; or
  • the child's request and the metaestimator's request for x imply being passed the same keyword argument.

In other cases, this must raise an error. This is something, I'm pretty sure, we've not yet covered in SLEP006 (and it's a pretty messy and intricate consequence of having the caller responsible for delivering metadata in accordance with the request).
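
As a rough sketch of the first escape hatch (aliasing), here is what it looks like with the set_fit_request-style API that later revisions of SLEP006 converged on; this requires scikit-learn >= 1.4 with metadata routing explicitly enabled, so treat the exact calls as illustrative rather than as part of this proposal:

import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

sklearn.set_config(enable_metadata_routing=True)

X = np.random.rand(30, 3)
y = np.tile([0, 1], 15)
w = np.ones(30)

# The child aliases sample_weight to a distinct key ("inner_weight" is
# arbitrary), so a router that handles sample_weight itself sees no clash:
clf = LogisticRegression().set_fit_request(sample_weight="inner_weight")
cross_validate(clf, X, y, params={"inner_weight": w})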

Deprecation would be pretty tricky as far as I can tell.

Scorers might need to know about training and testing data

This is not a PR because I haven't written it yet. It's more of a very loose RFC.

I think scorers might need to be able to distinguish between training and test data.
I think there were more cases, but there are two obvious ones:

  • R^2 is currently computed using the test-set mean. That seems really odd, and breaks for LOO.
  • When doing cross-validation, the classes that are present can change, which can impact things like macro-F1 in weird ways, and can also lead to errors in LOO (scikit-learn/scikit-learn#4546).
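
To make the first point concrete, here is a minimal sketch (plain NumPy, nothing scikit-learn-specific) of why the current definition collapses under leave-one-out:

import numpy as np

y_true = np.array([3.0])   # a single LOO test sample
y_pred = np.array([2.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # 0.25
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 0.0: the test-set mean is the sample itself
# R^2 = 1 - ss_res / ss_tot  -> division by zero, so the score is undefined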

I'm not sure if this is a good enough case yet, but I wanted somewhere to take a note ;)

SLEP007: Meta-estimators section expansion

@amueller has the following concern on the meta-estimators section of the SLEP:

Shouldn't we list all meta-estimators that are transformers? What about FeatureUnion and RFECV? I guess maybe we're talking about meta-estimators that are not feature selectors, because those are easy.

I'm personally not sure. I'd kind of prefer a guideline that can be applied to meta-estimators, rather than listing them in the SLEP.

SLEP 007 feature-name generation: adding a constructor argument for verbosity

Basically, right now SLEP 007 suggests adding a constructor parameter to all transformers that are not feature selectors, right?

I see the motivation and I think it's actually hard to come up with a better solution (I don't have one right now), but I'm also concerned with adding feature name specific things to the constructor. It seems orthogonal to the working of the estimator.

I think the alternative that @adrinjalali suggested was having callbacks for formatting feature names (I don't remember the details tbh), but that was pretty complex.

Maybe we could have a method that people could overwrite or something like that? I'm not sure. To me this is the one contentious part of the SLEP.
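
For concreteness, a hypothetical sketch of the two shapes being weighed here; neither the class nor the parameter name is part of the SLEP:

from sklearn.base import BaseEstimator, TransformerMixin

class VerboseFlagTransformer(TransformerMixin, BaseEstimator):
    # Option A: a constructor flag controlling feature-name verbosity,
    # which is what raises the orthogonality concern above
    def __init__(self, verbose_feature_names=True):
        self.verbose_feature_names = verbose_feature_names

class OverridableNamesTransformer(TransformerMixin, BaseEstimator):
    # Option B: a method subclasses can override, keeping feature-name
    # formatting out of the constructor entirely
    def _make_feature_names(self, input_names):
        return ["mytrans__" + name for name in input_names]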

SLEPs in draft missing from index.rst

Several SLEPs have been merged but are not available from the table of contents on Read The Docs. They should either be added, or we should include a complete TOC of all SLEPs.

Augment template

I think we should augment the template to start with a motivation and an example of what the solution will look like to users.

Questions and comments about SLEP006 -- Sample props

I've made a pass over the SLEP. It is overall very clear and the different cases really help. Thanks @jnothman and @adrinjalali for your efforts.

Here's a list of questions and comments that I have after reading it.

we can consider the use of keys that are not limited to strings or valid identifiers (and hence are not limited to using _ as a delimiter).

I don't understand how we are currently limited to using _ as a delimiter. Should this be __? But even then I still don't follow.

TODO: proceed from here. Note that this change implies the need to add
a parameter to unwrap_X, since we will now append an additional column to X.

I don't understand why we don't need to add an additional column to X in this case.

while a GridSearchCV wrapping a Pipeline currently takes parameters with keys like {step_name}__{prop_name}, this explicit routing, and conflict with GridSearchCV routing destinations, implies keys like estimator__{step_name}__{prop_name}.

I'm not sure I completely understand this. A small illustration might help? Also, does "GridSearchCV routing destinations" refer to param_grid instead?
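
An attempt at such an illustration: the first call below is real, pre-routing Pipeline/GridSearchCV behaviour, while the final key is the hypothetical one the quoted sentence implies:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(30, 3)
y = np.tile([0, 1], 15)
w = np.ones(30)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0]})

# Status quo: the fit param names only the pipeline step, and GridSearchCV
# forwards it to its one and only estimator:
grid.fit(X, y, clf__sample_weight=w)

# Under explicit routing, GridSearchCV has several possible destinations
# (its estimator, its scorers), so the key would also have to name the
# child: estimator__clf__sample_weight, as the quoted sentence says.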

All consumers would be required to check that

This sentence seems unfinished?

About solution 4:

Here the meta-estimator provides only what each of its children requests. The meta-estimator would also need to request, on behalf of its children, any prop that descendant consumers require.

I could be wrong, but this doesn't seem to define the "essence" of solution 4, especially compared to solution 3. It seems to me that in solution 3, as well, the meta-estimator only provides what the children need/request:

Each meta-estimator is given a routing specification which it must follow in passing only the required parameters to each of its children

For estimators to be cloned, this request information needs to be cloned with it

I'm not sure I understand this sentence. Seems like the grammar may be wrong?

get_props_request will return a dict

It is said earlier that get_props_request would return a list (and possibly a dict).

One thing that isn't clear to me is the behaviour of solution 4 outside of the use of meta-estimators.

For example, what does this do:

lr = LogisticRegression().set_props_request(['sample_weights'])
lr.fit(X, y, sample_weights=None)  # ?

lr = LogisticRegression().set_props_request([])
lr.fit(X, y, sample_weights=sample_weights)  # ?
  • All examples seem to illustrate how cross_validate is used, but none of them illustrates what a call to fit() may look like. I think this may be useful as well (and might resolve point 8 for me).

**kw syntax will be used to pass props by key.

In the examples of solution 4, **kw is never used. Should it be? Or does this refer to a call to fit?

Other stuff:
  • There are still a bunch of TODOs; we should remove them before calling for a vote.
  • Related: "(likely out of scope) passing sample properties..." The scope of the SLEP should be unambiguous, so maybe we want to remove this bullet point.
  • Not important, but the use of # %% doesn't render in the HTML docs.
  • It would help to have a brief description of the proposed solution at the top, or an "abstract" section, as per our SLEP template.

SLEP 001: why do we need trans_modify?

cc @GaelVaroquaux

Coming back to SLEP 1, I don't see / remember the need for trans_modify.
I'm now not sure why we need this.
The motivation the SLEP gives is:

  • Creating y in a pipeline makes error measurement harder.
  • For some use cases, test time needs to modify the number of samples (for instance, data loading from a file).

I think that makes it much harder and I don't think it's as necessary as the training-time version.

Similarly I'm not sure I understand the motivation for partial_fit_modify.

My main motivation in this would be to distinguish training time and test time, and that only requires a new method that basically replaces fit_transform within a pipeline or other meta-estimator.

Not sure I like fit_modify for that. My thoughts right now would be forward or maybe fit_forward (though that sounds too much like feed-forward; how about feed, lol). modify sounds like an in-place operation to me. D3M uses produce, which is quite generic but might work (probably fit_produce; produce is their version of both predict and transform).
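
Whatever name wins, here is a minimal sketch of the semantics under discussion (the method and class names are hypothetical, not part of any released API): unlike fit_transform, the method may change the number of samples and must therefore return y alongside X:

import numpy as np

class NaNRowDropper:
    """Toy resampler: drops rows containing NaN at training time."""

    def fit_modify(self, X, y):
        # The defining property, whatever the method ends up being called:
        # the returned X and y may have fewer (or more) samples than the input.
        mask = ~np.isnan(X).any(axis=1)
        return X[mask], y[mask]

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 0])
X_out, y_out = NaNRowDropper().fit_modify(X, y)   # two samples remain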
