scikit-learn-contrib / scikit-matter

A collection of scikit-learn compatible utilities that implement methods born out of the materials science and chemistry communities

Home Page: https://scikit-matter.readthedocs.io/en/v0.2.0/

License: BSD 3-Clause "New" or "Revised" License


scikit-matter's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization for gathering high-quality scikit-learn compatible projects. It also provides a template for establishing new scikit-learn compatible projects.

Vision

With the explosion in the number of machine learning papers, it becomes increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own purposes. The goal of scikit-learn-contrib is to provide easy-to-install and easy-to-use high-quality machine learning software. With scikit-learn-contrib, users can install a project with pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict, and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.
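
Any compliant estimator drops straight into this workflow. As a minimal sketch, using a built-in scikit-learn estimator as a stand-in for a contrib project:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # stand-in for a contrib estimator
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
search = GridSearchCV(model, {"logisticregression__C": [0.1, 1.0]}).fit(X, y)
predictions = search.predict(X)  # the usual fit/predict interface throughout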

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

A simple but efficient density-based clustering algorithm that can find clusters of arbitrary size, shape, and density in two dimensions. Higher-dimensional data are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter, K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by Mohamed Abbas.

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

Python module to perform under-sampling and over-sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

A library of sklearn compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner.

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola.

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez.

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens.

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.

scikit-matter's People

Contributors

agoscinski, arthur-lin1027, bhelfrecht, ceriottm, hurricane642, luthaf, picocentauri, rosecers, sanggyuchong, saswatnayak1998, spozdn, victorprincipe


scikit-matter's Issues

RidgeRegression2FoldCV cutoff regularization does not cut eigenvalues off

https://github.com/cosmo-epfl/scikit-cosmo/blob/43658d3944e491a847ec571891b4b1375daa6d61/skcosmo/linear_model/_ridge.py#L197
https://github.com/cosmo-epfl/scikit-cosmo/blob/43658d3944e491a847ec571891b4b1375daa6d61/skcosmo/linear_model/_ridge.py#L204
Should be sum and not len. By using len, the whole array size is used rather than only the number of entries that are True.

A simple bug, but it means the "cutoff" does not cut off any eigenvalues, so no regularization is applied except the numerical one due to rcond, which is very small.
This also affects the results of the reconstruction measures, since they use the "cutoff" regularization type.
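
A minimal sketch of the difference, with hypothetical variable names (not the actual _ridge.py code):

import numpy as np

eigenvalues = np.array([10.0, 1.0, 1e-8, 1e-12])
mask = eigenvalues > 1e-6       # eigenvalues to keep after the cutoff

n_kept_wrong = len(mask)        # 4: the full array size, so nothing is cut off
n_kept_right = int(mask.sum())  # 2: only the eigenvalues above the cutoff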

Examples and tutorials in documentation, technical choices

Splitting up the discussion from #4 (comment).

We will want to have examples and tutorials associated with this repository available somewhere. The goal of this issue is to discuss where such examples should live and which technological choices we want to make.

The first and most obvious solution is to use Jupyter notebooks, because they allow mixing code, explanations, and figures in a single, easy-to-edit document. However, they have one main drawback: how do we avoid the usual problem of small spurious changes (Python version, execution_count, etc.) eating up review time?

I see two options: either we try to keep the notebooks in a "clean" state all the time, for which nbstripout is part of the answer but not all of it (for example, it does not remove the "last Python version used" field); or we don't care, and ignore these changes when reviewing.

If we are using notebooks for examples, I would actually tend to prefer keeping the output, so that just viewing the notebook in the GitHub web interface or with Binder already shows all plots. We would still have to be a bit careful about notebook size.

Another alternative is to put the example notebooks in kernel-tutorials (or another repo) instead of this one. We could still have plain Python examples in sklearn-cosmo.

What do you think?

Improvements of notebooks related to PR #9

These notebooks are at the moment rudimentary and not beginner-friendly, and therefore need some improvement. This concerns the following notebooks:

  • linear_model/plot_orthogonal_regression_nonanalytic_behavior.ipynb is a bit hard to understand
  • plot_gfrm.ipynb could include an explanation of the different inputs for the reconstruction measures
  • plot_lfre.ipynb needs more explanation

In addition, one example on the usage of reconstruction measures is missing:

  • plot_pointwise_gfre.py has been left out because it requires RKHS features, which make things more complicated

Check support for librascal's structure manager object in a general PseudoPoint/SparsePoint kernel class

With "general PseudoPoint/SparsePoint kernel class" I mean a class which also accepts features as array and librascal structure manager

Other possible issues about this implementation:

  • can we put the sparse points into the fit method? We would need multiple inputs to the fit method, but that conflicts with other scikit-learn utilities (GridSearchCV). The same holds if we make X in fit a tuple (X_N, X_M), where X_M are the pseudo/sparse points. My hack was to put the pseudo points into the init parameters (see the sketch below)
  • can we integrate gradients?
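
A minimal sketch of the "pseudo points in the init parameters" workaround, with hypothetical names (not an actual skcosmo class):

from sklearn.base import BaseEstimator

class SparseKernelModel(BaseEstimator):
    def __init__(self, pseudo_points=None):
        # X_M is stored as a hyperparameter, so fit keeps the usual (X, y)
        # signature that GridSearchCV and pipelines expect
        self.pseudo_points = pseudo_points

    def fit(self, X, y):
        # build the kernel between X and self.pseudo_points here
        ...
        return self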

Integrate pandas DataFrame support for datasets

Resulting issue from the discussion in #18 (comment)

Certain datasets in sklearn offer an as_frame argument, which returns the dataset's data and target as pandas objects (e.g. load_iris), while others (e.g. load_boston) do not. We should offer this for the datasets here too, when it makes sense. Currently, I do not understand why it is not supported for all datasets in sklearn, so there might be issues for certain types of data. This should be checked.
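
For reference, a minimal example of the sklearn behavior we would mirror:

from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
print(type(bunch.data))    # pandas DataFrame instead of a numpy array
print(type(bunch.target))  # pandas Series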

correct pip install syntax

For me, installing with pip from GitHub had to be

pip install https://github.com/cosmo-epfl/scikit-cosmo/archive/refs/heads/main.zip

The one mentioned in the README does not work:

pip install https://github.com/cosmo-epfl/scikit-cosmo
Collecting https://github.com/cosmo-epfl/scikit-cosmo
Downloading https://github.com/cosmo-epfl/scikit-cosmo
- 180 kB 11.9 MB/s
ERROR: Cannot unpack file /tmp/pip-unpack-wv9lqln4/scikit-cosmo (downloaded from /tmp/pip-req-build-jcs6rwpu, content-type: text/html; charset=utf-8); cannot detect archive format
ERROR: Cannot determine archive format of /tmp/pip-req-build-jcs6rwpu

Reconstruction measures break with Pandas

I use Pandas to handle my data. I tried using the reconstruction measures in skcosmo.metrics on this data, and I got the following error:

>>> res = GRE(X, y)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-20-49fdd3baf61d> in <module>
----> 1 res = GRE(X, y)

~/.local/lib/python3.8/site-packages/skcosmo-0.1.0rc2-py3.8.egg/skcosmo/metrics/_reconstruction_measures.py in pointwise_global_reconstruction_error(X, Y, train_idx, test_idx, scaler, estimator)
     84     )
     85     X_train, X_test, Y_train, Y_test = (
---> 86         X[train_idx],
     87         X[test_idx],
     88         Y[train_idx],

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   3028             if is_iterator(key):
   3029                 key = list(key)
-> 3030             indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
   3031 
   3032         # take() does not accept boolean indexers

/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1264             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1265 
-> 1266         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1267         return keyarr, indexer
   1268 

/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1306             if missing == len(indexer):
   1307                 axis_name = self.obj._get_axis_name(axis)
-> 1308                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1309 
   1310             ax = self.obj._get_axis(axis)

KeyError: "None of [Int64Index([7244, 6396, 8502, 7776, 8453, 3175, 3288, 4566, 9549, 8645,\n            ...\n            2898, 7536, 1362, 3751, 2530, 5451, 4733, 6878, 5054, 2796],\n           dtype='int64', length=5110)] are in the [columns]"

I was able to fix this by casting the input parameters to numpy arrays:

res = GRE(X.to_numpy(), y.to_numpy())

I don't know if Pandas is going to be supported, but until then it might be helpful to mention this limitation in the documentation.

Examples are not tested

We intend to run the examples in examples/ in CI to ensure they stay up to date, but our setup leaves a lot to be desired.

  • We are currently trying to run the Jupyter notebooks using nbconvert --execute, but for some reason I could not understand, the resulting scripts are not executed. The nbconvert command works fine locally, so I guess this comes down to some configuration difference.

  • #58 introduces pure Python scripts as examples, which are currently not tested at all. We would need to add them to the tox setup as well.

Some tests seem to run multiple times

From https://github.com/lab-cosmo/scikit-cosmo/runs/6552874752?check_suite_focus=true

GLOB sdist-make: /home/runner/work/scikit-cosmo/scikit-cosmo/setup.py
tests create: /home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests
tests installdeps: coverage[toml], parameterized
tests inst: /home/runner/work/scikit-cosmo/scikit-cosmo/.tox/.tmp/package/1/skcosmo-0.1.1.zip
tests installed: coverage==6.4,joblib==1.1.0,numpy==1.22.4,parameterized==0.8.1,scikit-learn==1.1.1,scipy==1.8.1,skcosmo @ file:///home/runner/work/scikit-cosmo/scikit-cosmo/.tox/.tmp/package/1/skcosmo-0.1.1.zip,threadpoolctl==3.1.0,tomli==2.0.1
tests run-test-pre: PYTHONHASHSEED='920532320'
tests run-test: commands[0] | coverage run -m unittest discover -p '*.py'
....................................................../home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/sklearn/linear_model/_ridge.py:251: UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.
  warnings.warn(
.............................................................................../home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/utils/_orthogonalizers.py:57: UserWarning: Column vector contains only zeros.
  warnings.warn("Column vector contains only zeros.")
.........../home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.05 reached.Terminating search at 7 / 10.
  warnings.warn(
/home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.4 reached.Terminating search at 6 / 10.
  warnings.warn(
............................/home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.05 reached.Terminating search at 7 / 10.
  warnings.warn(
/home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.4 reached.Terminating search at 6 / 10.
  warnings.warn(
........./home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.05 reached.Terminating search at 7 / 10.
  warnings.warn(
/home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.4 reached.Terminating search at 6 / 10.
  warnings.warn(
.
----------------------------------------------------------------------
Ran 182 tests in 69.540s

OK

The UserWarning "Score threshold of 0.4 reached. Terminating search at 6 / 10." should only appear once, but shows up three times.

Changing https://github.com/lab-cosmo/scikit-cosmo/blob/f433e28a8e5ff13f2dd1aa9a9f2dd2fc3f606218/pyproject.toml#L19 to coverage run -m unittest discover -p "test_sample_simple_fps.py" makes the test run only once, so there might be some strange interaction between unittest discover and coverage.

Add doctests / inline examples in API reference

We already have some end-to-end examples/tutorials. These show how to accomplish a high-level goal with this library ("how do I create a projection of my dataset using KPCovR").

It would be nice to also have "inline" examples inside the reference documentation, showing how to use each function separately ("how do I call FPS.fit with warm_start=True"). For a good example of what this could look like, see https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.transform.

The standard tool for adding such examples in Python is doctest: one writes the input code as if at a Python prompt (>>>), together with the corresponding output; tools can then run the input code and check that the output matches. See the corresponding module in the Python standard library: https://docs.python.org/3/library/doctest.html
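
A minimal, generic doctest sketch (not taken from skcosmo):

def add(a, b):
    """Return the sum of a and b.

    >>> add(1, 2)
    3
    """
    return a + b

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs the >>> examples and compares their output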

The first task for this issue would be to make sure doctests are part of the test suite checked in CI.

We already have some classes with such examples (https://scikit-cosmo.readthedocs.io/en/latest/preprocessing.html#skcosmo.preprocessing.flexible_scaler.KernelNormalizer), but I think all classes should come with such a small example.

The second task for this issue would be to add doctests/examples to all classes in skcosmo.

Missing codecov status on PR

If you click on "show all checks" near the green CI status, the codecov check is missing. Compare this with #38, which still has the check.

Looking through old PRs, the last one with the check was #42, and the first one without was #44.

Originally posted by @Luthaf in #59 (comment)

Pick a license

Before others can start using the code in here, we need to pick a license. I like the BSD 3-Clause license used by sklearn:

BSD 3-Clause License

Copyright (c) 2007-2020 The scikit-learn developers.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

If nobody is opposed to this, I'll add it to this repo!

Setting PCovR regularization is unintuitive

It looks to me like passing the argument alpha when initializing PCovR doesn't really do anything. Instead, if one wants to regularize the estimator, a non-default estimator has to be provided. Passing alpha at instantiation should instead set the estimator's alpha.

It might also be a good idea to rename estimator to regressor, in line with sklearn's TransformedTargetRegressor.
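
A hedged sketch of the current workaround versus the proposed behavior (parameter names are taken from this issue, not from a released API):

from sklearn.linear_model import Ridge
from skcosmo.decomposition import PCovR

# current workaround: regularization only takes effect through the estimator
pcovr = PCovR(mixing=0.5, estimator=Ridge(alpha=1e-3))

# proposed: PCovR(mixing=0.5, alpha=1e-3) would forward alpha to the default
# estimator, and `estimator` would be renamed `regressor`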

Proposed Organization Chart

Before we start migrating from kernel-tutorials, let's decide on an organizational structure to keep things clean.

Here is the proposed plan:

└─── benchmarks
└─── doc
└─── examples
│   README.rst
│   setup.py (to install sklearn-cosmo to python path)
└─── skcosmo
    │   __init__.py
    └─── chemiscope
    │   │    files to convert outputs to chemiscope
    └─── PCovR
    │   │    _base.py
    │   │   KPCovR.py
    │   │   PCovR.py
    │   │   SparseKPcovR.py
    └─── plotting?
    └─── preprocessing
    │   │   kernels.py
    │   │   scalers.py
    └─── selection
    │   │   _base.py
    │   │   CUR.py (incl. PCov-CUR)
    │   │   FPS.py (incl. PCov-FPS)
    └─── sparse (to include SparseKPCovR here or with KPCovR?)
    │   │   SparseKRR.py
    │   │   SparseKPCA.py
    └─── utils
    │   │   general.eig_inv
    │   │   general.normalize_matrix
    │   │   general.center_matrix

Thoughts?

Calculation of the PCovR covariance is incredibly expensive

https://github.com/cosmo-epfl/scikit-cosmo/blob/e078c0efa5546e22274b2fc1432d37199994c8f8/skcosmo/utils/pcovr_utils.py#L60-L77

Since we are using an SVD just to invert the covariance, this is an incredibly expensive operation for tall X, as it also computes the eigenvectors of the Gram matrix, which we don't need. Since we are using the decomposition to invert the covariance, we probably want the whole spectrum (rather than just the top n eigenvectors/eigenvalues), and it would probably be a better idea here to use something like eigh(X.T @ X), at least in the rank >= min(X.shape) case.
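
A rough sketch of the suggested alternative (generic code, not the pcovr_utils implementation):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 50))  # tall matrix: n_samples >> n_features

# eigendecompose the small (n_features x n_features) covariance directly,
# avoiding a full SVD of X and the unneeded Gram-matrix eigenvectors
evals, evecs = eigh(X.T @ X)
keep = evals > 1e-12 * evals.max()     # discard numerically-zero eigenvalues
cov_inv = (evecs[:, keep] / evals[keep]) @ evecs[:, keep].T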

Re-organize tests into module sub-folders

A continuation of #94, in response to #49. Tests should be moved to the corresponding module subfolders. The hitch is tox discovery, which currently only looks at ./tests/*.py. The steps are:

  • Move tests to appropriate subfolders
  • Modify tox framework to find all tests
  • Verify proper CI behavior.

Anything else?

Same index is present multiple times in FPS output during sample selection

Running this code

import numpy as np
import skcosmo.sample_selection

X = np.load("power-spectrum.npy")
print("shape =", X.shape)
print()

fps = skcosmo.sample_selection.FPS(n_to_select=35)
fps.fit(X)
print("selected_idx_ =", fps.selected_idx_)

outputs

shape = (640, 320)

selected_idx_ = [  0 247  45 105  16  56 176  25   9  38 152  72  54 192 131  88  64 168
 212 189 166 202  83 145  96 113 142 125 217 226 233   9   9   9   9]

There are enough samples to select more than 35 of them, but the last one is repeated multiple times in the output, which is unexpected. I'll try to double-check the data to see if we are creating the same sample multiple times for some reason.

power-spectrum.npy.zip

Inconsistent behaviour of selectors when n_select_features > rank(all_features)

CUR should not throw an error if n_select_features > rank(all_features)

import numpy as np
from skcosmo.selection import CUR

X = np.ones((10,2))

n_features = 1
fs = CUR.FeatureCUR(X)
fs.select(n_features)
print("No error because rank of X is 1")

# Error because rank of X is < 2
n_features = 2
fs = CUR.FeatureCUR(X)
fs.select(n_features)

with error output

  [...]
  File "/home/goscinsk/miniconda3/lib/python3.8/site-packages/skcosmo/utils/orthogonalizers.py", line 29, in X_orthogonalizer
    raise ValueError("Cannot orthogonalize by a null vector.")
ValueError: Cannot orthogonalize by a null vector.

For my use cases, I would be fine with CUR emitting a warning and adding random features. An additional option to stop the feature selection in this case would also be nice.
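
A rough sketch of the requested warn-and-pad behavior (helper and variable names are hypothetical):

import warnings
import numpy as np

def pad_selection(selected, n_requested, n_features, rng):
    # pad with random unselected features once the matrix rank is exhausted
    if len(selected) < n_requested:
        warnings.warn("Matrix rank reached; padding selection with random features.")
        remaining = np.setdiff1d(np.arange(n_features), selected)
        pad = rng.choice(remaining, n_requested - len(selected), replace=False)
        selected = np.concatenate([selected, pad])
    return selected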

Pre-fitted regressors with KPCovR

#98 implements, for PCovR, instantiating the class with a potentially pre-fitted regressor. For consistency, it would be nice to have KPCovR use the same approach, instead of instantiating with the regularization and taking Yhat and W as arguments to fit.

Tolerance in StandardFlexibleScaler should be relative

I have these properties I want to scale before using KPCovR:

Y = [
 [5.500e+00 2.400e-29]
 [3.600e+00 2.270e-28]
 [3.750e+00 2.640e-28]
 [3.020e+00 6.700e-29]
 [9.175e-01 8.300e-32]
]

One column only contains very small values, leading to the following exception, which is a bit surprising:

Traceback (most recent call last):
  File "do-kpocvr.py", line 119, in <module>
    Y = StandardFlexibleScaler(column_wise=True).fit_transform(Y)
  File "/opt/miniconda3/lib/python3.7/site-packages/skcosmo/preprocessing/flexible_scaler.py", line 86, in fit_transform
    self.fit(X, y)
  File "/opt/miniconda3/lib/python3.7/site-packages/skcosmo/preprocessing/flexible_scaler.py", line 50, in fit
    raise ValueError("Cannot normalize a feature with zero variance")
ValueError: Cannot normalize a feature with zero variance

I'll change the tol parameter to make this work for now, but I think that instead of checking the absolute value of the variance against the tolerance, we should test the relative variance, i.e. replace np.any(var < self.tol) with np.any(var / np.mean(Y, axis=0) < self.tol) in https://github.com/cosmo-epfl/scikit-cosmo/blob/64969332914aa956e943c8f1824699731510d72d/skcosmo/preprocessing/flexible_scaler.py#L60-L62

PCovR small singular values should be handled more consistently

The handling of small singular values isn't particularly consistent between sample_space and feature_space. This leads to issues particularly when doing sample-space PCovR with the mixing set to zero and the number of components greater than the number of regression targets: the singular values less than tol don't seem to actually get thrown out (or set to zero), and the prediction becomes unstable, yielding completely different results between repeated fit calls on the same data.
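
For illustration, a generic way to discard small singular values consistently (a sketch, not the skcosmo code; the threshold is taken relative to the largest singular value):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5)) * np.array([1.0, 1.0, 1.0, 1e-10, 1e-12])
tol = 1e-8

U, S, Vt = np.linalg.svd(X, full_matrices=False)
keep = S > tol * S[0]                     # S is sorted descending, S[0] is largest
U, S, Vt = U[:, keep], S[keep], Vt[keep]  # actually drop the small singular values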

Flaky failure in test_kpcovr.py

Good afternoon! When running the tests, I found some strange behavior: rarely (about 1 run out of 10-20), test_kpcovr.py gives an error:
Traceback (most recent call last):
File "/home/runner/work/scikit-cosmo/scikit-cosmo/tests/test_kpcovr.py", line 248, in test_linear_matches_pcovr
self.assertEqual(
AssertionError: 0.75 != 0.751

We probably need to allow a larger tolerance for the difference there.
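
A minimal sketch of such a tolerant comparison (test and variable names are hypothetical stand-ins):

import unittest

class ToleranceExample(unittest.TestCase):
    def test_scores_close(self):
        pcovr_score, kpcovr_score = 0.750, 0.751  # values from the failing run
        # passes as long as the scores agree to about two decimal places
        self.assertAlmostEqual(pcovr_score, kpcovr_score, places=2)

if __name__ == "__main__":
    unittest.main()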

GreedySelector cannot be serialized with pickle

I'm using pickle to save the full state of a model to disk and load it later.

Unfortunately, GreedySelector does not currently support pickle:

from skcosmo.feature_selection import FPS
import pickle

fps = FPS(n_features_to_select=23)

# ... do stuff ...

with open("saved-fps.pickle", "wb") as fd:
    pickle.dump(fps, fd)

This results in

Traceback (most recent call last):
  File "pickle-test.py", line 9, in <module>
    pickle.dump(fps, fd)
AttributeError: Can't pickle local object 'GreedySelector.__init__.<locals>.<lambda>'

I think we should add support for pickle, since it is a tool commonly used in Python projects.

We should also check that all other classes support being serialized with pickle. If needed, we can customize the code used to save/load a given class: https://docs.python.org/3/library/pickle.html#pickle-inst
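
For illustration, a minimal sketch of why the pickling fails and the usual fix (not the GreedySelector code): locally defined lambdas cannot be pickled, while module-level functions can.

import pickle

def default_score(x):  # module-level function: pickled by reference
    return sum(x)

class Selector:
    def __init__(self):
        self.score = default_score
        # self.score = lambda x: sum(x)  # AttributeError: Can't pickle local object

pickle.dumps(Selector())  # succeeds with the module-level function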

Dataset handling

I noticed that there's a lot of discussion in several PRs on how to incorporate datasets, so perhaps it's best to discuss it as a general issue. I think the best option would be a separate, persistent storage solution to host them, plus a script to download them; Materials Cloud or Zenodo come to mind. We can discuss with Giovanni Pizzi whether we should rely on a one-DOI-one-dataset construct, or have a schema to access individual files within a record.

Folder and file naming convention

To be closer to the naming conventions of sklearn, some propositions for renaming:

Subpackage naming:

  • pcovr -> decomposition
  • selection -> feature_selection

Filename convention in subpackages:

  • filename.py -> _filename.py (this is not that important, I guess)
  • base classes of a subpackage go into a _base.py

Test naming convention:

  • subpackage/test_filename.py (I wouldn't be strict about the class structure; one test class might cover multiple classes)

Redesign of DCH to separate interpolator and hull algorithm

I discussed with @victorprincipe a possible future redesign of the DCH to provide additional functionality and more flexibility. At the moment it is only used as a scorer. It is hard to make an estimator out of the DCH because, to be consistent with the score function, we would need to predict X_HD; but in the current fit function, X_HD is part of the X input, so predicting X_HD would make fit and predict inconsistent.

So we thought about separating the current DCH into the directional convex hull part and the interpolator part, as two separate classes that are then brought together:

class DCH(TransformerMixin, BaseEstimator) # maybe ClusterMixin
  def __init__(self, low_dim_idx)
  def fit(self, X_LD, y)
  def predict(self, X_LD) # -> y (linear interpolation on directional convex hull)
  def transform(self, X_LD) # -> X[self.selected_idx_] (directional convex hull vertices)
  
class Interpolator(BaseEstimator):
  def __init__(self, interpolator_type)
  def fit(self, X_LD, X_HD)
  def predict(self, X_LD) # -> X_HD

# notebook example
dch = DCH().fit(X_LD, y)
interp = Interpolator().fit(dch.transform(X_LD), X_HD)
interp.score(X_LD, X_HD)

I feel like the last part cannot be abstracted and has to remain a recipe in an example notebook, because it is such a specific use case.

Setup documentation and deploy it

We will need to have documentation available online. I suggest we go with the standard Python documentation tool, i.e. Sphinx, which can extract and render docstrings.

We will need the following parts of documentation:

  • Introduction and how to install & use the code
  • Tutorials, i.e. how to use the code in simple cases. This is currently done in notebooks in kernel-tutorials; we could refer to these notebooks in one way or another for tutorials. EDIT: kernel-tutorials is different; what we need here are "How To" examples
  • API reference, i.e. how to use the code in detail. This will be taken from the docstrings

Regarding deployment, the two main options are GitHub Pages (the docs would live at https://cosmo-epfl.github.io/sklearn-cosmo/ by default) or Read the Docs (https://sklearn-cosmo.readthedocs.io/ by default).

If anyone wants to have a go at setting up the documentation, I'm happy to mentor them. Otherwise, I'll do something in the next few weeks.
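
As a starting point, a minimal sketch of what the Sphinx configuration could look like (values are placeholders, not the project's actual conf.py):

# conf.py -- minimal Sphinx configuration sketch
project = "scikit-cosmo"
extensions = [
    "sphinx.ext.autodoc",   # pull the API reference from docstrings
    "sphinx.ext.napoleon",  # parse NumPy/Google-style docstrings
]
html_theme = "sphinx_rtd_theme"  # readthedocs-style theme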

Feature Request: batched approach to PCov-based selectors

For applications in which reconstructing the modified Gram matrix at each selection step is not practical, it would be great to have the option of selecting batches of features/samples at a time before reconstructing the modified Gram matrix.
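
A rough sketch of the idea, with hypothetical score and update callables standing in for the selector's scoring and Gram-matrix orthogonalization steps (not the skcosmo API):

import numpy as np

def batched_select(score, update, X, n_to_select, batch_size):
    selected = []
    while len(selected) < n_to_select:
        scores = score(X)                # score every candidate once
        scores[selected] = -np.inf       # mask already-selected indices
        k = min(batch_size, n_to_select - len(selected))
        batch = np.argsort(scores)[-k:]  # take the top-k in one shot
        selected.extend(batch.tolist())
        X = update(X, batch)             # one Gram-matrix update per batch
    return np.array(selected)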

PCovR is not centering like PCA

import numpy as np
from skcosmo.decomposition import PCovR
from sklearn.decomposition import PCA

np.random.seed(0)

X = np.random.rand(10,3)+np.array([1,1,1])
y = np.random.rand(10,1) 

# transformation on non centered X
pcovr = PCovR(mixing=1).fit(X, y)
pca = PCA().fit(X)
X_povr_t = pcovr.transform(X)
X_pca_t = pca.transform(X)

# transformation on centered X
X -= np.mean(X,axis=0)[None,:]
pcovr = PCovR(mixing=1).fit(X, y)
pca = PCA().fit(X)
X_povr_t_c = pcovr.transform(X)
X_pca_t_c = pca.transform(X)

print("Difference in transformation between noncentered and centered data")
print("sklearn PCA", np.linalg.norm(X_pca_t-X_pca_t_c))
print("skcosmo PCovR (alpha=1) = PCA", np.linalg.norm(X_povr_t-X_povr_t_c))
Out:
Difference in transformation between noncentered and centered data
sklearn PCA 1.3922169130898628e-15
skcosmo PCovR (alpha=1) = PCA 2.3242628802840968

Computing the mean here https://github.com/lab-cosmo/scikit-cosmo/blob/7ef05ebc73e5ef016d53f3aee1069333c07a9933/skcosmo/decomposition/_pcovr.py#L271
but the features are never actually centered.

EDIT: this does not affect any results, because we always apply the standardizer beforehand.
