scikit-learn-contrib / scikit-matter

A collection of scikit-learn compatible utilities that implement methods born out of the materials science and chemistry communities

Home Page: https://scikit-matter.readthedocs.io/en/v0.2.0/

License: BSD 3-Clause "New" or "Revised" License


scikit-matter's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization for gathering high-quality scikit-learn compatible projects. It also provides a template for establishing new scikit-learn compatible projects.

Vision

With the explosion in the number of machine learning papers, it becomes increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own purposes. The goal of scikit-learn-contrib is to provide easy-to-install and easy-to-use high-quality machine learning software. With scikit-learn-contrib, users can install a project with pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict, and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.
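
Any compliant estimator drops straight into this workflow. As a minimal sketch, using a built-in scikit-learn estimator as a stand-in for a contrib project:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # stand-in for a contrib estimator
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
search = GridSearchCV(model, {"logisticregression__C": [0.1, 1.0]}).fit(X, y)
predictions = search.predict(X)  # the usual fit/predict interface throughout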

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

A simple but efficient density-based clustering algorithm that can find clusters of arbitrary size, shape, and density in two dimensions. Higher-dimensional data are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter, K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by Mohamed Abbas.

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

Python module to perform under-sampling and over-sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

A library of sklearn compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner.

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola.

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez.

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens.

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.

scikit-matter's People

Contributors

agoscinski, arthur-lin1027, bhelfrecht, ceriottm, hurricane642, luthaf, picocentauri, rosecers, sanggyuchong, saswatnayak1998, spozdn, victorprincipe


scikit-matter's Issues

RidgeRegression2FoldCV cutoff regularization does not cut eigenvalues off

https://github.com/cosmo-epfl/scikit-cosmo/blob/43658d3944e491a847ec571891b4b1375daa6d61/skcosmo/linear_model/_ridge.py#L197
https://github.com/cosmo-epfl/scikit-cosmo/blob/43658d3944e491a847ec571891b4b1375daa6d61/skcosmo/linear_model/_ridge.py#L204
Should be sum and not len. By using len, the whole array size is used rather than only the number of entries that are True.

A simple bug, but it means the "cutoff" does not cut off any eigenvalues, so no regularization is applied except the numerical one due to rcond, which is very small.
This also affects the results of the reconstruction measures, since they use the "cutoff" regularization type.
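
A minimal sketch of the difference, with hypothetical variable names (not the actual _ridge.py code):

import numpy as np

eigenvalues = np.array([10.0, 1.0, 1e-8, 1e-12])
mask = eigenvalues > 1e-6       # eigenvalues to keep after the cutoff

n_kept_wrong = len(mask)        # 4: the full array size, so nothing is cut off
n_kept_right = int(mask.sum())  # 2: only the eigenvalues above the cutoff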

Examples and tutorials in documentation, technical choices

Splitting up the discussion from #4 (comment).

We will want to have examples and tutorials associated with this repository available somewhere. The goal of this issue is to discuss where such examples should live and which technological choices we want to make.

The first and most obvious solution is to use Jupyter notebooks, because they allow mixing code, explanations, and figures in a single, easy-to-edit document. However, they have one main drawback: how do we avoid the usual problem of small spurious changes (Python version, execution_count, etc.) eating up review time?

I see two options: either we try to keep the notebooks in a "clean" state all the time, for which nbstripout is part of the answer but not all of it (for example, it does not remove the "last Python version used" field); or we don't care, and ignore these changes when reviewing.

If we are using notebooks for examples, I would actually tend to prefer keeping the output, so that just viewing the notebook in the GitHub web interface or with Binder already shows all plots. We would still have to be a bit careful about notebook size.

Another alternative is to put the example notebooks in kernel-tutorials (or another repo) instead of this one. We could still have plain Python examples in sklearn-cosmo.

What do you think?

Improvements of notebooks related to PR #9

These notebooks are at the moment rudimentary and not beginner-friendly, and therefore need some improvement. This concerns the following notebooks:

  • linear_model/plot_orthogonal_regression_nonanalytic_behavior.ipynb is a bit hard to understand
  • plot_gfrm.ipynb could include an explanation of the different inputs for the reconstruction measures
  • plot_lfre.ipynb needs more explanation

In addition, one example on the usage of reconstruction measures is missing:

  • plot_pointwise_gfre.py has been left out because it requires RKHS features, which make things more complicated

Check support for librascal's structure manager object in a general PseudoPoint/SparsePoint kernel class

With "general PseudoPoint/SparsePoint kernel class" I mean a class which also accepts features as array and librascal structure manager

Other possible issues about this implementation:

  • can we put the sparse points into the fit method? We would need multiple inputs to the fit method, but that conflicts with other scikit-learn utilities (GridSearchCV). The same holds if we make X in fit a tuple (X_N, X_M), where X_M are the pseudo/sparse points. My hack was to put the pseudo points into the init parameters (see the sketch below)
  • can we integrate gradients?
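
A minimal sketch of the "pseudo points in the init parameters" workaround, with hypothetical names (not an actual skcosmo class):

from sklearn.base import BaseEstimator

class SparseKernelModel(BaseEstimator):
    def __init__(self, pseudo_points=None):
        # X_M is stored as a hyperparameter, so fit keeps the usual (X, y)
        # signature that GridSearchCV and pipelines expect
        self.pseudo_points = pseudo_points

    def fit(self, X, y):
        # build the kernel between X and self.pseudo_points here
        ...
        return self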

Integrate pandas DataFrame support for datasets

Resulting issue from the discussion in #18 (comment)

Certain datasets in sklearn offer an as_frame argument, which returns the dataset's data and target as pandas objects (e.g. load_iris), while others (e.g. load_boston) do not. We should offer this for the datasets here too, when it makes sense. Currently, I do not understand why it is not supported for all datasets in sklearn, so there might be issues for certain types of data. This should be checked.
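
For reference, a minimal example of the sklearn behavior we would mirror:

from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
print(type(bunch.data))    # pandas DataFrame instead of a numpy array
print(type(bunch.target))  # pandas Series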

correct pip install syntax

For me, installing with pip from GitHub had to be

pip install https://github.com/cosmo-epfl/scikit-cosmo/archive/refs/heads/main.zip

The one mentioned in the README does not work:

pip install https://github.com/cosmo-epfl/scikit-cosmo
Collecting https://github.com/cosmo-epfl/scikit-cosmo
Downloading https://github.com/cosmo-epfl/scikit-cosmo
- 180 kB 11.9 MB/s
ERROR: Cannot unpack file /tmp/pip-unpack-wv9lqln4/scikit-cosmo (downloaded from /tmp/pip-req-build-jcs6rwpu, content-type: text/html; charset=utf-8); cannot detect archive format
ERROR: Cannot determine archive format of /tmp/pip-req-build-jcs6rwpu

Reconstruction measures break with Pandas

I use Pandas to handle my data. I tried using the reconstruction measures in skcosmo.metrics on this data, and I got the following error:

>>> res = GRE(X, y)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-20-49fdd3baf61d> in <module>
----> 1 res = GRE(X, y)

~/.local/lib/python3.8/site-packages/skcosmo-0.1.0rc2-py3.8.egg/skcosmo/metrics/_reconstruction_measures.py in pointwise_global_reconstruction_error(X, Y, train_idx, test_idx, scaler, estimator)
     84     )
     85     X_train, X_test, Y_train, Y_test = (
---> 86         X[train_idx],
     87         X[test_idx],
     88         Y[train_idx],

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   3028             if is_iterator(key):
   3029                 key = list(key)
-> 3030             indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
   3031 
   3032         # take() does not accept boolean indexers

/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1264             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1265 
-> 1266         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1267         return keyarr, indexer
   1268 

/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1306             if missing == len(indexer):
   1307                 axis_name = self.obj._get_axis_name(axis)
-> 1308                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1309 
   1310             ax = self.obj._get_axis(axis)

KeyError: "None of [Int64Index([7244, 6396, 8502, 7776, 8453, 3175, 3288, 4566, 9549, 8645,\n            ...\n            2898, 7536, 1362, 3751, 2530, 5451, 4733, 6878, 5054, 2796],\n           dtype='int64', length=5110)] are in the [columns]"

I was able to fix this by casting the input parameters to numpy arrays:

res = GRE(X.to_numpy(), y.to_numpy())

I don't know if Pandas is going to be supported, but until then it might be helpful to mention this limitation in the documentation.

Examples are not tested

We intend to run the examples in examples/ in CI to ensure they stay up to date, but our setup leaves a lot to be desired.

  • We are currently trying to run the Jupyter notebooks using nbconvert --execute, but for some reason I could not understand, the resulting scripts are not executed. The nbconvert command works fine locally, so I guess this comes down to some configuration difference.

  • #58 introduces pure Python scripts as examples, which are currently not tested at all. We would need to add them to the tox setup as well.

Some tests seem to run multiple times

From https://github.com/lab-cosmo/scikit-cosmo/runs/6552874752?check_suite_focus=true

GLOB sdist-make: /home/runner/work/scikit-cosmo/scikit-cosmo/setup.py
tests create: /home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests
tests installdeps: coverage[toml], parameterized
tests inst: /home/runner/work/scikit-cosmo/scikit-cosmo/.tox/.tmp/package/1/skcosmo-0.1.1.zip
tests installed: coverage==6.4,joblib==1.1.0,numpy==1.22.4,parameterized==0.8.1,scikit-learn==1.1.1,scipy==1.8.1,skcosmo @ file:///home/runner/work/scikit-cosmo/scikit-cosmo/.tox/.tmp/package/1/skcosmo-0.1.1.zip,threadpoolctl==3.1.0,tomli==2.0.1
tests run-test-pre: PYTHONHASHSEED='920532320'
tests run-test: commands[0] | coverage run -m unittest discover -p '*.py'
....................................................../home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/sklearn/linear_model/_ridge.py:251: UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.
  warnings.warn(
.............................................................................../home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/utils/_orthogonalizers.py:57: UserWarning: Column vector contains only zeros.
  warnings.warn("Column vector contains only zeros.")
.........../home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.05 reached.Terminating search at 7 / 10.
  warnings.warn(
/home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.4 reached.Terminating search at 6 / 10.
  warnings.warn(
............................/home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.05 reached.Terminating search at 7 / 10.
  warnings.warn(
/home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.4 reached.Terminating search at 6 / 10.
  warnings.warn(
........./home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.05 reached.Terminating search at 7 / 10.
  warnings.warn(
/home/runner/work/scikit-cosmo/scikit-cosmo/.tox/tests/lib/python3.9/site-packages/skcosmo/_selection.py:214: UserWarning: Score threshold of 0.4 reached.Terminating search at 6 / 10.
  warnings.warn(
.
----------------------------------------------------------------------
Ran 182 tests in 69.540s

OK

The UserWarning "Score threshold of 0.4 reached. Terminating search at 6 / 10." should only appear once, but shows up three times.

Changing https://github.com/lab-cosmo/scikit-cosmo/blob/f433e28a8e5ff13f2dd1aa9a9f2dd2fc3f606218/pyproject.toml#L19 to coverage run -m unittest discover -p "test_sample_simple_fps.py" makes the test run only once, so there might be some strange interaction between unittest discover and coverage.

Add doctests / inline examples in API reference

We already have some end-to-end examples/tutorials. These show how to accomplish a high-level goal with this library ("how do I create a projection of my dataset using KPCovR").

It would be nice to also have "inline" examples inside the reference documentation, showing how to use each function separately ("how do I call FPS.fit with warm_start=True"). For a good example of what this could look like, see https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.transform.

The standard tool for adding such examples in Python is doctest: one writes the input code as if at a Python prompt (>>>), together with the corresponding output; tools can then run the input code and check that the output matches. See the corresponding module in the Python standard library: https://docs.python.org/3/library/doctest.html
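
A minimal, generic doctest sketch (not taken from skcosmo):

def add(a, b):
    """Return the sum of a and b.

    >>> add(1, 2)
    3
    """
    return a + b

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs the >>> examples and compares their output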

The first task for this issue would be to make sure doctests are part of the test suite checked in CI.

We already have some classes with such examples (https://scikit-cosmo.readthedocs.io/en/latest/preprocessing.html#skcosmo.preprocessing.flexible_scaler.KernelNormalizer), but I think all classes should come with such a small example.

The second task for this issue would be to add doctests/examples to all classes in skcosmo.

Missing codecov status on PR

If you click on "show all checks" near the green CI status, the codecov check is missing. Compare this with #38, which still has the check.

Looking through old PRs, the last one with the check was #42, and the first one without was #44.

Originally posted by @Luthaf in #59 (comment)

Pick a license

Before others can start using the code in here, we need to pick a license. I like the BSD 3-Clause license used by sklearn:

BSD 3-Clause License

Copyright (c) 2007-2020 The scikit-learn developers.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

If nobody is opposed to this, I'll add it to this repo!

Setting PCovR regularization is unintuitive

It looks to me like passing the argument alpha when initializing PCovR doesn't really do anything. Instead, if one wants to regularize the estimator, a non-default estimator has to be provided. Passing alpha at instantiation should instead set the estimator's alpha.

It might also be a good idea to rename estimator to regressor, in line with sklearn's TransformedTargetRegressor.
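
A hedged sketch of the current workaround versus the proposed behavior (parameter names are taken from this issue, not from a released API):

from sklearn.linear_model import Ridge
from skcosmo.decomposition import PCovR

# current workaround: regularization only takes effect through the estimator
pcovr = PCovR(mixing=0.5, estimator=Ridge(alpha=1e-3))

# proposed: PCovR(mixing=0.5, alpha=1e-3) would forward alpha to the default
# estimator, and `estimator` would be renamed `regressor`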

Proposed Organization Chart

Before we start migrating from kernel-tutorials, let's decide on an organizational structure to keep things clean.

Here is the proposed plan:

└─── benchmarks
└─── doc
└─── examples
│   README.rst
│   setup.py (to install sklearn-cosmo to python path)
└─── skcosmo
    │   __init__.py
    └─── chemiscope
    │   │    files to convert outputs to chemiscope
    └─── PCovR
    │   │    _base.py
    │   │   KPCovR.py
    │   │   PCovR.py
    │   │   SparseKPcovR.py
    └─── plotting?
    └─── preprocessing
    │   │   kernels.py
    │   │   scalers.py
    └─── selection
    │   │   _base.py
    │   │   CUR.py (incl. PCov-CUR)
    │   │   FPS.py (incl. PCov-FPS)
    └─── sparse (to include SparseKPCovR here or with KPCovR?)
    │   │   SparseKRR.py
    │   │   SparseKPCA.py
    └─── utils
    │   │   general.eig_inv
    │   │   general.normalize_matrix
    │   │   general.center_matrix

Thoughts?

Calculation of the PCovR covariance is incredibly expensive

https://github.com/cosmo-epfl/scikit-cosmo/blob/e078c0efa5546e22274b2fc1432d37199994c8f8/skcosmo/utils/pcovr_utils.py#L60-L77

Since we are using an SVD just to invert the covariance, this is an incredibly expensive operation for tall X, as it also computes the eigenvectors of the Gram matrix, which we don't need. Since we are using the decomposition to invert the covariance, we probably want the whole spectrum (rather than just the top n eigenvectors/eigenvalues), and it would probably be a better idea here to use something like eigh(X.T @ X), at least in the rank >= min(X.shape) case.
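
A rough sketch of the suggested alternative (generic code, not the pcovr_utils implementation):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 50))  # tall matrix: n_samples >> n_features

# eigendecompose the small (n_features x n_features) covariance directly,
# avoiding a full SVD of X and the unneeded Gram-matrix eigenvectors
evals, evecs = eigh(X.T @ X)
keep = evals > 1e-12 * evals.max()     # discard numerically-zero eigenvalues
cov_inv = (evecs[:, keep] / evals[keep]) @ evecs[:, keep].T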

Re-organize tests into module sub-folders

A continuation of #94, in response to #49. Tests should be moved to the corresponding module subfolders. The hitch is tox discovery, which currently only looks at ./tests/*.py. The steps are:

  • Move tests to appropriate subfolders
  • Modify tox framework to find all tests
  • Verify proper CI behavior.

Anything else?

Same index is present multiple times in FPS output during sample selection

Running this code

import numpy as np
import skcosmo.sample_selection

X = np.load("power-spectrum.npy")
print("shape =", X.shape)
print()

fps = skcosmo.sample_selection.FPS(n_to_select=35)
fps.fit(X)
print("selected_idx_ =", fps.selected_idx_)

outputs

shape = (640, 320)

selected_idx_ = [  0 247  45 105  16  56 176  25   9  38 152  72  54 192 131  88  64 168
 212 189 166 202  83 145  96 113 142 125 217 226 233   9   9   9   9]

There are enough samples to select more than 35 of them, but the last one is repeated multiple times in the output, which is unexpected. I'll try to double-check the data to see if we are creating the same sample multiple times for some reason.

power-spectrum.npy.zip

Inconsistent behaviour of selectors when n_select_features > rank(all_features)

CUR should not throw an error if n_select_features > rank(all_features)

import numpy as np
from skcosmo.selection import CUR

X = np.ones((10,2))

n_features = 1
fs = CUR.FeatureCUR(X)
fs.select(n_features)
print("No error because rank of X is 1")

# Error because rank of X is < 2
n_features = 2
fs = CUR.FeatureCUR(X)
fs.select(n_features)

with error output

  [...]
  File "/home/goscinsk/miniconda3/lib/python3.8/site-packages/skcosmo/utils/orthogonalizers.py", line 29, in X_orthogonalizer
    raise ValueError("Cannot orthogonalize by a null vector.")
ValueError: Cannot orthogonalize by a null vector.

For my use cases, I would be fine with CUR emitting a warning and adding random features. An additional option to stop the feature selection in this case would also be nice.
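
A rough sketch of the requested warn-and-pad behavior (helper and variable names are hypothetical):

import warnings
import numpy as np

def pad_selection(selected, n_requested, n_features, rng):
    # pad with random unselected features once the matrix rank is exhausted
    if len(selected) < n_requested:
        warnings.warn("Matrix rank reached; padding selection with random features.")
        remaining = np.setdiff1d(np.arange(n_features), selected)
        pad = rng.choice(remaining, n_requested - len(selected), replace=False)
        selected = np.concatenate([selected, pad])
    return selected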

Pre-fitted regressors with KPCovR

#98 implements, for PCovR, instantiating the class with a potentially pre-fitted regressor. For consistency, it would be nice to have KPCovR use the same approach, instead of instantiating with the regularization and taking Yhat and W as arguments to fit.

Tolerance in StandardFlexibleScaler should be relative

I have these properties I want to scale before using KPCovR:

Y = [
 [5.500e+00 2.400e-29]
 [3.600e+00 2.270e-28]
 [3.750e+00 2.640e-28]
 [3.020e+00 6.700e-29]
 [9.175e-01 8.300e-32]
]

One column only contains very small values, leading to the following exception, which is a bit surprising:

Traceback (most recent call last):
  File "do-kpocvr.py", line 119, in <module>
    Y = StandardFlexibleScaler(column_wise=True).fit_transform(Y)
  File "/opt/miniconda3/lib/python3.7/site-packages/skcosmo/preprocessing/flexible_scaler.py", line 86, in fit_transform
    self.fit(X, y)
  File "/opt/miniconda3/lib/python3.7/site-packages/skcosmo/preprocessing/flexible_scaler.py", line 50, in fit
    raise ValueError("Cannot normalize a feature with zero variance")
ValueError: Cannot normalize a feature with zero variance

I'll change the tol parameter to make this work for now, but I think that instead of checking the absolute value of the variance against the tolerance, we should test the relative variance, i.e. replace np.any(var < self.tol) with np.any(var / np.mean(Y, axis=0) < self.tol) in https://github.com/cosmo-epfl/scikit-cosmo/blob/64969332914aa956e943c8f1824699731510d72d/skcosmo/preprocessing/flexible_scaler.py#L60-L62

PCovR small singular values should be handled more consistently

The handling of small singular values isn't particularly consistent between sample_space and feature_space. This leads to issues particularly when doing sample-space PCovR with the mixing set to zero and the number of components greater than the number of regression targets: the singular values less than tol don't seem to actually get thrown out (or set to zero), and the prediction becomes unstable, yielding completely different results between repeated fit calls on the same data.
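
For illustration, a generic way to discard small singular values consistently (a sketch, not the skcosmo code; the threshold is taken relative to the largest singular value):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5)) * np.array([1.0, 1.0, 1.0, 1e-10, 1e-12])
tol = 1e-8

U, S, Vt = np.linalg.svd(X, full_matrices=False)
keep = S > tol * S[0]                     # S is sorted descending, S[0] is largest
U, S, Vt = U[:, keep], S[keep], Vt[keep]  # actually drop the small singular values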

Flaky failure in test_kpcovr.py

Good afternoon! When running the tests, I found some strange behavior: rarely (about 1 run out of 10-20), test_kpcovr.py gives an error:
Traceback (most recent call last):
File "/home/runner/work/scikit-cosmo/scikit-cosmo/tests/test_kpcovr.py", line 248, in test_linear_matches_pcovr
self.assertEqual(
AssertionError: 0.75 != 0.751

We probably need to allow a larger tolerance for the difference there.
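
A minimal sketch of such a tolerant comparison (test and variable names are hypothetical stand-ins):

import unittest

class ToleranceExample(unittest.TestCase):
    def test_scores_close(self):
        pcovr_score, kpcovr_score = 0.750, 0.751  # values from the failing run
        # passes as long as the scores agree to about two decimal places
        self.assertAlmostEqual(pcovr_score, kpcovr_score, places=2)

if __name__ == "__main__":
    unittest.main()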

GreedySelector cannot be serialized with pickle

I'm using pickle to save the full state of a model to disk and load it later.

Unfortunately, GreedySelector does not currently support pickle:

from skcosmo.feature_selection import FPS
import pickle

fps = FPS(n_features_to_select=23)

# ... do stuff ...

with open("saved-fps.pickle", "wb") as fd:
    pickle.dump(fps, fd)

This results in

Traceback (most recent call last):
  File "pickle-test.py", line 9, in <module>
    pickle.dump(fps, fd)
AttributeError: Can't pickle local object 'GreedySelector.__init__.<locals>.<lambda>'

I think we should add support for pickle, since it is a tool commonly used in Python projects.

We should also check that all other classes support being serialized with pickle. If needed, we can customize the code used to save/load a given class: https://docs.python.org/3/library/pickle.html#pickle-inst
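
For illustration, a minimal sketch of why the pickling fails and the usual fix (not the GreedySelector code): locally defined lambdas cannot be pickled, while module-level functions can.

import pickle

def default_score(x):  # module-level function: pickled by reference
    return sum(x)

class Selector:
    def __init__(self):
        self.score = default_score
        # self.score = lambda x: sum(x)  # AttributeError: Can't pickle local object

pickle.dumps(Selector())  # succeeds with the module-level function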

Dataset handling

I noticed that there's a lot of discussion in several PRs on how to incorporate datasets, so perhaps it's best to discuss it as a general issue. I think the best option would be a separate, persistent storage solution to host them, plus a script to download them; Materials Cloud or Zenodo come to mind. We can discuss with Giovanni Pizzi whether we should rely on a one-DOI-one-dataset construct, or have a schema to access individual files within a record.

Folder and file naming convention

To be closer to the naming conventions of sklearn, some propositions for renaming:

Subpackage naming:

  • pcovr -> decomposition
  • selection -> feature_selection

Filename convention in subpackages:

  • filename.py -> _filename.py (this is not that important, I guess)
  • base classes of a subpackage go into a _base.py

Test naming convention:

  • subpackage/test_filename.py (I wouldn't be strict about the class structure; one test class might cover multiple classes)

Redesign of DCH to separate interpolator and hull algorithm

I discussed with @victorprincipe a possible future redesign of the DCH to provide additional functionality and more flexibility. At the moment it is only used as a scorer. It is hard to make an estimator out of the DCH because, to be consistent with the score function, we would need to predict X_HD; but in the current fit function, X_HD is part of the X input, so predicting X_HD would make fit and predict inconsistent.

So we thought about separating the current DCH into the directional convex hull part and the interpolator part, as two separate classes that are then brought together:

class DCH(TransformerMixin, BaseEstimator) # maybe ClusterMixin
  def __init__(self, low_dim_idx)
  def fit(self, X_LD, y)
  def predict(self, X_LD) # -> y (linear interpolation on directional convex hull)
  def transform(self, X_LD) # -> X[self.selected_idx_] (directional convex hull vertices)
  
class Interpolator(BaseEstimator):
  def __init__(self, interpolator_type)
  def fit(self, X_LD, X_HD)
  def predict(self, X_LD) # -> X_HD

# notebook example
dch = DCH().fit(X_LD, y)
interp = Interpolator().fit(dch.transform(X_LD), X_HD)
interp.score(X_LD, X_HD)

I feel like the last part cannot be abstracted and has to remain a recipe in an example notebook, because it is such a specific use case.

Setup documentation and deploy it

We will need to have documentation available online. I suggest we go with the standard Python documentation tool, i.e. Sphinx, which can extract and render docstrings.

We will need the following parts of documentation:

  • Introduction and how to install & use the code
  • Tutorials, i.e. how to use the code in simple cases. This is currently done in notebooks in kernel-tutorials; we could refer to these notebooks in one way or another for tutorials. EDIT: kernel-tutorials is different; what we need here are "How To" examples
  • API reference, i.e. how to use the code in detail. This will be taken from the docstrings

Regarding deployment, the two main options are GitHub Pages (the docs would live at https://cosmo-epfl.github.io/sklearn-cosmo/ by default) or Read the Docs (https://sklearn-cosmo.readthedocs.io/ by default).

If anyone wants to have a go at setting up the documentation, I'm happy to mentor them. Otherwise, I'll do something in the next few weeks.
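
As a starting point, a minimal sketch of what the Sphinx configuration could look like (values are placeholders, not the project's actual conf.py):

# conf.py -- minimal Sphinx configuration sketch
project = "scikit-cosmo"
extensions = [
    "sphinx.ext.autodoc",   # pull the API reference from docstrings
    "sphinx.ext.napoleon",  # parse NumPy/Google-style docstrings
]
html_theme = "sphinx_rtd_theme"  # readthedocs-style theme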

Feature Request: batched approach to PCov-based selectors

For applications in which reconstructing the modified Gram matrix at each selection step is not practical, it would be great to have the option of selecting batches of features/samples at a time before reconstructing the modified Gram matrix.
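
A rough sketch of the idea, with hypothetical score and update callables standing in for the selector's scoring and Gram-matrix orthogonalization steps (not the skcosmo API):

import numpy as np

def batched_select(score, update, X, n_to_select, batch_size):
    selected = []
    while len(selected) < n_to_select:
        scores = score(X)                # score every candidate once
        scores[selected] = -np.inf       # mask already-selected indices
        k = min(batch_size, n_to_select - len(selected))
        batch = np.argsort(scores)[-k:]  # take the top-k in one shot
        selected.extend(batch.tolist())
        X = update(X, batch)             # one Gram-matrix update per batch
    return np.array(selected)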

PCovR is not centering like PCA

import numpy as np
from skcosmo.decomposition import PCovR
from sklearn.decomposition import PCA

np.random.seed(0)

X = np.random.rand(10,3)+np.array([1,1,1])
y = np.random.rand(10,1) 

# transformation on non centered X
pcovr = PCovR(mixing=1).fit(X, y)
pca = PCA().fit(X)
X_povr_t = pcovr.transform(X)
X_pca_t = pca.transform(X)

# transformation on centered X
X -= np.mean(X,axis=0)[None,:]
pcovr = PCovR(mixing=1).fit(X, y)
pca = PCA().fit(X)
X_povr_t_c = pcovr.transform(X)
X_pca_t_c = pca.transform(X)

print("Difference in transformation between noncentered and centered data")
print("sklearn PCA", np.linalg.norm(X_pca_t-X_pca_t_c))
print("skcosmo PCovR (alpha=1) = PCA", np.linalg.norm(X_povr_t-X_povr_t_c))
Out:
Difference in transformation between noncentered and centered data
sklearn PCA 1.3922169130898628e-15
skcosmo PCovR (alpha=1) = PCA 2.3242628802840968

Computing the mean here https://github.com/lab-cosmo/scikit-cosmo/blob/7ef05ebc73e5ef016d53f3aee1069333c07a9933/skcosmo/decomposition/_pcovr.py#L271
but the features are never actually centered.

EDIT: this does not affect any results, because we always apply the standardizer beforehand.
