
nrdg / groupyr


groupyr: Sparse Group Lasso in Python

Home Page: https://richford.github.io/groupyr

License: BSD 3-Clause "New" or "Revised" License

Languages: Shell 1.17%, Python 96.03%, Makefile 0.56%, TeX 2.24%

groupyr's People

Contributors

arokem, kthyng, richford


groupyr's Issues

sklearn.linear_model._coordinate_descent._alpha_grid does not have a normalize argument anymore

When running:

estimator = SGLCV(groups=groups,
                  n_jobs=n_jobs,
                  random_state=rng,
                  suppress_solver_warnings=False,
                  l1_ratio=1,
                  normalize=True
                  )

estimator.fit(X,y)

I get the following error:

TypeError: _alpha_grid() got an unexpected keyword argument 'normalize'

I searched recent commits in sklearn; as of this commit, sklearn.linear_model._coordinate_descent._alpha_grid no longer accepts the normalize parameter.
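
A possible user-side workaround, sketched below (not necessarily the eventual groupyr fix): drop normalize=True and scale the features yourself. Note that sklearn's deprecated normalize divided centered features by their l2 norm rather than the standard deviation, so results may differ slightly from the old behavior.

# Sketch of a workaround: scale X explicitly instead of passing
# normalize=True (X, y, groups, n_jobs, and rng as in the snippet above).
from sklearn.preprocessing import StandardScaler
from groupyr import SGLCV

X_scaled = StandardScaler().fit_transform(X)

estimator = SGLCV(
    groups=groups,
    n_jobs=n_jobs,
    random_state=rng,
    suppress_solver_warnings=False,
    l1_ratio=1,
)
estimator.fit(X_scaled, y)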

Allow transformers to return the intersection of groups

Describe the workflow you want to enable

GroupShuffler, GroupRemover, and GroupExtractor return the union of group labels when a sequence is passed for the select parameter. The user should also be able to get the intersection of group labels.

Describe your proposed solution

Add a select_intersection kwarg to all of the above transformers.
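
A minimal sketch of how the proposed flag could behave (select_intersection is not an existing groupyr kwarg; group_names here follows groupyr's convention of a tuple of labels per group):

def select_group_indices(group_names, select, select_intersection=False):
    """Return indices of groups whose label tuples match ``select``."""
    # union keeps groups matching ANY selected label;
    # intersection keeps groups matching ALL of them.
    combine = all if select_intersection else any
    return [
        idx
        for idx, labels in enumerate(group_names)
        if combine(s in labels for s in select)
    ]

With group_names = [("left", "cst"), ("right", "cst"), ("left", "arc")] and select=["left", "cst"], the union behavior returns [0, 1, 2] while the intersection behavior returns only [0].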

Add usage guide to documentation

  • Add brief motivation for why one might want to use SGL (a minimal opener is sketched after this list).
  • Talk about specification of the groups parameter.
  • Talk about the SGL class.
  • Talk about the LogisticSGL class.
  • Talk about the CV estimators.
  • Talk about the dataset generators.
  • Make liberal use of links to the sklearn documentation.
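
For reference, a sketch of the kind of snippet the guide could open with (assumes make_group_regression and the chosen_groups_ attribute behave as in the current API):

from groupyr import SGL
from groupyr.datasets import make_group_regression

# Simulated grouped data from groupyr's own dataset generator.
X, y, groups = make_group_regression(random_state=42)
model = SGL(groups=groups, l1_ratio=0.5, alpha=1.0)
model.fit(X, y)
print(model.chosen_groups_)  # groups with nonzero coefficients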

Add FAQ page to documentation

Questions:

  • Why groupyr? There are a few other implementations of SGL out there. Why did we bother making another?
    Answer should highlight the performance differences and additional CV capabilities. Also hint at future penalties like Fused SGL.
  • How do we pronounce groupyr? It's like the fish, "grouper".

What else?

Accept more formats for the `groups` parameter

The common groups parameter currently accepts the explicit standard group format (i.e., a list of numpy arrays, where each array represents a group and each element is a feature index belonging to that group) or None, in which case all features are assigned to a single group.

We should also allow users to input a single integer s, in which case we will create contiguous groups of size s until all features are accounted for.
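
For example, a sketch of that behavior:

import numpy as np

def contiguous_groups(n_features, s):
    # groups=s would expand to contiguous groups of size s; the last
    # group may be smaller if s does not divide n_features evenly.
    return [np.arange(start, min(start + s, n_features))
            for start in range(0, n_features, s)]

# contiguous_groups(10, 4) -> [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([8, 9])]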

@arokem, @mnarayan Any other input behaviors we'd like to try?

Extracting the regularisation path and coefficients along it

A neat feature of the LASSO and its software implementation is the ability to extract the regularisation path, including the coefficients along the way. It appears that sgl_path() offers such functionality, but it is either not exposed to the user or its usage is not described in the docs. Would it be possible to add an example of how to extract coefficients along the regularisation path?
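
In the meantime, a sketch of one way to trace the path using only the public API (refitting over an alpha grid; slower than a dedicated path routine, and the grid here is illustrative):

import numpy as np
from groupyr import SGL

# X, y, and groups are assumed to be defined already.
alphas = np.logspace(0, -3, 30)
coefs = []
for alpha in alphas:
    model = SGL(groups=groups, l1_ratio=0.5, alpha=alpha).fit(X, y)
    coefs.append(model.coef_)
coef_path = np.vstack(coefs)  # shape (n_alphas, n_features)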

Add more examples to documentation

We should add more examples to the documentation. Some ideas:

  • An example plotting the regularization path of SGL
  • An example with SGL (no CV)
  • An example with LogisticSGL (no CV)
  • An example using SGLCV in a pipeline (sketched below)
  • An example using SGL as a transformer in a pipeline to another model
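
For the pipeline idea, a sketch (assumes X, y, and groups are defined; the l1_ratio grid is illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from groupyr import SGLCV

pipe = make_pipeline(
    StandardScaler(),
    SGLCV(groups=groups, l1_ratio=[0.05, 0.5, 1.0], cv=3),
)
pipe.fit(X, y)
print(pipe[-1].alpha_, pipe[-1].l1_ratio_)  # selected hyperparameters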

KeyError accessing cv_results_ dict

Using the following function (inspired by this notebook)

import joblib
import numpy as np
from datetime import datetime

import afqinsight as afqi  # assumed alias for AFQ-Insight, per the pipeline call below
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.model_selection import RepeatedKFold

# X, y, and groups are defined earlier in the notebook.
def get_cv_results(n_repeats=5, n_splits=10,
                   shuffle=False,
                   ensembler=None,
                   target_transform_func=None,
                   target_transform_inverse_func=None,
                   n_estimators=10,
                   trim_nodes=0,
                   square_features=False):
    if shuffle:
        rng = np.random.default_rng()
        y_fit = rng.permutation(y)
    else:
        y_fit = np.copy(y)
        
    if trim_nodes > 0:
        grp_mask = np.zeros_like(groups[0], dtype=bool)
        grp_mask[trim_nodes:-trim_nodes] = True
        X_mask = np.concatenate([grp_mask] * len(groups))

        groups_trim = []
        start_idx = 0
        
        for grp in groups:
            stop_idx = start_idx + len(grp) - 2 * trim_nodes
            groups_trim.append(np.arange(start_idx, stop_idx))
            start_idx += len(grp) - 2 * trim_nodes
            
        X_trim = X[:, X_mask]
    elif trim_nodes == 0:
        groups_trim = [grp for grp in groups]
        X_trim = np.copy(X)
    else:
        raise ValueError("trim_nodes must be non-negative.")
        
    if square_features:
        _n_samples, _n_features = X_trim.shape
        X_trim = np.hstack([X_trim, np.square(X_trim)])
        groups_trim = [np.concatenate([g, g + _n_features]) for g in groups_trim]
    
    cv = RepeatedKFold(
        n_splits=n_splits,
        n_repeats=n_repeats,
        random_state=1729
    )

    cv_results = {}
    pipe_skopt = afqi.make_afq_regressor_pipeline(
        imputer_kwargs={"strategy": "median"},
        use_cv_estimator=True,
        scaler="standard",
        groups=groups_trim,
        verbose=0,
        pipeline_verbosity=False,
        tuning_strategy="bayes",
        cv=3,
        n_bayes_points=9,
        n_jobs=28,
        l1_ratio=[0.0, 1.0],
        eps=5e-2,
        n_alphas=100,
        ensemble_meta_estimator=ensembler,
        ensemble_meta_estimator_kwargs={
            "n_estimators": n_estimators,
            "n_jobs": 1,
            "oob_score": True,
            "random_state": 1729,
        },
        target_transform_func=target_transform_func,
        target_transform_inverse_func=target_transform_inverse_func,
    )

    for cv_idx, (train_idx, test_idx) in enumerate(cv.split(X_trim, y_fit)):
        start = datetime.now()

        X_train, X_test = X_trim[train_idx], X_trim[test_idx]
        y_train, y_test = y_fit[train_idx], y_fit[test_idx]

        with joblib.parallel_backend("dask"):
            pipe_skopt.fit(X_train, y_train)

        cv_results[cv_idx] = {
            "pipeline": pipe_skopt,
            "train_idx": train_idx,
            "test_idx": test_idx,
            "y_pred": pipe_skopt.predict(X_test),
            "y_true": y_test,
            "test_mae": median_absolute_error(y_test, pipe_skopt.predict(X_test)),
            "train_mae": median_absolute_error(y_train, pipe_skopt.predict(X_train)),
            "test_r2": r2_score(y_test, pipe_skopt.predict(X_test)),
            "train_r2": r2_score(y_train, pipe_skopt.predict(X_train)),
        }
        
        if ((target_transform_func is not None)
            or (target_transform_inverse_func is not None)):
            cv_results[cv_idx]["coefs"] = [
                est.coef_ for est
                in pipe_skopt.named_steps["estimate"].regressor_.estimators_
            ]
            cv_results[cv_idx]["alpha"] = [
                est.alpha_ for est
                in pipe_skopt.named_steps["estimate"].regressor_.estimators_
            ]
            cv_results[cv_idx]["l1_ratio"] = [
                est.l1_ratio_ for est
                in pipe_skopt.named_steps["estimate"].regressor_.estimators_
            ]
        else:
            cv_results[cv_idx]["coefs"] = [
                est.coef_ for est
                in pipe_skopt.named_steps["estimate"].estimators_
            ]
            cv_results[cv_idx]["alpha"] = [
                est.alpha_ for est
                in pipe_skopt.named_steps["estimate"].estimators_
            ]
            cv_results[cv_idx]["l1_ratio"] = [
                est.l1_ratio_ for est
                in pipe_skopt.named_steps["estimate"].estimators_
            ]
        
        if ensembler is None:
            if ((target_transform_func is not None)
                or (target_transform_inverse_func is not None)):
                cv_results[cv_idx]["optimizer"] = pipe_skopt.named_steps["estimate"].regressor_.bayes_optimizer_                
            else:
                cv_results[cv_idx]["optimizer"] = pipe_skopt.named_steps["estimate"].bayes_optimizer_

        print(f"CV index [{cv_idx:3d}], Elapsed time: ", datetime.now() - start)
        
    return cv_results, y_fit

raises:

KeyError: 'param__alpha'

Further debugging shows that the dict in question has a key param_alpha, with only one underscore. Fix to follow.
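
For illustration, a sketch of the one-underscore access that works (assuming the stored optimizer is a skopt BayesSearchCV instance with an sklearn-style cv_results_ dict, as saved in cv_results above):

opt = cv_results[0]["optimizer"]
alphas = opt.cv_results_["param_alpha"]  # "param__alpha" raises KeyError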

Implement custom _path_residuals function

SGLCV currently relies on sklearn's _path_residuals, but that returns only the MSE. We'd like it to return both custom scorer results and the fitted coefficients, as the logistic case already does.

Tagging @mnarayan since this came up in conversation.

Add examples and doctests for the group transformers

Following #51, we should write an example demonstrating use of the group transformers. Actually, something with the Sarica data would be a nice example. We could add a function to the datasets module to download the Sarica data as well.

Does groupyr.LogisticSGL support multiclass classification, and if so, how should coef_ be interpreted?

Hi! First of all, thanks for this wonderful package! I have a dataset with three subject groups and wonder if I can use groupyr.LogisticSGL in this case. I couldn't find any documentation if groupyr.LogisticSGL only supports binary classification. It runs through, so I guess multiclass classification is supported? And if yes, how would I interpret the (n_features,) coef_ attribute? Is groupyr.LogisticSGL implicitly running a one-vs-rest classification in the background?
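
A quick way to check what the fitted model saw (a sketch; classes_ is assumed per the usual sklearn classifier convention):

from groupyr import LogisticSGL

# X, y (three classes), and groups as in your dataset.
clf = LogisticSGL(groups=groups).fit(X, y)
print(clf.classes_)     # the class labels the model registered
print(clf.coef_.shape)  # (n_features,) would suggest a binary-style fit

If coef_ stays one-dimensional with three classes, the model is most likely not running one-vs-rest internally, which would also be consistent with the predict_proba issue reported below.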

predict_proba method returns array of size (n_samples, 2) regardless of the number of classes

Describe the bug

Using the predict_proba method produces an array inconsistent with the number of classes: it always has shape (n_samples, 2).

Steps/Code to Reproduce

from groupyr.datasets import make_group_classification
from groupyr import LogisticSGL
from sklearn.model_selection import train_test_split

X, y, groups = make_group_classification(n_samples=500, n_classes=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

model = LogisticSGL(groups=groups)
model.fit(X_train, y_train)
model.predict_proba(X_test)

Expected Results

I expected an array of shape (50, 8) but got one of shape (50, 2).

Actual Results

array([[0.00000000e+00, 1.00000000e+00],
[5.68755714e-01, 4.31244286e-01],
[2.83550960e-13, 1.00000000e+00],
[8.64863736e-13, 1.00000000e+00],
[4.52056170e-11, 1.00000000e+00],
[0.00000000e+00, 1.00000000e+00],
[2.44249065e-15, 1.00000000e+00],
[1.98451813e-06, 9.99998015e-01],
[4.29922467e-05, 9.99957008e-01],
[0.00000000e+00, 1.00000000e+00],
[1.96334715e-07, 9.99999804e-01],
[7.24888152e-06, 9.99992751e-01],
[9.99999571e-01, 4.29150334e-07],
[7.86086434e-04, 9.99213914e-01],
[2.69039440e-01, 7.30960560e-01],
[9.99999302e-01, 6.97585749e-07],
[8.18352053e-11, 1.00000000e+00],
[3.08642001e-12, 1.00000000e+00],
[8.88178420e-16, 1.00000000e+00],
[6.08360810e-01, 3.91639190e-01],
[1.30118138e-13, 1.00000000e+00],
[6.21724894e-15, 1.00000000e+00],
[1.00000000e+00, 2.93300056e-12],
[4.54094465e-08, 9.99999955e-01],
[4.25080449e-04, 9.99574920e-01],
[8.67024220e-01, 1.32975780e-01],
[9.99974998e-01, 2.50021319e-05],
[7.29032390e-11, 1.00000000e+00],
[3.45561157e-09, 9.99999997e-01],
[2.72104561e-11, 1.00000000e+00],
[0.00000000e+00, 1.00000000e+00],
[4.69719716e-08, 9.99999953e-01],
[9.85393106e-07, 9.99999015e-01],
[0.00000000e+00, 1.00000000e+00],
[6.70574707e-14, 1.00000000e+00],
[3.20425858e-07, 9.99999680e-01],
[9.99990235e-01, 9.76457155e-06],
[2.53925407e-02, 9.74607459e-01],
[9.96085916e-01, 3.91408362e-03],
[5.13633580e-12, 1.00000000e+00],
[9.99992114e-01, 7.88617592e-06],
[4.44089210e-16, 1.00000000e+00],
[1.47470647e-03, 9.98525294e-01],
[4.82376618e-02, 9.51762338e-01],
[1.15463195e-14, 1.00000000e+00],
[1.34734691e-06, 9.99998653e-01],
[8.27309332e-11, 1.00000000e+00],
[1.10023102e-12, 1.00000000e+00],
[1.33504319e-11, 1.00000000e+00],
[6.41155066e-08, 9.99999936e-01]])

Versions

import groupyr as gpr
print(gpr.__version__)

0.2.7

SGLCV fails with too few observations for CV

Describe the bug

When there are too few observations for the CV, SGLCV fails with an uninformative UnboundLocalError. This happens with groupyr 0.2.6; if I recall correctly, I didn't have this problem with 0.2.4.

Steps/Code to Reproduce

import numpy as np
import groupyr as gr

y = np.array([8.35686197e-01, 7.79143707e-01, 9.68885893e-01, 6.00364059e-01,
              8.90818433e-01, 4.50071502e-01, 5.50324868e-04, 3.23702083e-01,
              3.26413651e-01])
X = np.array([[0.95834536, 0.24640152, 0.91383425, 0.36952137],
              [0.18028435, 0.34682591, 0.43773007, 0.7074315],
              [0.54305304, 0.55150522, 0.03017366, 0.07321698],
              [0.49662785, 0.17114838, 0.61342598, 0.15094963],
              [0.66625233, 0.38015984, 0.51422898, 0.66124242],
              [0.95193769, 0.10298654, 0.03773045, 0.21904723],
              [0.34889582, 0.04983091, 0.13862843, 0.23390294],
              [0.05570983, 0.65507907, 0.74365214, 0.99539654],
              [0.01563651, 0.75173544, 0.56747472, 0.31385082]])
l1_ratio = 0.0008299164840661392
groups = [np.array([0, 1]), np.array([2, 3])]

model = gr.SGLCV(
    l1_ratio=l1_ratio,
    groups=groups,
    scale_l2_by="group_length",
    cv=5,
    random_state=1234,
).fit(X=X, y=y)

Expected Results

A clear error message explaining why it didn't work.

Actual Results

/path/to/lib/python3.8/site-packages/sklearn/metrics/_regression.py:796: UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.
  warnings.warn(msg, UndefinedMetricWarning)
[the UndefinedMetricWarning is repeated several times]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/path/to/lib/python3.8/site-packages/groupyr/sgl.py", line 1120, in fit
    self.l1_ratio_ = best_l1_ratio
UnboundLocalError: local variable 'best_l1_ratio' referenced before assignment

However, if I use fewer folds, it works:

model = gr.SGLCV(
    l1_ratio=l1_ratio,
    groups=groups,
    scale_l2_by="group_length",
    cv=3,
    random_state=1234,
).fit(X=X, y=y)

Comment

I think the error occurs because one fold contains only 1 observation, which leads to an ill-defined R^2 metric and, later on, to uncaught errors in the groupyr code. I'm not well versed in scikit-learn, so I don't know whether a fix belongs in scikit-learn or in groupyr. Either way, it would be nice to get an informative error message instead of an error from groupyr internals; a sketch of such a check follows.
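
One possible shape for that check, as a sketch (not current groupyr behavior):

from sklearn.model_selection import check_cv

def check_fold_sizes(cv, X, y, min_test_size=2):
    # Fail early with a clear message instead of an UnboundLocalError.
    for train_idx, test_idx in check_cv(cv).split(X, y):
        if len(test_idx) < min_test_size:
            raise ValueError(
                f"A CV test fold has only {len(test_idx)} observation(s); "
                "R^2-based model selection needs at least 2 per fold. "
                "Use fewer folds or provide more samples."
            )

With the nine observations above, check_fold_sizes(5, X, y) raises, while check_fold_sizes(3, X, y) passes.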

Versions

groupyr 0.2.6
scikit-learn 1.0.2
scikit-optimize 0.9.0

[JOSS review] Installing groupyr installs unnecessary dependencies

Describe the bug

Hi, as part of the JOSS review I'll open a few issues. If you'd prefer, I can compress them all into one issue instead.

Steps/Code to Reproduce

Installing the latest groupyr version on PyPI also installs ipywidgets (and its large list of dependencies). As far as I can tell, it isn't required by the library (searching for ipywidgets in the source code doesn't yield any results).

Expected Results

Only dependencies required for groupyr functionality to be installed.

Versions

0.2.0

Migrate transformers from AFQ-Insight

AFQ-Insight has a few functions and transformer classes to select, remove, or manipulate individual groups or subsets of groups in a feature matrix. There is nothing tractometry-specific about this, so it should live in groupyr.

We can capture all of the functionality in afqinsight.transform by adding the following transformer classes:

  • GroupExtractor: to select and return certain groups from the feature matrix
  • GroupRemover: to remove certain groups from the feature matrix
  • GroupShuffler: to shuffle certain groups within a feature matrix without touching the other groups
  • GroupConcatenator: to concatenate feature matrices with different groups

@arokem, @mnarayan: your thoughts?
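
A sketch of how two of these might be used (API names as proposed above; assumes X and groups are defined):

from groupyr.transform import GroupExtractor, GroupRemover

# Keep only the first two groups, or drop group 0 entirely.
X_first_two = GroupExtractor(select=[0, 1], groups=groups).fit_transform(X)
X_without_0 = GroupRemover(select=0, groups=groups).fit_transform(X)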

All transformers and PCA classes should provide a get_feature_names method

Describe the workflow you want to enable

Many sklearn transformers provide a get_feature_names() method, and things like FeatureUnion rely on those underlying methods to generate downstream feature names. All of the groupyr transformers should provide this method.

Describe your proposed solution

Many of the transformers already compute something like a feature_names_out_ attribute. We should simply add a get_feature_names() method that returns it.
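
A minimal sketch of that addition (assumes the transformer sets feature_names_out_ during fit, as described above):

class GetFeatureNamesMixin:
    def get_feature_names(self):
        # Mirror the sklearn convention of returning output feature names.
        return list(self.feature_names_out_)

Each groupyr transformer could then inherit from this mixin.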

Installing scikit-learn has changed: it now has to be scikit-learn, not sklearn

Hi! Your package is part of a .yml file that I use to create a conda environment. I am not 100% sure, but I think groupyr is responsible for the error I get:

[screenshot: Screenshot_20230929_155025]

Installing scikit-learn via pip has recently changed from pip install sklearn to pip install scikit-learn, so I think your requirements file needs to be updated?

Deprecation warning

When using groupyr.SGLCV I get:

/home/johannes.wiesner/.conda/envs/csp_wiesner_johannes/lib/python3.9/site-packages/copt/utils.py:41: DeprecationWarning:

Please use MemoizeJac from the scipy.optimize namespace, the scipy.optimize.optimize namespace is deprecated.
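
Until copt updates its scipy import (the real fix belongs upstream in copt, not in groupyr), a user-side sketch to silence just this warning:

import warnings

# Match any DeprecationWarning mentioning MemoizeJac, regardless of
# exact message formatting.
warnings.filterwarnings(
    "ignore", message=".*MemoizeJac.*", category=DeprecationWarning
)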

Expose scoring to SGLCV and LogisticSGLCV

Describe the workflow you want to enable

sgl_scoring_path and logistic_sgl_scoring_path have a scoring parameter that lets the user specify an alternative scoring metric, but SGLCV and LogisticSGLCV do not expose this parameter. We should fix that; the desired API is sketched below.
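
A sketch of the desired API (the scoring kwarg on SGLCV does not exist yet; the name mirrors the existing parameter on sgl_scoring_path):

from groupyr import SGLCV

# X, y, and groups assumed to be defined.
model = SGLCV(groups=groups, scoring="neg_median_absolute_error", cv=3)
model.fit(X, y)  # model selection would use the supplied scorer, not MSE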

Optimizing over a matrix of parameters.

For the least-squares problem $\operatorname{argmin}_B \|F - BX\|_2^2 + \lambda \sum_l \|B_l\|_2$, where $F, X \in \mathbb{R}^{n \times k}$, $B \in \mathbb{R}^{n \times n}$, and $B_l \in \mathbb{R}^{n \times n_l}$, I used to break it into $n$ separate equations. If $n$ is large, that takes too much time to solve. It would be better if the framework supported optimizing over the matrix $B$ at once. If it is doable based on the current work, I am happy to contribute, but is there something I need to know before doing it?
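
For reference, a sketch of the row-by-row workaround described above (each row of $B$ is fit independently, so the group norms are applied per row; l1_ratio=0.0 gives the pure group-lasso penalty):

import numpy as np
from groupyr import SGL

def rowwise_sgl(F, X, groups, lam):
    # F is approximately B @ X with F, X of shape (n, k), so row i of B
    # solves a regression of F[i] (length k) on the design matrix X.T (k x n).
    B = np.empty((F.shape[0], X.shape[0]))
    for i in range(F.shape[0]):
        model = SGL(groups=groups, alpha=lam, l1_ratio=0.0)
        B[i] = model.fit(X.T, F[i]).coef_
    return B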

Add sklearn>=0.24.0 support

Once this PR is merged, we will be able to loosen the dependency requirements to allow sklearn>=0.24.0 and scipy>=1.6.0.

This issue serves as a reminder to do that and to update tox.ini accordingly.
