
nrdg / groupyr


groupyr: Sparse Group Lasso in Python

Home Page: https://richford.github.io/groupyr

License: BSD 3-Clause "New" or "Revised" License

Languages: Shell 1.17%, Python 96.03%, Makefile 0.56%, TeX 2.24%

groupyr's People

Contributors

arokem, kthyng, richford


groupyr's Issues

sklearn.linear_model._coordinate_descent._alpha_grid does not have a normalize argument anymore

When running:

estimator = SGLCV(groups=groups,
                  n_jobs=n_jobs,
                  random_state=rng,
                  suppress_solver_warnings=False,
                  l1_ratio=1,
                  normalize=True
                  )

estimator.fit(X,y)

I get the following error:

TypeError: _alpha_grid() got an unexpected keyword argument 'normalize'

I searched recent commits in sklearn; as of this commit, sklearn.linear_model._coordinate_descent._alpha_grid no longer accepts the normalize parameter.
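
A possible user-side workaround, sketched below (not necessarily the eventual groupyr fix): drop normalize=True and scale the features yourself. Note that sklearn's deprecated normalize divided centered features by their l2 norm rather than the standard deviation, so results may differ slightly from the old behavior.

# Sketch of a workaround: scale X explicitly instead of passing
# normalize=True (X, y, groups, n_jobs, and rng as in the snippet above).
from sklearn.preprocessing import StandardScaler
from groupyr import SGLCV

X_scaled = StandardScaler().fit_transform(X)

estimator = SGLCV(
    groups=groups,
    n_jobs=n_jobs,
    random_state=rng,
    suppress_solver_warnings=False,
    l1_ratio=1,
)
estimator.fit(X_scaled, y)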

Allow transformers to return the intersection of groups

Describe the workflow you want to enable

GroupShuffler, GroupRemover, and GroupExtractor return the union of group labels when a sequence is passed for the select parameter. The user should also be able to get the intersection of group labels.

Describe your proposed solution

Add a select_intersection kwarg to all of the above transformers.
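
A minimal sketch of how the proposed flag could behave (select_intersection is not an existing groupyr kwarg; group_names here follows groupyr's convention of a tuple of labels per group):

def select_group_indices(group_names, select, select_intersection=False):
    """Return indices of groups whose label tuples match ``select``."""
    # union keeps groups matching ANY selected label;
    # intersection keeps groups matching ALL of them.
    combine = all if select_intersection else any
    return [
        idx
        for idx, labels in enumerate(group_names)
        if combine(s in labels for s in select)
    ]

With group_names = [("left", "cst"), ("right", "cst"), ("left", "arc")] and select=["left", "cst"], the union behavior returns [0, 1, 2] while the intersection behavior returns only [0].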

Add usage guide to documentation

  • Add brief motivation for why one might want to use SGL (a minimal opener is sketched after this list).
  • Talk about specification of the groups parameter.
  • Talk about the SGL class.
  • Talk about the LogisticSGL class.
  • Talk about the CV estimators.
  • Talk about the dataset generators.
  • Make liberal use of links to the sklearn documentation.
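
For reference, a sketch of the kind of snippet the guide could open with (assumes make_group_regression and the chosen_groups_ attribute behave as in the current API):

from groupyr import SGL
from groupyr.datasets import make_group_regression

# Simulated grouped data from groupyr's own dataset generator.
X, y, groups = make_group_regression(random_state=42)
model = SGL(groups=groups, l1_ratio=0.5, alpha=1.0)
model.fit(X, y)
print(model.chosen_groups_)  # groups with nonzero coefficients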

Add FAQ page to documentation

Questions:

  • Why groupyr? There are a few other implementations of SGL out there. Why did we bother making another?
    Answer should highlight the performance differences and additional CV capabilities. Also hint at future penalties like Fused SGL.
  • How do we pronounce groupyr? It's like the fish, "grouper".

What else?

Accept more formats for the `groups` parameter

The common groups parameter currently accepts the explicit standard group format (i.e., a list of numpy arrays, where each array represents a group and each element is a feature index belonging to that group) or None, in which case all features are assigned to a single group.

We should also allow users to input a single integer s, in which case we will create contiguous groups of size s until all features are accounted for.
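
For example, a sketch of that behavior:

import numpy as np

def contiguous_groups(n_features, s):
    # groups=s would expand to contiguous groups of size s; the last
    # group may be smaller if s does not divide n_features evenly.
    return [np.arange(start, min(start + s, n_features))
            for start in range(0, n_features, s)]

# contiguous_groups(10, 4) -> [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([8, 9])]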

@arokem, @mnarayan Any other input behaviors we'd like to try?

Extracting the regularisation path and coefficients along it

A neat feature of the LASSO and its software implementation is the ability to extract the regularisation path, including the coefficients along the way. It appears that sgl_path() offers such functionality, but it is either not exposed to the user or its usage is not described in the docs. Would it be possible to add an example of how to extract coefficients along the regularisation path?
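
In the meantime, a sketch of one way to trace the path using only the public API (refitting over an alpha grid; slower than a dedicated path routine, and the grid here is illustrative):

import numpy as np
from groupyr import SGL

# X, y, and groups are assumed to be defined already.
alphas = np.logspace(0, -3, 30)
coefs = []
for alpha in alphas:
    model = SGL(groups=groups, l1_ratio=0.5, alpha=alpha).fit(X, y)
    coefs.append(model.coef_)
coef_path = np.vstack(coefs)  # shape (n_alphas, n_features)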

Add more examples to documentation

We should add more examples to the documentation. Some ideas:

  • An example plotting the regularization path of SGL
  • An example with SGL (no CV)
  • An example with LogisticSGL (no CV)
  • An example using SGLCV in a pipeline (sketched below)
  • An example using SGL as a transformer in a pipeline to another model
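
For the pipeline idea, a sketch (assumes X, y, and groups are defined; the l1_ratio grid is illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from groupyr import SGLCV

pipe = make_pipeline(
    StandardScaler(),
    SGLCV(groups=groups, l1_ratio=[0.05, 0.5, 1.0], cv=3),
)
pipe.fit(X, y)
print(pipe[-1].alpha_, pipe[-1].l1_ratio_)  # selected hyperparameters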

KeyError accessing cv_results_ dict

Using the following function (inspired by this notebook)

import joblib
import numpy as np
from datetime import datetime

import afqinsight as afqi  # assumed alias for AFQ-Insight, per the pipeline call below
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.model_selection import RepeatedKFold

# X, y, and groups are defined earlier in the notebook.
def get_cv_results(n_repeats=5, n_splits=10,
                   shuffle=False,
                   ensembler=None,
                   target_transform_func=None,
                   target_transform_inverse_func=None,
                   n_estimators=10,
                   trim_nodes=0,
                   square_features=False):
    if shuffle:
        rng = np.random.default_rng()
        y_fit = rng.permutation(y)
    else:
        y_fit = np.copy(y)
        
    if trim_nodes > 0:
        grp_mask = np.zeros_like(groups[0], dtype=bool)
        grp_mask[trim_nodes:-trim_nodes] = True
        X_mask = np.concatenate([grp_mask] * len(groups))

        groups_trim = []
        start_idx = 0
        
        for grp in groups:
            stop_idx = start_idx + len(grp) - 2 * trim_nodes
            groups_trim.append(np.arange(start_idx, stop_idx))
            start_idx += len(grp) - 2 * trim_nodes
            
        X_trim = X[:, X_mask]
    elif trim_nodes == 0:
        groups_trim = [grp for grp in groups]
        X_trim = np.copy(X)
    else:
        raise ValueError("trim_nodes must be non-negative.")
        
    if square_features:
        _n_samples, _n_features = X_trim.shape
        X_trim = np.hstack([X_trim, np.square(X_trim)])
        groups_trim = [np.concatenate([g, g + _n_features]) for g in groups_trim]
    
    cv = RepeatedKFold(
        n_splits=n_splits,
        n_repeats=n_repeats,
        random_state=1729
    )

    cv_results = {}
    pipe_skopt = afqi.make_afq_regressor_pipeline(
        imputer_kwargs={"strategy": "median"},
        use_cv_estimator=True,
        scaler="standard",
        groups=groups_trim,
        verbose=0,
        pipeline_verbosity=False,
        tuning_strategy="bayes",
        cv=3,
        n_bayes_points=9,
        n_jobs=28,
        l1_ratio=[0.0, 1.0],
        eps=5e-2,
        n_alphas=100,
        ensemble_meta_estimator=ensembler,
        ensemble_meta_estimator_kwargs={
            "n_estimators": n_estimators,
            "n_jobs": 1,
            "oob_score": True,
            "random_state": 1729,
        },
        target_transform_func=target_transform_func,
        target_transform_inverse_func=target_transform_inverse_func,
    )

    for cv_idx, (train_idx, test_idx) in enumerate(cv.split(X_trim, y_fit)):
        start = datetime.now()

        X_train, X_test = X_trim[train_idx], X_trim[test_idx]
        y_train, y_test = y_fit[train_idx], y_fit[test_idx]

        with joblib.parallel_backend("dask"):
            pipe_skopt.fit(X_train, y_train)

        cv_results[cv_idx] = {
            "pipeline": pipe_skopt,
            "train_idx": train_idx,
            "test_idx": test_idx,
            "y_pred": pipe_skopt.predict(X_test),
            "y_true": y_test,
            "test_mae": median_absolute_error(y_test, pipe_skopt.predict(X_test)),
            "train_mae": median_absolute_error(y_train, pipe_skopt.predict(X_train)),
            "test_r2": r2_score(y_test, pipe_skopt.predict(X_test)),
            "train_r2": r2_score(y_train, pipe_skopt.predict(X_train)),
        }
        
        if ((target_transform_func is not None)
            or (target_transform_inverse_func is not None)):
            cv_results[cv_idx]["coefs"] = [
                est.coef_ for est
                in pipe_skopt.named_steps["estimate"].regressor_.estimators_
            ]
            cv_results[cv_idx]["alpha"] = [
                est.alpha_ for est
                in pipe_skopt.named_steps["estimate"].regressor_.estimators_
            ]
            cv_results[cv_idx]["l1_ratio"] = [
                est.l1_ratio_ for est
                in pipe_skopt.named_steps["estimate"].regressor_.estimators_
            ]
        else:
            cv_results[cv_idx]["coefs"] = [
                est.coef_ for est
                in pipe_skopt.named_steps["estimate"].estimators_
            ]
            cv_results[cv_idx]["alpha"] = [
                est.alpha_ for est
                in pipe_skopt.named_steps["estimate"].estimators_
            ]
            cv_results[cv_idx]["l1_ratio"] = [
                est.l1_ratio_ for est
                in pipe_skopt.named_steps["estimate"].estimators_
            ]
        
        if ensembler is None:
            if ((target_transform_func is not None)
                or (target_transform_inverse_func is not None)):
                cv_results[cv_idx]["optimizer"] = pipe_skopt.named_steps["estimate"].regressor_.bayes_optimizer_                
            else:
                cv_results[cv_idx]["optimizer"] = pipe_skopt.named_steps["estimate"].bayes_optimizer_

        print(f"CV index [{cv_idx:3d}], Elapsed time: ", datetime.now() - start)
        
    return cv_results, y_fit

raises:

KeyError: 'param__alpha'

Further debugging shows that the dict in question has a key param_alpha, with only one underscore. Fix to follow.
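
For illustration, a sketch of the one-underscore access that works (assuming the stored optimizer is a skopt BayesSearchCV instance with an sklearn-style cv_results_ dict, as saved in cv_results above):

opt = cv_results[0]["optimizer"]
alphas = opt.cv_results_["param_alpha"]  # "param__alpha" raises KeyError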

Implement custom _path_residuals function

SGLCV currently relies on sklearn's _path_residuals, but that returns only the MSE. We'd like it to return both custom scorer results and the fitted coefficients, as the logistic case already does.

Tagging @mnarayan since this came up in conversation.

Add examples and doctests for the group transformers

Following #51, we should write an example demonstrating use of the group transformers. Actually, something with the Sarica data would be a nice example. We could add a function to the datasets module to download the Sarica data as well.

Does groupyr.LogisticSGL support multiclass classification, and if so, how should coef_ be interpreted?

Hi! First of all, thanks for this wonderful package! I have a dataset with three subject groups and wonder if I can use groupyr.LogisticSGL in this case. I couldn't find any documentation if groupyr.LogisticSGL only supports binary classification. It runs through, so I guess multiclass classification is supported? And if yes, how would I interpret the (n_features,) coef_ attribute? Is groupyr.LogisticSGL implicitly running a one-vs-rest classification in the background?
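
A quick way to check what the fitted model saw (a sketch; classes_ is assumed per the usual sklearn classifier convention):

from groupyr import LogisticSGL

# X, y (three classes), and groups as in your dataset.
clf = LogisticSGL(groups=groups).fit(X, y)
print(clf.classes_)     # the class labels the model registered
print(clf.coef_.shape)  # (n_features,) would suggest a binary-style fit

If coef_ stays one-dimensional with three classes, the model is most likely not running one-vs-rest internally, which would also be consistent with the predict_proba issue reported below.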

predict_proba method returns array of size (n_samples, 2) regardless of the number of classes

Describe the bug

Using the predict_proba method produces an array inconsistent with the number of classes: it always has shape (n_samples, 2).

Steps/Code to Reproduce

from groupyr.datasets import make_group_classification
from groupyr import LogisticSGL
from sklearn.model_selection import train_test_split

X, y, groups = make_group_classification(n_samples=500, n_classes=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

model = LogisticSGL(groups=groups)
model.fit(X_train, y_train)
model.predict_proba(X_test)

Expected Results

I expected an array of shape (50, 8) but got one of shape (50, 2).

Actual Results

array([[0.00000000e+00, 1.00000000e+00],
[5.68755714e-01, 4.31244286e-01],
[2.83550960e-13, 1.00000000e+00],
[8.64863736e-13, 1.00000000e+00],
[4.52056170e-11, 1.00000000e+00],
[0.00000000e+00, 1.00000000e+00],
[2.44249065e-15, 1.00000000e+00],
[1.98451813e-06, 9.99998015e-01],
[4.29922467e-05, 9.99957008e-01],
[0.00000000e+00, 1.00000000e+00],
[1.96334715e-07, 9.99999804e-01],
[7.24888152e-06, 9.99992751e-01],
[9.99999571e-01, 4.29150334e-07],
[7.86086434e-04, 9.99213914e-01],
[2.69039440e-01, 7.30960560e-01],
[9.99999302e-01, 6.97585749e-07],
[8.18352053e-11, 1.00000000e+00],
[3.08642001e-12, 1.00000000e+00],
[8.88178420e-16, 1.00000000e+00],
[6.08360810e-01, 3.91639190e-01],
[1.30118138e-13, 1.00000000e+00],
[6.21724894e-15, 1.00000000e+00],
[1.00000000e+00, 2.93300056e-12],
[4.54094465e-08, 9.99999955e-01],
[4.25080449e-04, 9.99574920e-01],
[8.67024220e-01, 1.32975780e-01],
[9.99974998e-01, 2.50021319e-05],
[7.29032390e-11, 1.00000000e+00],
[3.45561157e-09, 9.99999997e-01],
[2.72104561e-11, 1.00000000e+00],
[0.00000000e+00, 1.00000000e+00],
[4.69719716e-08, 9.99999953e-01],
[9.85393106e-07, 9.99999015e-01],
[0.00000000e+00, 1.00000000e+00],
[6.70574707e-14, 1.00000000e+00],
[3.20425858e-07, 9.99999680e-01],
[9.99990235e-01, 9.76457155e-06],
[2.53925407e-02, 9.74607459e-01],
[9.96085916e-01, 3.91408362e-03],
[5.13633580e-12, 1.00000000e+00],
[9.99992114e-01, 7.88617592e-06],
[4.44089210e-16, 1.00000000e+00],
[1.47470647e-03, 9.98525294e-01],
[4.82376618e-02, 9.51762338e-01],
[1.15463195e-14, 1.00000000e+00],
[1.34734691e-06, 9.99998653e-01],
[8.27309332e-11, 1.00000000e+00],
[1.10023102e-12, 1.00000000e+00],
[1.33504319e-11, 1.00000000e+00],
[6.41155066e-08, 9.99999936e-01]])

Versions

import groupyr as gpr
print(gpr.__version__)

0.2.7

SGLCV fails with too few observations for CV

Describe the bug

When there are too few observations for the CV, SGLCV fails with an uninformative UnboundLocalError. This happens with groupyr 0.2.6; if I recall correctly, I didn't have this problem with 0.2.4.

Steps/Code to Reproduce

import numpy as np
import groupyr as gr

y = np.array([8.35686197e-01, 7.79143707e-01, 9.68885893e-01, 6.00364059e-01,
              8.90818433e-01, 4.50071502e-01, 5.50324868e-04, 3.23702083e-01,
              3.26413651e-01])
X = np.array([[0.95834536, 0.24640152, 0.91383425, 0.36952137],
              [0.18028435, 0.34682591, 0.43773007, 0.7074315],
              [0.54305304, 0.55150522, 0.03017366, 0.07321698],
              [0.49662785, 0.17114838, 0.61342598, 0.15094963],
              [0.66625233, 0.38015984, 0.51422898, 0.66124242],
              [0.95193769, 0.10298654, 0.03773045, 0.21904723],
              [0.34889582, 0.04983091, 0.13862843, 0.23390294],
              [0.05570983, 0.65507907, 0.74365214, 0.99539654],
              [0.01563651, 0.75173544, 0.56747472, 0.31385082]])
l1_ratio = 0.0008299164840661392
groups = [np.array([0, 1]), np.array([2, 3])]

model = gr.SGLCV(
    l1_ratio=l1_ratio,
    groups=groups,
    scale_l2_by="group_length",
    cv=5,
    random_state=1234,
).fit(X=X, y=y)

Expected Results

A clear error message explaining why it didn't work.

Actual Results

/path/to/lib/python3.8/site-packages/sklearn/metrics/_regression.py:796: UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.
  warnings.warn(msg, UndefinedMetricWarning)
[the UndefinedMetricWarning is repeated several times]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/path/to/lib/python3.8/site-packages/groupyr/sgl.py", line 1120, in fit
    self.l1_ratio_ = best_l1_ratio
UnboundLocalError: local variable 'best_l1_ratio' referenced before assignment

However, if I use fewer folds, it works:

model = gr.SGLCV(
    l1_ratio=l1_ratio,
    groups=groups,
    scale_l2_by="group_length",
    cv=3,
    random_state=1234,
).fit(X=X, y=y)

Comment

I think the error occurs because one fold contains only 1 observation, which leads to an ill-defined R^2 metric and, later on, to uncaught errors in the groupyr code. I'm not well versed in scikit-learn, so I don't know whether a fix belongs in scikit-learn or in groupyr. Either way, it would be nice to get an informative error message instead of an error from groupyr internals; a sketch of such a check follows.
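
One possible shape for that check, as a sketch (not current groupyr behavior):

from sklearn.model_selection import check_cv

def check_fold_sizes(cv, X, y, min_test_size=2):
    # Fail early with a clear message instead of an UnboundLocalError.
    for train_idx, test_idx in check_cv(cv).split(X, y):
        if len(test_idx) < min_test_size:
            raise ValueError(
                f"A CV test fold has only {len(test_idx)} observation(s); "
                "R^2-based model selection needs at least 2 per fold. "
                "Use fewer folds or provide more samples."
            )

With the nine observations above, check_fold_sizes(5, X, y) raises, while check_fold_sizes(3, X, y) passes.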

Versions

groupyr 0.2.6
scikit-learn 1.0.2
scikit-optimize 0.9.0

[JOSS review] Installing groupyr installs unnecessary dependencies

Describe the bug

Hi, as part of the JOSS review I'll open a few issues. If you'd prefer, I can compress them all into one issue instead.

Steps/Code to Reproduce

Installing the latest groupyr version on PyPI also installs ipywidgets (and its large list of dependencies). As far as I can tell, it isn't required by the library (searching for ipywidgets in the source code doesn't yield any results).

Expected Results

Only dependencies required for groupyr functionality to be installed.

Versions

0.2.0

Migrate transformers from AFQ-Insight

AFQ-Insight has a few functions and transformer classes to select, remove, or manipulate individual groups or subsets of groups in a feature matrix. There is nothing tractometry-specific about this, so it should live in groupyr.

We can capture all of the functionality in afqinsight.transform by adding the following transformer classes:

  • GroupExtractor: to select and return certain groups from the feature matrix
  • GroupRemover: to remove certain groups from the feature matrix
  • GroupShuffler: to shuffle certain groups within a feature matrix without touching the other groups
  • GroupConcatenator: to concatenate feature matrices with different groups

@arokem, @mnarayan: your thoughts?
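
A sketch of how two of these might be used (API names as proposed above; assumes X and groups are defined):

from groupyr.transform import GroupExtractor, GroupRemover

# Keep only the first two groups, or drop group 0 entirely.
X_first_two = GroupExtractor(select=[0, 1], groups=groups).fit_transform(X)
X_without_0 = GroupRemover(select=0, groups=groups).fit_transform(X)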

All transformers and PCA classes should provide a get_feature_names method

Describe the workflow you want to enable

Many sklearn transformers provide a get_feature_names() method, and things like FeatureUnion rely on those underlying methods to generate downstream feature names. All of the groupyr transformers should provide this method.

Describe your proposed solution

Many of the transformers already compute something like a feature_names_out_ attribute. We should simply add a get_feature_names() method that returns it.
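
A minimal sketch of that addition (assumes the transformer sets feature_names_out_ during fit, as described above):

class GetFeatureNamesMixin:
    def get_feature_names(self):
        # Mirror the sklearn convention of returning output feature names.
        return list(self.feature_names_out_)

Each groupyr transformer could then inherit from this mixin.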

Installing scikit-learn has changed: it now has to be scikit-learn, not sklearn

Hi! Your package is part of a .yml file that I use to create a conda environment. I am not 100% sure, but I think groupyr is responsible for the error I get:

[screenshot: Screenshot_20230929_155025]

Installing scikit-learn via pip has recently changed from pip install sklearn to pip install scikit-learn, so I think your requirements file needs to be updated?

Deprecation warning

When using groupyr.SGLCV I get:

/home/johannes.wiesner/.conda/envs/csp_wiesner_johannes/lib/python3.9/site-packages/copt/utils.py:41: DeprecationWarning:

Please use MemoizeJac from the scipy.optimize namespace, the scipy.optimize.optimize namespace is deprecated.
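
Until copt updates its scipy import (the real fix belongs upstream in copt, not in groupyr), a user-side sketch to silence just this warning:

import warnings

# Match any DeprecationWarning mentioning MemoizeJac, regardless of
# exact message formatting.
warnings.filterwarnings(
    "ignore", message=".*MemoizeJac.*", category=DeprecationWarning
)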

Expose scoring to SGLCV and LogisticSGLCV

Describe the workflow you want to enable

sgl_scoring_path and logistic_sgl_scoring_path have a scoring parameter that lets the user specify an alternative scoring metric, but SGLCV and LogisticSGLCV do not expose this parameter. We should fix that; the desired API is sketched below.
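
A sketch of the desired API (the scoring kwarg on SGLCV does not exist yet; the name mirrors the existing parameter on sgl_scoring_path):

from groupyr import SGLCV

# X, y, and groups assumed to be defined.
model = SGLCV(groups=groups, scoring="neg_median_absolute_error", cv=3)
model.fit(X, y)  # model selection would use the supplied scorer, not MSE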

Optimizing over a matrix of parameters.

For the least-squares problem $\operatorname{argmin}_B \|F - BX\|_2^2 + \lambda \sum_l \|B_l\|_2$, where $F, X \in \mathbb{R}^{n \times k}$, $B \in \mathbb{R}^{n \times n}$, and $B_l \in \mathbb{R}^{n \times n_l}$, I used to break it into $n$ separate equations. If $n$ is large, that takes too much time to solve. It would be better if the framework supported optimizing over the matrix $B$ at once. If it is doable based on the current work, I am happy to contribute, but is there something I need to know before doing it?
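
For reference, a sketch of the row-by-row workaround described above (each row of $B$ is fit independently, so the group norms are applied per row; l1_ratio=0.0 gives the pure group-lasso penalty):

import numpy as np
from groupyr import SGL

def rowwise_sgl(F, X, groups, lam):
    # F is approximately B @ X with F, X of shape (n, k), so row i of B
    # solves a regression of F[i] (length k) on the design matrix X.T (k x n).
    B = np.empty((F.shape[0], X.shape[0]))
    for i in range(F.shape[0]):
        model = SGL(groups=groups, alpha=lam, l1_ratio=0.0)
        B[i] = model.fit(X.T, F[i]).coef_
    return B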

Add sklearn>=0.24.0 support

Once this PR is merged, we will be able to loosen the dependency requirements to allow sklearn>=0.24.0 and scipy>=1.6.0.

This issue serves as a reminder to do that and to update tox.ini accordingly.
