yu-group / veridical-flow

Making it easier to build stable, trustworthy data-science pipelines based on the PCS framework.

Home Page: https://vflow.csinva.io

License: MIT License

Languages: Python 22.53%, Jupyter Notebook 77.38%, Makefile 0.09%

Topics: ai, data-science, ensembling, machine-learning, ml, pandas, preprocessing, python3, stability, statistics, tutorial, workflow

veridical-flow's People

Contributors: aagarwal1996, csinva, danielskatz, jbytecode, jpdunc23, kmichael08, matthewfeickert, rushk014, ssaxena00


veridical-flow's Issues

build_vset not building vfuncs properly

A call to build_vset fails to set the vfuncs attribute of the resulting Vset correctly: it does not combine the parameters in param_dict into the full set of combinations.

from sklearn.ensemble import RandomForestRegressor
from vflow import build_vset

param_dict = {
    'n_estimators': [100, 200, 300],
    'min_samples_split': [2, 10],  # default value comes first
    'max_features': ['sqrt', 'log2']
}
rf_set = build_vset('RF', RandomForestRegressor, param_dict, criterion='absolute_error')
assert len(rf_set.modules) == 3*2*2

This raises:

AssertionError                            Traceback (most recent call last)
Input In [2], in <module>
      1 param_dict = {
      2     'n_estimators': [100, 200, 300],
      3     'min_samples_split': [2, 10],  # default value comes first
      4     'max_features': ['sqrt', 'log2']
      5 }
      6 rf_set = build_vset('RF', RandomForestRegressor, param_dict, criterion = 'absolute_error')
----> 7 assert len(rf_set.modules) == 3*2*2
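
For reference, the expected number of vfuncs is the Cartesian product of the parameter lists. A minimal sketch of that enumeration with plain itertools (not vflow internals):

from itertools import product

param_dict = {
    'n_estimators': [100, 200, 300],
    'min_samples_split': [2, 10],
    'max_features': ['sqrt', 'log2']
}

# all 3 * 2 * 2 = 12 parameter combinations that build_vset should produce
combos = [dict(zip(param_dict, values)) for values in product(*param_dict.values())]
assert len(combos) == 12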

Pass list of Callables and list of param_dicts to build_vset

Hello again.

I'm having trouble passing a list of Callables and a list of param_dicts to build_vset. The following error occurs: obj must be callable.
I'm passing a list of sklearn models and a corresponding list of parameter dicts. According to the documentation, this should work. A sketch of what I'm attempting follows.
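
Roughly what I'm attempting (the models and parameter grids here are illustrative, not my exact code):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from vflow import build_vset

# illustrative list of sklearn models and a corresponding list of param dicts
models = [LogisticRegression, DecisionTreeClassifier]
param_dicts = [
    {'C': [0.1, 1.0]},
    {'max_depth': [3, 5]}
]

# per the documentation this should build one Vset over all combinations,
# but instead raises: obj must be callable
model_set = build_vset('models', models, param_dicts)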

[JOSS Review] Example Notebooks

(as part of: openjournals/joss-reviews#3895)

I noticed some issues with the example notebooks that I summarized here:

  • In general, it would be very beneficial to have a bit more explanation in the example notebooks so that readers can better understand the cool features and advantages of VeridicalFlow.

  • The 00_synthetic_classification.ipynb notebook throws the following exception when executed: module 'sklearn' has no attribute 'datasets'. My suggestion would be to add the line import sklearn.datasets. Furthermore, the function sklearn.datasets.load_boston is deprecated:

    'load_boston' is deprecated in 1.0 and will be removed in 1.2.  
    

    Maybe you want to consider using another dataset in this example to keep it working with future versions?

  • Another general improvement suggestion would be to directly include the example notebooks in the documentation instead of linking to the notebooks in the GitHub repo.

    Your current solution implies that the example notebooks always have to be fully executed before committing to the repository, so that the output is included in the notebook. However, this has some drawbacks:

    • In general, it's discouraged to commit notebooks with their output since every execution would change the notebook file
    • And, of course, you would always need to remember to actually re-run the notebooks before committing.

    For building docs with Sphinx, this could be done using nbsphinx. To be honest, I don't know how it works with pdoc, but I'm sure similar solutions exist.
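
    For example, a minimal conf.py sketch for an nbsphinx build (assuming nbsphinx and pandoc are installed; the option value is illustrative):

    # conf.py (sketch)
    extensions = [
        'nbsphinx',
    ]

    # execute notebooks at doc-build time instead of committing their outputs
    nbsphinx_execute = 'always'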

[JOSS Review] Code

(as part of: openjournals/joss-reviews#3895)

When looking at the code, there are some points you might want to address that would make using the package even better:

  • For some functions and classes, docstrings are missing.
  • In general, the docstring formatting is inconsistent: some docstrings start with lowercase letters, some with capital letters; some have a short summary at the beginning, some don't. (One consistent format is sketched after this list.)
  • To easily detect those issues and to further improve code quality, I would recommend using a code analysis tool such as prospector, which includes pylint for linting and pep8 for checking against style conventions. This could easily be integrated into your existing workflow and also works well with GitHub Actions. It would also make it easier for contributors to get up to speed quickly with VeridicalFlow.
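
For illustration, one consistent docstring format (numpydoc style, shown on a hypothetical function):

def combine_outputs(d1, d2):
    """Combine two output dictionaries into one.

    Parameters
    ----------
    d1, d2 : dict
        Output dictionaries keyed by Subkey tuples.

    Returns
    -------
    dict
        The combined output dictionary.
    """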

prediction_uncertainty breaks subkey matching

Currently, prediction_uncertainty uses dict_to_df to perform aggregation and then recreates Subkeys.
As a result, the recreated Subkeys lose their prior _output_matching and _sep_dicts_id information, which can break pipelines that rely on mean_dict/std_dict in later stages.

Reproducible Example

import numpy as np
import sklearn.datasets
import sklearn.utils
from functools import partial
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from vflow import Vset, init_args, PREV_KEY  # PREV_KEY location may vary by vflow version

X, y = sklearn.datasets.make_classification(n_samples=100, n_features=5)
X_train, X_test, y_train, y_test = init_args(train_test_split(X, y), names=['xtr', 'xte', 'ytr', 'yte'])

subsampling_funcs = [partial(sklearn.utils.resample, n_samples=80, random_state=i) for i in range(5)]
subsampling_set = Vset(name='subsample', modules=subsampling_funcs, output_matching=True)
X_trains, y_trains = subsampling_set(X_train, y_train)

subsampling_set_test = Vset(name='subsample_test', modules=subsampling_funcs, output_matching=True)
X_tests, y_tests = subsampling_set_test(X_test, y_test)

models = [LogisticRegression(max_iter=1000, tol=0.1), DecisionTreeClassifier()]
modeling_set = Vset(name='model', modules=models, module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)

# metrics Vset; this definition is assumed, matching the Acc/Bal_Acc keys in the output below
binary_metrics_set = Vset(name='binary_metrics',
                          modules=[accuracy_score, balanced_accuracy_score],
                          module_keys=['Acc', 'Bal_Acc'])

# round mean predictions over test-set subsamples to binary labels
mean_dict, std_dict, pred_stats_df = modeling_set.predict(X_tests, with_uncertainty=True, group_by=['subsample_test'])
mean_dict = {k: np.round(v) if k != PREV_KEY else v for k, v in mean_dict.items()}

failed_metrics = binary_metrics_set.evaluate(mean_dict, y_tests)
failed_metrics

failed_metrics should have matched mean_dict keys of the form (subsample_test_0,) with y_tests keys of the form (yte, subsample_test_0), but cannot, because the recreated Subkeys have different _sep_dicts_id values.

Expected Output

{(subsample_test_0, yte, Acc): 0.925,
 (subsample_test_1, yte, Acc): 0.9125,
 (subsample_test_2, yte, Acc): 0.8875,
 (subsample_test_3, yte, Acc): 0.9,
 (subsample_test_4, yte, Acc): 0.9,
 (subsample_test_0, yte, Bal_Acc): 0.9268292682926829,
 (subsample_test_1, yte, Bal_Acc): 0.9,
 (subsample_test_2, yte, Bal_Acc): 0.8902439024390244,
 (subsample_test_3, yte, Bal_Acc): 0.9024390243902439,
 (subsample_test_4, yte, Bal_Acc): 0.9069767441860466, ...}

Actual Output

{(subsample_test_0, yte, subsample_test_0, Acc): 0.9125,
 (subsample_test_0, yte, subsample_test_1, Acc): 0.5875,
 (subsample_test_0, yte, subsample_test_2, Acc): 0.4875,
 (subsample_test_0, yte, subsample_test_3, Acc): 0.475,
 (subsample_test_0, yte, subsample_test_4, Acc): 0.6,
 (subsample_test_1, yte, subsample_test_0, Acc): 0.625,
 (subsample_test_1, yte, subsample_test_1, Acc): 0.925,
 (subsample_test_1, yte, subsample_test_2, Acc): 0.525,
 (subsample_test_1, yte, subsample_test_3, Acc): 0.4375,
 (subsample_test_1, yte, subsample_test_4, Acc): 0.5375, ... }

perturbation_stats unclear on mismatched Subkeys

It is unclear how perturbation_stats should handle multiple Subkeys with the same origin (and thus the same column name in df).
Currently, attempting to group on a duplicated column throws ValueError: Grouper for 'subsample' not 1-dimensional.
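
The underlying pandas behavior can be reproduced in isolation (a minimal sketch, independent of vflow):

import pandas as pd

# two columns share the name 'subsample', so the grouper is ambiguous
df = pd.DataFrame([[0, 1, 0.9], [1, 0, 0.8]],
                  columns=['subsample', 'subsample', 'acc'])
df.groupby('subsample')  # ValueError: Grouper for 'subsample' not 1-dimensional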

An illustrative example of this issue arises if we take the exact example pipeline from #35 but use a single subsample Vset with output_matching=False (so the X_trains/X_tests keys will match properly) instead of two separate Vsets. If we then want to predict with uncertainty over subsamples, it is unclear what that means. I see two possible approaches:

  • My initial thought was that we could implement a way to distinguish identical mismatched Subkeys (perhaps by appending -i)
  • Alternatively (or additionally), we could support multidimensional grouping in perturbation_stats

Illustrative Example

import numpy as np
import sklearn.datasets
import sklearn.utils
from functools import partial
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from vflow import Vset, init_args, PREV_KEY  # PREV_KEY location may vary by vflow version

X, y = sklearn.datasets.make_classification(n_samples=100, n_features=5)
X_train, X_test, y_train, y_test = init_args(train_test_split(X, y), names=['xtr', 'xte', 'ytr', 'yte'])

subsampling_funcs = [partial(sklearn.utils.resample, n_samples=80, random_state=i) for i in range(5)]
subsampling_set = Vset(name='subsample', modules=subsampling_funcs)
X_trains, y_trains = subsampling_set(X_train, y_train)
X_tests, y_tests = subsampling_set(X_test, y_test)

models = [LogisticRegression(max_iter=1000, tol=0.1), DecisionTreeClassifier()]
modeling_set = Vset(name='model', modules=models, module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)

# round mean predictions over test-set subsamples to binary labels
mean_dict, std_dict, pred_stats_df = modeling_set.predict(X_tests, with_uncertainty=True, group_by=['subsample'])
mean_dict = {k: np.round(v) if k != PREV_KEY else v for k, v in mean_dict.items()}

Test xfails

As far as I understand xfail, it is meant to be a temporary flag for known failures rather than a permanent test of a failing case.
Could you explain why you are using it here?

# this test is expected to fail because combine_dicts() makes the
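
For context, here is how I understand the two usages (a sketch; the test names are hypothetical):

import pytest

# temporary flag: a known bug that should pass once fixed, then the marker is removed
@pytest.mark.xfail(reason="known combine_dicts() bug")
def test_known_bug():
    ...

# testing a failing case: assert the failure explicitly rather than marking it xfail
def test_failure_is_raised():
    with pytest.raises(TypeError):
        len(None)  # an operation expected to fail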

Error when calling `fit_transform` for `Vset` with `is_async=True`

In the example below, when using a Vset with is_async=True, the transform method expects to get a ray.ObjectRef and call ray.get on it, but instead gets a plain array:

from vflow import build_vset, init_args

import numpy as np

from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

import ray

ray.init(num_cpus=4)

X, y = make_regression(n_samples=1000, n_features=100, n_informative=1)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)

X_train, y_train = init_args([X_train, y_train], names=['X_train', 'y_train'])
X_val, y_val = init_args([X_val, y_val], names=['X_val', 'y_val'])

# create a Vset for bootstrapping from data 10 times
# we use lazy=True so that the data will not be resampled until needed
boot_set = build_vset('boot', resample, reps=10, lazy=True)

# bootstrap from training data by calling boot_set
X_trains, y_trains = boot_set(X_train, y_train)

# hyperparameters to try
pca_params = {
    'n_components': [10, 20, 50],
    'svd_solver': ['randomized', 'full', 'auto']
}

# we could instead pass a list of distinct models and corresponding param dicts
pca_set = build_vset('PCA', PCA, pca_params, is_async=True)

X_trains_pca = pca_set.fit_transform(X_trains)

This raises:

TypeError: Attempting to call `get` on the value [[-0.73763296 -1.64044139 -0.74793088 ... -0.1085027  -0.25652127
   0.11583096]
...

See #50 for a possible workaround until this is fixed.
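
Until then, the failure suggests that transform should only call ray.get on genuine object references. A minimal sketch of such a guard (a hypothetical helper, not necessarily the workaround from #50):

import ray

def maybe_get(value):
    # only dereference genuine Ray object references; pass plain values through
    if isinstance(value, ray.ObjectRef):
        return ray.get(value)
    return value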
