yu-group / veridical-flow

Making it easier to build stable, trustworthy data-science pipelines based on the PCS framework.

Home Page: https://vflow.csinva.io

License: MIT License

Languages: Python 22.53%, Jupyter Notebook 77.38%, Makefile 0.09%

Topics: ai, data-science, ensembling, machine-learning, ml, pandas, preprocessing, python3, stability, statistics, tutorial, workflow

veridical-flow's People

Contributors: aagarwal1996, csinva, danielskatz, jbytecode, jpdunc23, kmichael08, matthewfeickert, rushk014, ssaxena00


veridical-flow's Issues

build_vset not building vfuncs properly

A call to build_vset fails to set the vfuncs attribute of the resulting Vset correctly: it does not combine the parameters in param_dict into the full set of combinations.

from sklearn.ensemble import RandomForestRegressor
from vflow import build_vset

param_dict = {
    'n_estimators': [100, 200, 300],
    'min_samples_split': [2, 10],  # default value comes first
    'max_features': ['sqrt', 'log2']
}
rf_set = build_vset('RF', RandomForestRegressor, param_dict, criterion='absolute_error')
assert len(rf_set.modules) == 3*2*2

This raises:

AssertionError                            Traceback (most recent call last)
Input In [2], in <module>
      1 param_dict = {
      2     'n_estimators': [100, 200, 300],
      3     'min_samples_split': [2, 10],  # default value comes first
      4     'max_features': ['sqrt', 'log2']
      5 }
      6 rf_set = build_vset('RF', RandomForestRegressor, param_dict, criterion = 'absolute_error')
----> 7 assert len(rf_set.modules) == 3*2*2
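
For reference, the expected number of vfuncs is the Cartesian product of the parameter lists. A minimal sketch of that enumeration with plain itertools (not vflow internals):

from itertools import product

param_dict = {
    'n_estimators': [100, 200, 300],
    'min_samples_split': [2, 10],
    'max_features': ['sqrt', 'log2']
}

# all 3 * 2 * 2 = 12 parameter combinations that build_vset should produce
combos = [dict(zip(param_dict, values)) for values in product(*param_dict.values())]
assert len(combos) == 12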

Pass list of Callables and list of param_dicts to build_vset

Hello again.

I'm having trouble passing a list of Callables and a list of param_dicts to build_vset. The following error occurs: obj must be callable.
I'm passing a list of sklearn models and a corresponding list of parameter dicts. According to the documentation, this should work. A sketch of what I'm attempting follows.
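
Roughly what I'm attempting (the models and parameter grids here are illustrative, not my exact code):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from vflow import build_vset

# illustrative list of sklearn models and a corresponding list of param dicts
models = [LogisticRegression, DecisionTreeClassifier]
param_dicts = [
    {'C': [0.1, 1.0]},
    {'max_depth': [3, 5]}
]

# per the documentation this should build one Vset over all combinations,
# but instead raises: obj must be callable
model_set = build_vset('models', models, param_dicts)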

[JOSS Review] Example Notebooks

(as part of: openjournals/joss-reviews#3895)

I noticed some issues with the example notebooks that I summarized here:

  • In general, it would be very beneficial to have a bit more explanation in the example notebooks so that readers can better understand the cool features and advantages of VeridicalFlow.

  • The 00_synthetic_classification.ipynb notebook throws the following exception when executed: module 'sklearn' has no attribute 'datasets'. My suggestion would be to add the line import sklearn.datasets. Furthermore, the function sklearn.datasets.load_boston is deprecated:

    'load_boston' is deprecated in 1.0 and will be removed in 1.2.  
    

    Maybe you want to consider using another dataset in this example to keep it working with future versions?

  • Another general improvement suggestion would be to directly include the example notebooks in the documentation instead of linking to the notebooks in the GitHub repo.

    Your current solution implies that the example notebooks always have to be fully executed before committing to the repository, so that the output is included in the notebook. However, this has some drawbacks:

    • In general, it's discouraged to commit notebooks with their output since every execution would change the notebook file
    • And, of course, you would always need to remember to actually re-run the notebooks before committing.

    For building docs with Sphinx, this could be done using nbsphinx. To be honest, I don't know how it works with pdoc, but I'm sure similar solutions exist.
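
    For example, a minimal conf.py sketch for an nbsphinx build (assuming nbsphinx and pandoc are installed; the option value is illustrative):

    # conf.py (sketch)
    extensions = [
        'nbsphinx',
    ]

    # execute notebooks at doc-build time instead of committing their outputs
    nbsphinx_execute = 'always'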

[JOSS Review] Code

(as part of: openjournals/joss-reviews#3895)

When looking at the code, there are some points you might want to address that would make using the package even better:

  • For some functions and classes, docstrings are missing.
  • In general, the docstring formatting is inconsistent: some docstrings start with lowercase letters, some with capital letters; some have a short summary at the beginning, some don't. (One consistent format is sketched after this list.)
  • To easily detect those issues and to further improve code quality, I would recommend using a code analysis tool such as prospector, which includes pylint for linting and pep8 for checking against style conventions. This could easily be integrated into your existing workflow and also works well with GitHub Actions. It would also make it easier for contributors to get up to speed quickly with VeridicalFlow.
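
For illustration, one consistent docstring format (numpydoc style, shown on a hypothetical function):

def combine_outputs(d1, d2):
    """Combine two output dictionaries into one.

    Parameters
    ----------
    d1, d2 : dict
        Output dictionaries keyed by Subkey tuples.

    Returns
    -------
    dict
        The combined output dictionary.
    """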

prediction_uncertainty breaks subkey matching

Currently, prediction_uncertainty uses dict_to_df to perform aggregation and then recreates Subkeys.
As a result, the recreated Subkeys lose their prior _output_matching and _sep_dicts_id information, which can break pipelines that rely on mean_dict/std_dict in later stages.

Reproducible Example

import numpy as np
import sklearn.datasets
import sklearn.utils
from functools import partial
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from vflow import Vset, init_args, PREV_KEY  # PREV_KEY location may vary by vflow version

X, y = sklearn.datasets.make_classification(n_samples=100, n_features=5)
X_train, X_test, y_train, y_test = init_args(train_test_split(X, y), names=['xtr', 'xte', 'ytr', 'yte'])

subsampling_funcs = [partial(sklearn.utils.resample, n_samples=80, random_state=i) for i in range(5)]
subsampling_set = Vset(name='subsample', modules=subsampling_funcs, output_matching=True)
X_trains, y_trains = subsampling_set(X_train, y_train)

subsampling_set_test = Vset(name='subsample_test', modules=subsampling_funcs, output_matching=True)
X_tests, y_tests = subsampling_set_test(X_test, y_test)

models = [LogisticRegression(max_iter=1000, tol=0.1), DecisionTreeClassifier()]
modeling_set = Vset(name='model', modules=models, module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)

# metrics Vset; this definition is assumed, matching the Acc/Bal_Acc keys in the output below
binary_metrics_set = Vset(name='binary_metrics',
                          modules=[accuracy_score, balanced_accuracy_score],
                          module_keys=['Acc', 'Bal_Acc'])

# round mean predictions over test-set subsamples to binary labels
mean_dict, std_dict, pred_stats_df = modeling_set.predict(X_tests, with_uncertainty=True, group_by=['subsample_test'])
mean_dict = {k: np.round(v) if k != PREV_KEY else v for k, v in mean_dict.items()}

failed_metrics = binary_metrics_set.evaluate(mean_dict, y_tests)
failed_metrics

failed_metrics should have matched mean_dict keys of the form (subsample_test_0,) with y_tests keys of the form (yte, subsample_test_0), but cannot, because the recreated Subkeys have different _sep_dicts_id values.

Expected Output

{(subsample_test_0, yte, Acc): 0.925,
 (subsample_test_1, yte, Acc): 0.9125,
 (subsample_test_2, yte, Acc): 0.8875,
 (subsample_test_3, yte, Acc): 0.9,
 (subsample_test_4, yte, Acc): 0.9,
 (subsample_test_0, yte, Bal_Acc): 0.9268292682926829,
 (subsample_test_1, yte, Bal_Acc): 0.9,
 (subsample_test_2, yte, Bal_Acc): 0.8902439024390244,
 (subsample_test_3, yte, Bal_Acc): 0.9024390243902439,
 (subsample_test_4, yte, Bal_Acc): 0.9069767441860466, ...}

Actual Output

{(subsample_test_0, yte, subsample_test_0, Acc): 0.9125,
 (subsample_test_0, yte, subsample_test_1, Acc): 0.5875,
 (subsample_test_0, yte, subsample_test_2, Acc): 0.4875,
 (subsample_test_0, yte, subsample_test_3, Acc): 0.475,
 (subsample_test_0, yte, subsample_test_4, Acc): 0.6,
 (subsample_test_1, yte, subsample_test_0, Acc): 0.625,
 (subsample_test_1, yte, subsample_test_1, Acc): 0.925,
 (subsample_test_1, yte, subsample_test_2, Acc): 0.525,
 (subsample_test_1, yte, subsample_test_3, Acc): 0.4375,
 (subsample_test_1, yte, subsample_test_4, Acc): 0.5375, ... }

perturbation_stats unclear on mismatched Subkeys

It is unclear how perturbation_stats should handle multiple Subkeys with the same origin (and thus the same column name in df).
Currently, attempting to group on a duplicated column throws ValueError: Grouper for 'subsample' not 1-dimensional.
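
The underlying pandas behavior can be reproduced in isolation (a minimal sketch, independent of vflow):

import pandas as pd

# two columns share the name 'subsample', so the grouper is ambiguous
df = pd.DataFrame([[0, 1, 0.9], [1, 0, 0.8]],
                  columns=['subsample', 'subsample', 'acc'])
df.groupby('subsample')  # ValueError: Grouper for 'subsample' not 1-dimensional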

An illustrative example of this issue arises if we take the exact example pipeline from #35 but use a single subsample Vset with output_matching=False (so the X_trains/X_tests keys will match properly) instead of two separate Vsets. If we then want to predict with uncertainty over subsamples, it is unclear what that means. I see two possible approaches:

  • My initial thought was that we could implement a way to distinguish identical mismatched Subkeys (perhaps by appending -i)
  • Alternatively (or additionally), we could support multidimensional grouping in perturbation_stats

Illustrative Example

import numpy as np
import sklearn.datasets
import sklearn.utils
from functools import partial
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from vflow import Vset, init_args, PREV_KEY  # PREV_KEY location may vary by vflow version

X, y = sklearn.datasets.make_classification(n_samples=100, n_features=5)
X_train, X_test, y_train, y_test = init_args(train_test_split(X, y), names=['xtr', 'xte', 'ytr', 'yte'])

subsampling_funcs = [partial(sklearn.utils.resample, n_samples=80, random_state=i) for i in range(5)]
subsampling_set = Vset(name='subsample', modules=subsampling_funcs)
X_trains, y_trains = subsampling_set(X_train, y_train)
X_tests, y_tests = subsampling_set(X_test, y_test)

models = [LogisticRegression(max_iter=1000, tol=0.1), DecisionTreeClassifier()]
modeling_set = Vset(name='model', modules=models, module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)

# round mean predictions over test-set subsamples to binary labels
mean_dict, std_dict, pred_stats_df = modeling_set.predict(X_tests, with_uncertainty=True, group_by=['subsample'])
mean_dict = {k: np.round(v) if k != PREV_KEY else v for k, v in mean_dict.items()}

Test xfails

As far as I understand xfail, it is meant to be a temporary flag for known failures rather than a permanent test of a failing case.
Could you explain why you are using it here?

# this test is expected to fail because combine_dicts() makes the
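
For context, here is how I understand the two usages (a sketch; the test names are hypothetical):

import pytest

# temporary flag: a known bug that should pass once fixed, then the marker is removed
@pytest.mark.xfail(reason="known combine_dicts() bug")
def test_known_bug():
    ...

# testing a failing case: assert the failure explicitly rather than marking it xfail
def test_failure_is_raised():
    with pytest.raises(TypeError):
        len(None)  # an operation expected to fail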

Error when calling `fit_transform` for `Vset` with `is_async=True`

In the example below, when using a Vset with is_async=True, the transform method expects to get a ray.ObjectRef and call ray.get on it, but instead gets a plain array:

from vflow import build_vset, init_args

import numpy as np

from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

import ray

ray.init(num_cpus=4)

X, y = make_regression(n_samples=1000, n_features=100, n_informative=1)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)

X_train, y_train = init_args([X_train, y_train], names=['X_train', 'y_train'])
X_val, y_val = init_args([X_val, y_val], names=['X_val', 'y_val'])

# create a Vset for bootstrapping from data 10 times
# we use lazy=True so that the data will not be resampled until needed
boot_set = build_vset('boot', resample, reps=10, lazy=True)

# bootstrap from training data by calling boot_set
X_trains, y_trains = boot_set(X_train, y_train)

# hyperparameters to try
pca_params = {
    'n_components': [10, 20, 50],
    'svd_solver': ['randomized', 'full', 'auto']
}

# we could instead pass a list of distinct models and corresponding param dicts
pca_set = build_vset('PCA', PCA, pca_params, is_async=True)

X_trains_pca = pca_set.fit_transform(X_trains)

This raises:

TypeError: Attempting to call `get` on the value [[-0.73763296 -1.64044139 -0.74793088 ... -0.1085027  -0.25652127
   0.11583096]
...

See #50 for a possible workaround until this is fixed.
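
Until then, the failure suggests that transform should only call ray.get on genuine object references. A minimal sketch of such a guard (a hypothetical helper, not necessarily the workaround from #50):

import ray

def maybe_get(value):
    # only dereference genuine Ray object references; pass plain values through
    if isinstance(value, ray.ObjectRef):
        return ray.get(value)
    return value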
