neuraxio / neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.

Home Page: https://www.neuraxle.org/

License: Apache License 2.0

Python 99.93% Shell 0.07%

pipeline pipeline-framework machine-learning deep-learning framework python-library hyperparameter-optimization hyperparameter-tuning hyperparameter-search hyperparameters

neuraxle's Issues

Allow Rest API Pipeline Wrapper with Encoders and Decoders to expose fit method if desired.

Note: this is not urgent.

With this we'd need a different encoder and decoder for fit.

Todo: post is fit, get is transform.

Any user could implement one, the other, or both.

Missing features

One missing feature is feature union with processing of feature subsets.

Imagine you have text and numeric columns in the same dataframe...

Checkpoints Should Only Save Data Inputs

Output Transformers will perform a list zip operation to be able to save the transformed expected outputs on disk as well.

class OutputTransformerWrapper(MetaStepMixin, BaseStep):
    def __init__(self, wrapped: BaseStep):
        MetaStepMixin.__init__(self, wrapped)

    def transform(self, data_inputs):
        data_inputs, expected_outputs = data_inputs
        return self.wrapped.transform(list(zip(data_inputs, expected_outputs)))

Resume Pipeline With Fitted Step, And Checkpoint At The Same Time

Not only load the data, but also the fitted steps with checkpoints.
BaseStepCheckpoint
BaseDataCheckpoint

Remove all the `_one` methods?

Could the streaming pipelines or minibatch streaming pipelines just use the regular methods and let the implementer of the step choose how to handle many successive minibatches or single items?

Implement `fit_transform` in the step cloner and in the sklearnwrapper

This might avoid lots of duplicated computations to do them together when appropriate (especially when a cloned or wrapped step is itself a pipeline for example).

Must include steps' own inner source code version when hashing by hyperparameter.

This will also simplify the checkpointing of steps.

Add Neuraxle to Awesome Python

The pull request here needs 20 thumbs up (+1 👍) to be merged. Please leave your thumbs up at the pull request here. If you see that it already has 20 thumbs up, perhaps bump the bot again.

Test test_lognormal in testing/hyperparams/test_distributions.py fails randomly

This test can fail randomly, but not very often:

def test_lognormal():
    hd = LogNormal(0.0, 2.0)

    samples = get_many_samples_for(hd)

    samples_median = np.median(samples)
    assert 0.9 < samples_median < 1.1
    samples_std = np.std(samples)
    assert 5 < samples_std < 8

@guillaume-chevalier This is really strange. Any ideas why that would happen?
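One plausible cause, offered as an assumption rather than a diagnosis: the sample standard deviation of a heavy-tailed log-normal distribution varies a lot from one draw to the next, so an unseeded test with fixed bounds can fail occasionally. The sketch below uses only the standard library and a hypothetical base-2 log-normal sampler (not Neuraxle's `LogNormal`) to show that seeding makes the statistic reproducible.

```python
import random
import statistics


def sample_log2_normal(rng, mean, std, n):
    # Hypothetical sampler: 2 ** Normal(mean, std). This is an assumption
    # made for illustration, not Neuraxle's actual LogNormal implementation.
    return [2.0 ** rng.gauss(mean, std) for _ in range(n)]


def sample_std_for_seed(seed, n=1000):
    # A dedicated, seeded generator makes the draw reproducible.
    rng = random.Random(seed)
    return statistics.pstdev(sample_log2_normal(rng, 0.0, 2.0, n))


# The same seed always gives the same statistic...
assert sample_std_for_seed(42) == sample_std_for_seed(42)

# ...while different seeds spread out, which is what an unseeded test sees.
stds = [sample_std_for_seed(seed) for seed in range(20)]
print(min(stds), max(stds))
```

Seeding the generator inside the test, or asserting on a more robust statistic such as the median, are two possible ways to remove the flakiness.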

Reverse method is not available for StepClonerForEachDataInput

Implement FakeHandle For ResumingPipeline to respect TDA

Todo:

What happens when we have a different number of data inputs than transformed data inputs?

Incomplete Documentation Strings (Docstrings)

Many classes have incomplete docstrings. Sometimes, there are also typos.

The docstrings in classes are used with the sphinx documentation builder to build the website's complete documentation API:
https://www.neuraxle.neuraxio.com/stable/api.html

Random `Uniform` distribution: wrong type annotation in constructor

Replace int with float.

Inverted equality check to be fixed in StepClonerForEachDataInput

Use >= instead of <=.

assert len(data_inputs) <= len(self.steps), "Can't have more data_inputs than cloned steps to process them."

ReversiblePreprocessingWrapper

We'll need something like this:

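class ReversiblePreprocessingWrapper(MetaStepMixin, BaseStep):
    def __init__(
        self,
        wrapped_step: BaseStep,
        reversible_preprocessing_pipeline: BaseStep
    ):
        # Keep both steps so that transform can use them.
        self.wrapped_step = wrapped_step
        self.reversible_preprocessing_pipeline = reversible_preprocessing_pipeline
        # ...

    def transform(self, data_inputs):
        # Preprocess, apply the wrapped step, then undo the preprocessing.
        data_inputs = self.reversible_preprocessing_pipeline.transform(data_inputs)
        data_inputs = self.wrapped_step.transform(data_inputs)
        data_inputs = self.reversible_preprocessing_pipeline.inverse_transform(data_inputs)
        return data_inputs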

Create Joiners for Streaming Pipelines

Note: not for #83. To do later.

Teardown method should be called on pipeline steps exceptions

Never teardown automatically. Wait for a manual teardown or wait for `del` to be called.

Add __del__ to BaseStep, which calls self.teardown().
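A minimal sketch of that idea, assuming a hypothetical `StepWithTeardown` class standing in for BaseStep. The `is_torn_down` flag and the call counter are illustrative additions; they make teardown idempotent so that a manual teardown followed by `__del__` does not release resources twice.

```python
class StepWithTeardown:
    # Hypothetical stand-in for BaseStep; only the teardown logic is shown.
    def __init__(self):
        self.is_torn_down = False
        self.teardown_calls = 0

    def teardown(self):
        # Idempotent: only the first call actually releases resources.
        if not self.is_torn_down:
            self.is_torn_down = True
            self.teardown_calls += 1
        return self

    def __del__(self):
        # Called when the object is about to be destroyed.
        self.teardown()


step = StepWithTeardown()
step.teardown()   # manual teardown
step.__del__()    # what a later `del step` would end up calling
print(step.teardown_calls)  # 1
```

Note that Python does not guarantee when, or even whether, `__del__` runs at interpreter exit, so a manual teardown remains the reliable path.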

StepClonerForEachDataInput crashes on not implemented base step methods i.e inverse transform

FlattenForEach step

We already have a ForEachDataInputs and a CloneStepForEachDataInputs. I think we also need a Flatten3Dto2DForEachDataInputs which is the same concept as a ForEachDataInputs but that reduces a dimension instead of manually looping on it.

Example: instead of looping on the data like for di in data_inputs: self.wrapped.transform(di);, the Flatten3Dto2DForEachDataInputs step does this:

reduced_data_inputs = sum(list(data_inputs), [])  # converts data inputs from 3D to 2D. 
outs = self.wrapped.transform(reduced_data_inputs)

# bear with me: the following is like doing: out.reshape(len(data_inputs), outs.shape[0]/len(data_inputs), *outs.shape[1:]...)
# but we can't call `.shape` nor `.reshape` because the data type might not be a np.array or might not be a list, we want to keep things generic. 
reshaped_outs_to_reaugmented = self._re_augment_data(outs)  

return reshaped_outs_to_reaugmented

Note: what self._re_augment_data does is re-create the missing dimension, such that the array we return has the same number of dimensions as the data inputs had.
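One way to keep this generic is to remember the length of each outer item before flattening. The sketch below works on plain nested lists; the function names are hypothetical, and it assumes the wrapped transform returns exactly one output per flattened input.

```python
def flatten_3d_to_2d(data_inputs):
    # Remember the length of each outer item so the dimension can be restored.
    lengths = [len(di) for di in data_inputs]
    flat = [item for di in data_inputs for item in di]
    return flat, lengths


def re_augment(flat_outputs, lengths):
    # Re-create the removed outer dimension from the remembered lengths.
    outputs, start = [], 0
    for length in lengths:
        outputs.append(flat_outputs[start:start + length])
        start += length
    return outputs


def transform_flattened(wrapped_transform, data_inputs):
    flat, lengths = flatten_3d_to_2d(data_inputs)
    return re_augment(wrapped_transform(flat), lengths)


data = [[[1, 2], [3, 4]], [[5, 6]]]
doubled = transform_flattened(lambda rows: [[2 * v for v in row] for row in rows], data)
print(doubled)  # [[[2, 4], [6, 8]], [[10, 12]]]
```

Storing lengths instead of calling `.shape` also handles outer items of unequal length.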

Add DataLoader class: a way to lazy-load iterable datasets.

Using generators lazily and, for instance, overloading the __iter__ and __len__ methods. To be used with Streaming Pipelines. DataLoaders should be able to be nested.

class DataLoader:

    def __iter__(self):
        ...  # Yield the items lazily.

    def __getitem__(self, index):
        ...  # Return the item (or slice) at the given index.

    def __len__(self):
        # Must be defined so that checking the length does not iterate (and exhaust) the whole loader.
        ...

Question: could they be without length / infinite?

I'd also like to think about how we could have things that make it possible to duplicate the data, e.g.: introduce a local shuffle (with a window size) or the concept of epoch loops to train NNs.
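As a rough sketch of the lazy and nestable behaviour described above, the hypothetical `LazyDataLoader` below takes a factory that returns a fresh iterator on each call, plus a length supplied up front so that `len()` never consumes the data. None of these names come from Neuraxle.

```python
class LazyDataLoader:
    # Hypothetical sketch: `factory` returns a fresh iterator on each call,
    # and `length` is supplied up front so len() never consumes the data.
    def __init__(self, factory, length):
        self.factory = factory
        self.length = length

    def __iter__(self):
        return iter(self.factory())

    def __len__(self):
        return self.length


def read_numbers():
    # Stands in for a lazy data source such as a file reader.
    for i in range(5):
        yield i


def squares_of(loader):
    # Nesting: a loader whose factory lazily maps over another loader.
    return LazyDataLoader(lambda: (x * x for x in loader), len(loader))


base = LazyDataLoader(read_numbers, 5)
squares = squares_of(base)
print(len(squares), list(squares), list(squares))
```

Because the factory is called on every `__iter__`, the loader can be iterated several times, which is one possible basis for epoch loops.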

Perhaps use an `apply` method to avoid duplicate code.

TODO: read and understand all the code contained in PyTorch's nn.Module class:
https://github.com/pytorch/pytorch/blob/d3e90bc47d21149545992f183ee4130a79934cca/torch/nn/modules/module.py#L31

This nn.Module class works somehow like our TruncableSteps or somehow like our BaseStep, which makes it interesting code to read to get inspiration.

Especially look at Module.apply, Module._apply, Module.cuda, Module.float, Module.to, the "hook" methods, Module.parameters, Module.named_parameters, Module.children, Module.modules, Module.named_modules, Module.train, Module.eval, and so forth.

Perhaps this could be useful to avoid duplicate code in the TruncableStep and in the BaseStep. For example, it seems to me that all those classes could use the same "apply" logic and thus avoid duplicating code as is done currently: get_hyperparams, set_hyperparams, get_hyperparams_space, set_hyperparams_space, setup, teardown, and so forth. We might want to think of pipelines as trees in which we can apply functions. I'd like to validate this idea.
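To make the "pipelines as trees" idea concrete, here is a minimal sketch with a hypothetical `Step` class (not Neuraxle's BaseStep). A single recursive `apply` visits every node, and `get_hyperparams` becomes a one-liner built on top of it. The `"__"` path separator is assumed from the hyperparameter naming used elsewhere in these issues.

```python
class Step:
    # Hypothetical minimal step; children make it a tree node.
    def __init__(self, name, hyperparams=None, children=None):
        self.name = name
        self.hyperparams = dict(hyperparams or {})
        self.children = list(children or [])

    def apply(self, fn, prefix=""):
        # Call fn on this step, then on every descendant, keyed by path.
        path = prefix + self.name
        results = {path: fn(self)}
        for child in self.children:
            results.update(child.apply(fn, path + "__"))
        return results

    def get_hyperparams(self):
        # No dedicated recursion needed: reuse apply.
        return self.apply(lambda step: dict(step.hyperparams))


pipeline = Step("pipeline", children=[
    Step("a", {"lr": 0.1}),
    Step("sub", children=[Step("b", {"k": 3})]),
])
print(pipeline.get_hyperparams())
```

Setters, setup and teardown could be expressed the same way by passing a different function to `apply`.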

Add auth and https optional arguments to FlaskApiWrapper

BaseStep must have a custom saver.

There is a problem in the following code:

class ResumablePipeline(Pipeline, ResumableStepMixin):
    """
    Fits and transform steps after latest checkpoint
    """

    def __init__(self, steps: NamedTupleList, pipeline_saver: PipelineSaver = None):
        Pipeline.__init__(self, steps=steps)

        if pipeline_saver is None:
            self.pipeline_saver = JoblibPipelineSaver(DEFAULT_CACHE_FOLDER)
        else:
            self.pipeline_saver = pipeline_saver

    # ...

The problem is that the pipeline decides which saver is used. That is wrong: the pipeline should allow the steps to use a custom saver. For instance, a TensorFlow, Keras, or PyTorch model will need special serialization using its own methods.

Suggested fix

Have a class like the hasher that allows the objects to change how they are saved.

It is okay for a resumable pipeline to be able to pass in a default saver, but only for pipeline steps that don't have a saver of their own. The pipeline can't force the saving.
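The fallback rule could look like the sketch below. All class and function names here are hypothetical and only illustrate the rule that a step's own saver takes precedence over the pipeline's default.

```python
import json
import os
import pickle
import tempfile


class PickleSaver:
    extension = ".pickle"

    def save(self, state, path):
        with open(path, "wb") as f:
            pickle.dump(state, f)


class JsonSaver:
    extension = ".json"

    def save(self, state, path):
        with open(path, "w") as f:
            json.dump(state, f)


class Step:
    def __init__(self, name, state, saver=None):
        self.name = name
        self.state = state
        self.saver = saver  # None means "use the pipeline's default".


def save_steps(steps, folder, default_saver):
    # The pipeline only supplies a fallback; a step's own saver always wins.
    paths = []
    for step in steps:
        saver = step.saver if step.saver is not None else default_saver
        path = os.path.join(folder, step.name + saver.extension)
        saver.save(step.state, path)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as folder:
    steps = [Step("scaler", {"mean": 0.5}), Step("model", {"w": [1, 2]}, JsonSaver())]
    saved = [os.path.basename(p) for p in save_steps(steps, folder, PickleSaver())]
print(saved)  # ['scaler.pickle', 'model.json']
```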

What it will impact

The pipeline steps might not need a reference to the parent anymore to be able to save themselves. They should save themselves in a directory passed to them in the context.

Suggested fix to do in `_fit_transform_core` and other core methods:

for step_name, step in steps_left_to_do:
    step, data_container = step.handle_fit_transform(data_container, context)

The context class:

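from typing import List


class Context:
    def __init__(
        self,
        current_tmp_path: str,  # path for the current object.
        stack_of_tmp_paths_of_parents: List[str],
        stack_of_parents: List["BaseStep"],
        stack_is_parent_saved: List[bool],  # useful to avoid overwriting too many times.
    ):
        self.current_tmp_path = current_tmp_path
        self.stack_of_tmp_paths_of_parents = stack_of_tmp_paths_of_parents
        self.stack_of_parents = stack_of_parents
        self.stack_is_parent_saved = stack_is_parent_saved

    def pop(self):
        # Go back up to the parent: its path becomes the current path.
        return Context(
            self.stack_of_tmp_paths_of_parents[-1],
            self.stack_of_tmp_paths_of_parents[:-1],
            self.stack_of_parents[:-1],
            self.stack_is_parent_saved[:-1],
        )

    def push(self):
        # the inverse of pop. Here, add something on the stack instead of removing.
        raise NotImplementedError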

MiniBatchSequentialPipeline Default Barrier

As a special measure for the final and last step of the whole pipeline, the pipeline could provide a default barrier maybe.

Add back hasher in base step

Pipeline Runners should be able to transform x AND y at the same time provided a new OutputTransformerMixin step.

See the Autoregress in this slide of the talk: https://youtu.be/WXWDDEkuSaE?t=513

Autoregress takes an input X and returns not just an X upon transform, but also creates a Y. Example:
X_subset, Y_subset = Autoregress().transform(X)

We could perhaps have a Mixin class that is an OutputTransformerMixin. The PipelineRunner, upon seeing this class, would know that the class changes the X and the Y at the same time. E.g.:

if isinstance(step, OutputTransformerMixin): 
    X, y = step.transform(X, y)

This is to be done within the transform loop and the fit_transform loop. So for example, a fit transform would be unpackable like this:

if isinstance(step, OutputTransformerMixin): 
    step, (X, y) = step.fit_transform(X, y)
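A self-contained sketch of the transform loop, under the assumption that the mixin is a pure marker class. The `Autoregress` and `AddOne` steps and the `run_transform` helper are hypothetical stand-ins written for this example.

```python
class OutputTransformerMixin:
    # Marker: transform takes and returns both data inputs and expected outputs.
    pass


class Autoregress(OutputTransformerMixin):
    # Hypothetical step: each value is to be predicted from the one before it.
    def transform(self, X, y=None):
        return X[:-1], X[1:]


class AddOne:
    # Regular step: only the data inputs change.
    def transform(self, X):
        return [x + 1 for x in X]


def run_transform(steps, X, y=None):
    for step in steps:
        if isinstance(step, OutputTransformerMixin):
            X, y = step.transform(X, y)
        else:
            X = step.transform(X)
    return X, y


print(run_transform([AddOne(), Autoregress()], [1, 2, 3]))  # ([2, 3], [3, 4])
```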

fix `fit_transform` in sklearnwrapper

Consider using if hasattr(self.wrapped_sklearn_predictor, 'fit_transform'): which is important to save time (e.g.: avoid doing fit then transform which might duplicate some computations and can cause pipelines to take 2x the time to compute).

Perhaps the MetaStepMixin should have its own hyperparameters too, not just pass them to the wrapped step, as is done in the TruncableSteps.

Truncable steps remove "__" from the params' keys in the set_hyperparams method, because they collect their own "terminal" hyperparams, which don't contain any "__", and otherwise pass the ones that contain "__" down the chain. MetaStepMixin should do the same, and perhaps the sklearn wrapper too.
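The key-splitting behaviour described here can be sketched as a small helper. The function name `split_hyperparams` is hypothetical; only the `"__"` separator convention comes from the text above.

```python
def split_hyperparams(flat_params, separator="__"):
    # Keys without the separator are "terminal": they belong to this step.
    # Keys with it are routed to the named child, minus the child's prefix.
    own, per_child = {}, {}
    for key, value in flat_params.items():
        if separator in key:
            child_name, remainder = key.split(separator, 1)
            per_child.setdefault(child_name, {})[remainder] = value
        else:
            own[key] = value
    return own, per_child


own, per_child = split_hyperparams({
    "n_jobs": 2,
    "model__learning_rate": 0.1,
    "model__layers__count": 3,
})
print(own)        # {'n_jobs': 2}
print(per_child)  # {'model': {'learning_rate': 0.1, 'layers__count': 3}}
```

Only the first separator is consumed at each level, so deeper keys such as `layers__count` are split again by the child.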

Pipeline checkpoints need to hash data and hyperparams

A pipeline can be run on many datasets, and it can be retrained with many different hyperparameters on each of those datasets. Thus, we need a way to tell checkpoints apart. This will allow hyperparameter tuning when the hyperparameters of the steps change, and it will prevent the same checkpoints from being reused between train data and test data (and other data) if checkpoints are enabled.

Dynamically create subfolders of pipeline cache according to:

data_hash: The hash of the input data to the pipeline. We don't want to load checkpoints for new data by mistake.
hyperparams_hash: The hyperparameter samples of the pipeline, e.g.: hash(p.get_hyperparams()).

This means that for each pipeline, the subfolders tree could look like that for example:

./cache
    {data_hash}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        subpipeline_c_{hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
        subpipeline_c_{another hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
        subpipeline_c_{also another hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
    {data_hash for another dataset}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        ...
    {data_hash for still another dataset}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        ...

Interesting facts and discussion points, assuming each step or most steps are checkpointed:

Sometimes, the hyperparameters of a few pipeline steps will be the same, and only the final pipeline step will change. This means it's possible to reuse the same checkpoints for the first steps given a dataset, and only the last step will need two different checkpoints.
If a pipeline step has a hyperparameter that changes, but the step is executed on the same data, the checkpoint name (suffix past a final underscore delimiter "__") will be different. (Or, if the hash is fast to compute, check whether the new checkpoint is the same as the old one, and if so, it may be possible to avoid rewriting to disk?)
Hashing huge numpy arrays can be a lengthy process, so perhaps we could add cheaper hashers, such as one that hashes only the shape of the input array when the input is an np array, and so forth.
The hasher could be sent as an argument of the PipelineRunner or Pipeline, and could be deactivated by sending a hasher that always returns the same value such that every checkpoint always triggers (?). In fact, there should also be a way to deactivate checkpoints completely (e.g.: for sending models in production).
For AutoML, we need to reuse the same checkpoints if the hyperparameters of the previous steps AND the current step are unchanged (hashes need to cover the range of steps before the checkpoint, not just the hyperparams of the checkpoint itself).
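The points above can be sketched as follows. The names `stable_hash` and `checkpoint_paths` are hypothetical, and hashing JSON-serialized values with SHA-256 is just one possible choice of hasher; real data such as numpy arrays would need its own.

```python
import hashlib
import json
import os


def stable_hash(obj):
    # JSON with sorted keys gives a deterministic byte string to hash.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]


def checkpoint_paths(cache_folder, data, steps):
    # steps: list of (name, hyperparams) pairs, in execution order.
    data_hash = stable_hash(data)
    paths, upstream = [], []
    for name, hyperparams in steps:
        # Include every upstream step's hyperparams, so changing an early
        # step invalidates the checkpoints of all later steps.
        upstream.append([name, hyperparams])
        file_name = "{}_{}.pickle".format(name, stable_hash(upstream))
        paths.append(os.path.join(cache_folder, data_hash, file_name))
    return paths


run_1 = checkpoint_paths("cache", [1, 2, 3], [("a", {"k": 1}), ("b", {"lr": 0.1})])
run_2 = checkpoint_paths("cache", [1, 2, 3], [("a", {"k": 1}), ("b", {"lr": 0.2})])
print(run_1[0] == run_2[0], run_1[1] == run_2[1])  # True False
```

Here both runs share the checkpoint of step `a`, and only step `b` gets a second file.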

Rename `Pipeline` to `CachingPipeline`, and add `.topipeline()` to the `CachingPipeline` which returns a vanilla Pipeline with purged Checkpoints.

This would provide a way to disable checkpointing for putting into prod after training.

Add Probability Density Functions (PDFs) of Hyperparameter Distributions

This is needed for AutoML algorithms that need the PDF and not only the .rvs() random variable sample generator.
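For illustration only, here is what pairing a sampler with its density could look like for the simplest case. The `Uniform` class below is a hypothetical stand-in, not Neuraxle's class, and it assumes the upper bound is strictly greater than the lower bound.

```python
import random


class Uniform:
    # Hypothetical stand-in for a hyperparameter distribution.
    def __init__(self, min_included, max_included):
        self.min_included = min_included
        self.max_included = max_included

    def rvs(self, rng=random):
        # Draw one random sample from the distribution.
        return rng.uniform(self.min_included, self.max_included)

    def pdf(self, x):
        # Constant density inside the bounds, zero outside.
        if self.min_included <= x <= self.max_included:
            return 1.0 / (self.max_included - self.min_included)
        return 0.0


hd = Uniform(0.0, 4.0)
print(hd.pdf(1.0), hd.pdf(5.0))  # 0.25 0.0
```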

@Eric2Hamel don't hesitate to assign yourself to this issue if you want to do it :)

Errors related to the `HyperparameterSamples` and to `HyperparameterSpace` types.

A few things to fix:

The constructor of the BaseStep should parse the provided hyperparameters to HyperparameterSamples and to HyperparameterSpace types by using the setter methods, to ensure that their types are converted if a simple dict was provided by mistake.
The get_hyperparams_space and get_hyperparams of the truncable steps should not return a dict; instead, they should return the correct type (HyperparameterSamples or HyperparameterSpace).
The HyperparameterSpace.rvs() method should perhaps return a HyperparameterSamples object instead of a HyperparameterSpace object, since the distribution collapses from a space to a sample of the space upon calling rvs (random variable sample).

Optional points:

See if by default we want the spaces and the samples to be flat or nested. After usage, it seems to me that it might be good to have them flat by default instead of nested by default.

Allow returning hashes and/or IDs as inputs in the JSONDataResponseEncoder for FlaskRESTApiWrapper

Inside the handle transform of JSONDataResponseEncoder, add hashes and/or IDs to the response returned by the JSONDataResponseEncoder used in FlaskRESTApiWrapper.

AutoMLSequentialWrapper

Do something like this for meta_fit:

from copy import copy


class AutoMLSequentialWrapper:

    def __init__(self, wrapped_pipeline, auto_ml_strategy, validation_technique, score_function, hyperparams_repository, n_iters):
        # Store every constructor argument on the instance.
        self.wrapped_pipeline = wrapped_pipeline
        self.auto_ml_strategy = auto_ml_strategy
        self.validation_technique = validation_technique
        self.score_function = score_function
        self.hyperparams_repository = hyperparams_repository
        self.n_iters = n_iters

    def fit(self, di, eo):
        for i in range(self.n_iters):
            # hps: List[HyperparameterSamples], scores: List[float]
            hps, scores = self.hyperparams_repository.load_all()

            self.auto_ml_strategy = self.auto_ml_strategy.fit(hps, scores)

            next_model_to_try_hps = self.auto_ml_strategy.guess_next_best_params(
                i, self.n_iters, self.wrapped_pipeline.get_hyperparams_space())
            self.hyperparams_repository.register_new_untrained_trial(next_model_to_try_hps)

            validation_wrapper = self.validation_technique(
                copy(self.wrapped_pipeline).set_hyperparams(next_model_to_try_hps))
            validation_wrapper, predicted_eo = validation_wrapper.fit_transform(di, eo)

            score = self.score_function(predicted_eo, eo)  # TODO: review order of arguments here.

            self.hyperparams_repository.set_score_for_trial(next_model_to_try_hps, score)

I'd like to validate the OOP object structure. For instance, what will we do when we run trials in parallel? This for loop is not enough; it'd be more like a pool of workers that try the N next best samples.

We also need a way to indicate that the trial crashed so that the auto_ml_strategy doesn't try that point again.

Any comments/suggestions on that @mlevesquedion @alexbrillant @Eric2Hamel?

Broken Pipeline Runner causes infinite recursion.

The default pipeline runner, an argument of the Pipeline class' constructor, is reused across different Pipeline instances because the default argument is created only once. This sometimes causes a recursion error when the shared pipeline runner has its steps set by every pipeline through set_steps and ends up looping on itself.

Quick fix: do a copy of the pipeline runner in the constructor of the Pipeline as such:
self.pipeline_runner: BasePipelineRunner = copy(pipeline_runner)

Better ways to fix that may be possible.
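The underlying Python behaviour is that default argument values are evaluated once, when the function is defined. The sketch below reproduces the sharing with hypothetical `Runner` and pipeline classes, and shows a common alternative: default to `None` and build (or copy) a runner per instance.

```python
from copy import copy


class Runner:
    def __init__(self):
        self.steps = None

    def set_steps(self, steps):
        self.steps = steps


class SharedDefaultPipeline:
    # Bug: the default Runner() is evaluated once, at function definition time.
    def __init__(self, steps, runner=Runner()):
        self.runner = runner
        self.runner.set_steps(steps)


class SafePipeline:
    # Fix: default to None and create (or copy) a runner per instance.
    def __init__(self, steps, runner=None):
        self.runner = Runner() if runner is None else copy(runner)
        self.runner.set_steps(steps)


a, b = SharedDefaultPipeline(["a"]), SharedDefaultPipeline(["b"])
print(a.runner is b.runner, a.runner.steps)  # True ['b']

c, d = SafePipeline(["c"]), SafePipeline(["d"])
print(c.runner is d.runner, c.runner.steps)  # False ['c']
```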

Have the hyperparams setters (and space setters) of the TruncableSteps and MetaStepMixin throw errors when the name of the substep doesn't exist

E.g.:

p = Pipeline([
    ("123", SomeStep()),
    ("xyz", SomeStep(learning_rate=0.1)),
])
p.set_hyperparams({"abc__learning_rate": 0.001})

Should throw:

StepNotFoundError: The step "abc" doesn't exist. Did you mean something in ["123", "xyz"]?
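A minimal sketch of such a check, using a plain dict of steps instead of a real Pipeline. The `set_hyperparams` function and the choice of `LookupError` as the base class of `StepNotFoundError` are assumptions made for this example.

```python
class StepNotFoundError(LookupError):
    pass


def set_hyperparams(steps, flat_params, separator="__"):
    # steps: dict mapping step name -> dict of that step's hyperparams.
    for key, value in flat_params.items():
        step_name, _, param_name = key.partition(separator)
        if step_name not in steps:
            raise StepNotFoundError(
                'The step "{}" doesn\'t exist. Did you mean something in {}?'.format(
                    step_name, list(steps)
                )
            )
        steps[step_name][param_name] = value
    return steps


steps = {"123": {}, "xyz": {"learning_rate": 0.1}}
try:
    set_hyperparams(steps, {"abc__learning_rate": 0.001})
except StepNotFoundError as error:
    print(error)
```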

Test Mutate Does Not Affect Inspect Method Code Source

Make sure that the source doesn't change if we mutate a step. Create a unit test for this.

StepClonerForEachDataInput doesn't propagate hyperparams.

This is the same problem as described in #28, but here it concerns the StepClonerForEachDataInput: it propagates neither get_hyperparams nor set_hyperparams to its contained pipelines. It'd be good not to return many duplicates of the same params: upon a get, return only the hyperparams of one instance, and upon a set, set the same hyperparams on each instance.

Same goes for spaces with set_hyperparams_space.

Need a better common base class for meta steps (that handles `get_hyperparams` and `set_hyperparams`)

Problem:

MetaStep and MetaSteps don't implement get_hyperparams nor set_hyperparams.
TruncableSteps does.

What should be done about it:

Move some logic from TruncableSteps to MetaSteps
Have MetaStep act the same, probably by inheriting MetaSteps but setting only one such meta step.

Better solutions are probably possible. Basically, we need not only nested (recursive) pipelines to be able to return their hyperparams, but also nested objects that are MetaStep(s). MetaStep(s) do contain other step(s) and should be able to get and set their hyperparams recursively, as done in TruncableSteps.

@alexbrillant I'd like your thoughts on this.

Teardown steps on a del method implemented in the BaseStep

_load_saved_pipeline_steps_before_index refactor with setitems

Implement get_hyperparams and get_hyperparams with flat parameter in BaseStep without breaking SKLearnWrapper

Create streaming pipelines that use a batch_size.

Add BaseStep.config, BaseStep.get_config() and BaseStep.set_config()

The concept is the same as for the Hyperparameter Samples and the Hyperparameter Spaces. But a config shouldn't change what happens to the data, just how it is treated (e.g.: number of cores).

It's interesting to move some parameters to a config for when those parameters are system-related or misc. We don't want some of those parameters to alter the hashes (e.g.: n_jobs in FeatureUnion shouldn't change the outcome and should be modified to be such a config parameter).

StreamingPipeline to have a `.tosequential()` method to convert it to a sequential pipeline. (low priority)

DataObjects To Contain Hashes And Items (x, and/or y)

Use Data Objects Inside the Pipeline to keep track of ids and hashes

class DataObject:
    def __init__(self, i, x):
        self.i = i
        self.x = x

    def __hash__(self):
        return hash((self.i, self.x))

Create distributed and parallel pipeline steps wrappers using object serialization to pass data from a master pipeline to parallel workers to ease scaling

Allow cluster computing (e.g.: in the cloud with any cloud provider).

Timeline: this is something to start thinking about in 2020.

Duplicated version number

When updating the lib for deployment on PyPI, changing the version number needs to be done at two places:

It'd be cool if the version number could be changed at only 1 place to change it everywhere.
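One common approach, given here as an assumption since the two places are not named above, is to keep a single `__version__ = "..."` assignment in the package and have the packaging script parse it. The helper name `read_version` is hypothetical.

```python
import re


def read_version(init_file_text):
    # Extract the value of a module-level `__version__ = "..."` assignment.
    match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', init_file_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("No __version__ assignment found.")
    return match.group(1)


print(read_version('"""Package docstring."""\n__version__ = "0.1.0"\n'))  # 0.1.0
```

The packaging script would then read the package's init file as text and pass it to this helper, so the number is written only once.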

Implement Pipeline Setup And Teardown Methods

Add setup and teardown methods to the base step.
Setup: recursively call the setup method of each pipeline step at the beginning of the execution.
Teardown: will be used to close sessions, connections, etc. at the end of the execution.
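A sketch of the recursive behaviour, with a hypothetical `Step` class in place of BaseStep. The `is_initialized` flag and the `all_steps` helper are illustrative additions for this example.

```python
class Step:
    # Hypothetical minimal step that may contain nested steps.
    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])
        self.is_initialized = False

    def setup(self):
        # Open resources for this step, then for each nested step.
        self.is_initialized = True
        for child in self.children:
            child.setup()
        return self

    def teardown(self):
        # Close nested steps first, then this step.
        for child in self.children:
            child.teardown()
        self.is_initialized = False
        return self

    def all_steps(self):
        # Walk the whole tree, this step included.
        yield self
        for child in self.children:
            yield from child.all_steps()


pipeline = Step("pipeline", [Step("a"), Step("sub", [Step("b")])])
pipeline.setup()
print([s.is_initialized for s in pipeline.all_steps()])  # [True, True, True, True]
pipeline.teardown()
print([s.is_initialized for s in pipeline.all_steps()])  # [False, False, False, False]
```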