
neuraxio / neuraxle


The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.

Home Page: https://www.neuraxle.org/

License: Apache License 2.0

Python 99.93% Shell 0.07%
pipeline pipeline-framework machine-learning deep-learning framework python-library hyperparameter-optimization hyperparameter-tuning hyperparameter-search hyperparameters

neuraxle's Introduction

Neuraxio

neuraxle's People

Contributors

alexbrillant, eric2hamel, guillaume-chevalier, jeromeblanchet, mlevesquedion, vaunorage, vincent-antaki

neuraxle's Issues

Missing features

One missing feature is feature union with subset feature processing.

Imagine you have text and numeric columns in the same dataframe...

Checkpoints Should Only Save Data Inputs

Output Transformers will perform a list zip operation to be able to save the transformed expected outputs on disk as well.

class OutputTransformerWrapper(MetaStepMixin, BaseStep):
    def __init__(self, wrapped: BaseStep):
        MetaStepMixin.__init__(self, wrapped)

    def transform(self, data_inputs):
        data_inputs, expected_outputs = data_inputs
        return self.wrapped.transform(list(zip(data_inputs, expected_outputs)))

Remove all the `_one` methods?

Could the streaming pipelines or minibatch streaming pipelines just use the regular methods and let the implementer of the step choose how to handle many successive minibatches or single items?

Add Neuraxle to Awesome Python

The pull request here needs 20 thumbs up (+1 👍) for it to be merged. Please leave your thumbs up at the pull request here. If you see that it already has 20 thumbs up, perhaps bump the bot again.

ReversiblePreprocessingWrapper

We'll need something like this:

class ReversiblePreprocessingWrapper(MetaStepMixin, BaseStep): 
    def __init__(
        self, 
        wrapped_step: BaseStep, 
        reversible_preprocessing_pipeline: BaseStep
    ):
        pass  # ...

    def transform(self, data_inputs): 
        data_inputs = self.reversible_preprocessing_pipeline.transform(data_inputs)
        data_inputs = self.wrapped_step.transform(data_inputs)
        data_inputs = self.reversible_preprocessing_pipeline.inverse_transform(data_inputs)
        return data_inputs

FlattenForEach step

We already have a ForEachDataInputs and a CloneStepForEachDataInputs. I think we also need a Flatten3Dto2DForEachDataInputs which is the same concept as a ForEachDataInputs but that reduces a dimension instead of manually looping on it.

Example: instead of looping on the data with for di in data_inputs: self.wrapped.transform(di), the Flatten3Dto2DForEachDataInputs step does this:

reduced_data_inputs = sum(list(data_inputs), [])  # converts data inputs from 3D to 2D. 
outs = self.wrapped.transform(reduced_data_inputs)

# bear with me: the following is like doing: outs.reshape(len(data_inputs), outs.shape[0] // len(data_inputs), *outs.shape[1:])
# but we can't call `.shape` nor `.reshape` because the data type might not be a np.array or might not be a list, we want to keep things generic. 
reshaped_outs_to_reaugmented = self._re_augment_data(outs)  

return reshaped_outs_to_reaugmented

Note: what self._re_augment_data does is re-create the missing dimension, such that the array we return has the same number of dimensions as the data inputs had.
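The flatten-then-re-augment idea can be sketched as follows. This is a minimal sketch, assuming the data inputs are nested Python lists: the class name `FlattenForEach`, the `transform_fn` argument (standing in for `self.wrapped.transform`) and the `len_di` bookkeeping are illustrative assumptions, not existing Neuraxle API.

```python
from typing import Any, Callable, List


class FlattenForEach:
    """Hypothetical wrapper: flattens one dimension, transforms, then restores it."""

    def __init__(self, transform_fn: Callable[[List[Any]], List[Any]]):
        # transform_fn stands in for `self.wrapped.transform`.
        self.transform_fn = transform_fn
        self.len_di: List[int] = []

    def transform(self, data_inputs: List[List[Any]]) -> List[List[Any]]:
        # Remember the length of each sub-list so the dimension can be rebuilt.
        self.len_di = [len(di) for di in data_inputs]
        reduced = [item for di in data_inputs for item in di]
        outs = self.transform_fn(reduced)
        return self._re_augment_data(outs)

    def _re_augment_data(self, outs: List[Any]) -> List[List[Any]]:
        # Slice the flat outputs back into chunks of the remembered lengths.
        reaugmented, start = [], 0
        for length in self.len_di:
            reaugmented.append(list(outs[start:start + length]))
            start += length
        return reaugmented


step = FlattenForEach(lambda items: [x * 2 for x in items])
result = step.transform([[1, 2], [3], [4, 5, 6]])
```

Storing the sub-list lengths avoids relying on `.shape` or `.reshape`, and it also handles sub-lists of unequal length, which a plain reshape could not.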

Add DataLoader class: a way to lazy-load iterable datasets.

Use generators lazily, for instance by overloading the __iter__ and __len__ methods. To be used with Streaming Pipelines. DataLoaders should be able to be nested.

class DataLoader:

    def __iter__(self):
        ...  # to be defined

    def __getitem__(self, index):
        ...  # to be defined

    def __len__(self):  # Len must be defined so that checking the length does not empty/iterate the loader completely.
        ...  # to be defined

Question: could they be without length / infinite?

I'd also like to think about how we could have things that duplicate or reorder the data, e.g.: a local shuffle (with a window size) or the concept of epoch loops to train NNs.
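One way the lazy and nested loaders could look is sketched below. This is only an illustration under stated assumptions: `LazyDataLoader`, `Epochs`, the `make_iter` factory argument and the choice to raise `TypeError` for an unknown length are all hypothetical, not part of Neuraxle.

```python
from typing import Callable, Iterable, Iterator, Optional


class LazyDataLoader:
    """Hypothetical lazy loader: a factory creates a fresh iterator on each pass."""

    def __init__(self, make_iter: Callable[[], Iterable], length: Optional[int] = None):
        self.make_iter = make_iter
        self.length = length  # None means unknown or infinite length.

    def __iter__(self) -> Iterator:
        # Nothing is loaded until iteration actually starts.
        return iter(self.make_iter())

    def __len__(self) -> int:
        if self.length is None:
            raise TypeError("this loader has no known length")
        return self.length


class Epochs:
    """Hypothetical nested loader that replays another loader n_epochs times."""

    def __init__(self, loader: LazyDataLoader, n_epochs: int):
        self.loader = loader
        self.n_epochs = n_epochs

    def __iter__(self) -> Iterator:
        for _ in range(self.n_epochs):
            yield from self.loader

    def __len__(self) -> int:
        # Computed from the inner loader, without consuming any data.
        return len(self.loader) * self.n_epochs


squares = LazyDataLoader(lambda: (i * i for i in range(3)), length=3)
two_epochs = Epochs(squares, n_epochs=2)
items = list(two_epochs)
```

Passing a factory rather than a generator object is what makes the loader re-iterable across epochs. A loader without a length stays usable for iteration; only `len()` fails on it.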

Perhaps use an `apply` method to avoid duplicate code.

TODO: read and understand all the code contained in PyTorch's nn.Module class:
https://github.com/pytorch/pytorch/blob/d3e90bc47d21149545992f183ee4130a79934cca/torch/nn/modules/module.py#L31

This nn.Module class works somehow like our TruncableSteps or somehow like our BaseStep, which makes it interesting code to read to get inspiration.

Especially look at Module.apply, Module._apply, Module.cuda, Module.float, Module.to, the "hook" methods, Module.parameters, Module.named_parameters, Module.children, Module.modules, Module.named_modules, Module.train, Module.eval, and so forth.

Perhaps this could be useful to avoid duplicate code in the TruncableStep and in the BaseStep. For example, it seems to me that all those classes could use the same "apply" logic and thus avoid duplicating code as is done currently: get_hyperparams, set_hyperparams, get_hyperparams_space, set_hyperparams_space, setup, teardown, and so forth. We might want to think of pipelines as trees in which we can apply functions. I'd like to validate this idea.
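To make the "pipelines as trees" idea concrete, here is a minimal sketch of a single recursive `apply`. The `Step` class, its `children` attribute, the underscore-prefixed per-node methods and the `__` path separator are assumptions made for illustration only.

```python
from typing import Any, Dict, List, Optional


class Step:
    """Hypothetical stand-in for a step that may contain child steps."""

    def __init__(self, name: str, hyperparams: Optional[dict] = None,
                 children: Optional[List["Step"]] = None):
        self.name = name
        self.hyperparams = dict(hyperparams or {})
        self.children = list(children or [])
        self.is_setup = False

    def apply(self, method_name: str, *args, prefix: str = "") -> Dict[str, Any]:
        # Call the method on this node, then recurse into the children,
        # collecting every result under a path-like key.
        path = prefix + self.name
        results = {path: getattr(self, method_name)(*args)}
        for child in self.children:
            results.update(child.apply(method_name, *args, prefix=path + "__"))
        return results

    def _get_hyperparams(self) -> dict:
        return self.hyperparams

    def _setup(self) -> None:
        self.is_setup = True


pipeline = Step("pipeline", children=[
    Step("scaler", {"with_mean": True}),
    Step("sub", children=[Step("model", {"lr": 0.1})]),
])
all_hyperparams = pipeline.apply("_get_hyperparams")
pipeline.apply("_setup")
```

With this shape, getters, setters, setup and teardown only need a non-recursive per-node method each; the traversal is written once.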

BaseStep must have a custom saver.

There is a problem in the following code:

class ResumablePipeline(Pipeline, ResumableStepMixin):
    """
    Fits and transform steps after latest checkpoint
    """

    def __init__(self, steps: NamedTupleList, pipeline_saver: PipelineSaver = None):
        Pipeline.__init__(self, steps=steps)

        if pipeline_saver is None:
            self.pipeline_saver = JoblibPipelineSaver(DEFAULT_CACHE_FOLDER)
        else:
            self.pipeline_saver = pipeline_saver

    # ...

The problem is that the pipeline decides which saver is used, which is wrong. The pipeline should allow the steps to use a custom saver. For instance, a TensorFlow, Keras, or PyTorch model will need special serialization using its own methods.

Suggested fix

Have a class like the hasher that allows the objects to change how they are saved.

It is okay for a resumable pipeline to be able to pass in a default saver, but only for the pipeline steps that don't have a saver of their own. The pipeline can't force how steps are saved.
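The fallback rule could be sketched like this. All names here (`PickleSaver`, `JsonWeightsSaver`, `save_steps`, the `saver` attribute) are hypothetical; the JSON saver merely stands in for framework-specific serialization.

```python
import json
import os
import pickle
import tempfile


class PickleSaver:
    """Hypothetical default saver supplied by the pipeline."""

    def save(self, step, path: str) -> str:
        file_path = os.path.join(path, step.name + ".pickle")
        with open(file_path, "wb") as f:
            pickle.dump(vars(step), f)  # pickles the step's attributes
        return file_path


class JsonWeightsSaver:
    """Hypothetical custom saver, standing in for a framework's own format."""

    def save(self, step, path: str) -> str:
        file_path = os.path.join(path, step.name + ".json")
        with open(file_path, "w") as f:
            json.dump(step.weights, f)
        return file_path


class Step:
    def __init__(self, name, saver=None, weights=None):
        self.name = name
        self.saver = saver  # None: the step has no saver of its own.
        self.weights = weights


def save_steps(steps, path, default_saver):
    # The pipeline only provides a fallback; a step's own saver wins.
    return [(step.saver or default_saver).save(step, path) for step in steps]


with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_steps(
        [Step("scaler"), Step("net", JsonWeightsSaver(), [0.1, 0.2])],
        tmp_dir,
        default_saver=PickleSaver(),
    )
    saved_names = [os.path.basename(p) for p in saved]
```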

What it will impact

The pipeline steps might not need a reference to the parent anymore to be able to save themselves. They should save themselves in a directory passed to them in the context.

Suggested fix to do in _fit_transform_core and other core methods:

for step_name, step in steps_left_to_do:
    step, data_container = step.handle_fit_transform(data_container, context)

The context class:

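class Context:
    current_tmp_path: str  # path for the current object.
    stack_of_tmp_paths_of_parents: List[str]
    stack_of_parents: List[BaseStep]
    stack_is_parent_saved: List[bool]  # useful to avoid overwriting too many times.

    def __init__(self, current_tmp_path, stack_of_tmp_paths_of_parents, stack_of_parents, stack_is_parent_saved):
        self.current_tmp_path = current_tmp_path
        self.stack_of_tmp_paths_of_parents = stack_of_tmp_paths_of_parents
        self.stack_of_parents = stack_of_parents
        self.stack_is_parent_saved = stack_is_parent_saved

    def pop(self):
        # go back up one level: the last parent's path becomes the current path.
        return Context(
            self.stack_of_tmp_paths_of_parents[-1],
            self.stack_of_tmp_paths_of_parents[:-1],
            self.stack_of_parents[:-1],
            self.stack_is_parent_saved[:-1]
        )

    def push(self):
        # the inverse of pop. Here, add something on the stack instead of removing.
        ...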

Pipeline Runners should be able to transform x AND y at the same time provided a new OutputTransformerMixin step.

See the Autoregress in this slide of the talk: https://youtu.be/WXWDDEkuSaE?t=513

Autoregress takes an input X and returns not just an X upon transform, but also creates a Y. Example:
X_subset, Y_subset = Autoregress().transform(X)

We could perhaps have a Mixin class that is an OutputTransformerMixin. The PipelineRunner, upon seeing this class, would know that the class changes the X and the Y at the same time. E.g.:

if isinstance(step, OutputTransformerMixin): 
    X, y = step.transform(X, y)

This is to be done within the transform loop and the fit_transform loop. So for example, a fit_transform would be unpacked like this:

if isinstance(step, OutputTransformerMixin): 
    step, (X, y) = step.fit_transform(X, y)
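A toy version of the transform loop, to show the branching on the mixin. The `run_transform` function, the `Identity` step and the way this `Autoregress` builds its targets (each value predicts the next one) are assumptions for illustration; the talk's actual Autoregress may differ.

```python
class OutputTransformerMixin:
    """Hypothetical marker: the step's transform takes and returns both X and y."""


class Identity:
    def transform(self, X):
        return X


class Autoregress(OutputTransformerMixin):
    """Toy version: each input is paired with the next value of the series."""

    def transform(self, X, y=None):
        return X[:-1], X[1:]


def run_transform(steps, X, y=None):
    # The runner branches on the mixin to decide how to call each step.
    for step in steps:
        if isinstance(step, OutputTransformerMixin):
            X, y = step.transform(X, y)
        else:
            X = step.transform(X)
    return X, y


X_subset, Y_subset = run_transform([Identity(), Autoregress()], [1, 2, 3, 4])
```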

fix `fit_transform` in sklearnwrapper

Consider using if hasattr(self.wrapped_sklearn_predictor, 'fit_transform'): which is important to save time (e.g.: avoid doing fit then transform which might duplicate some computations and can cause pipelines to take 2x the time to compute).
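A minimal sketch of that check, using a toy estimator instead of a real scikit-learn one so that the example stays self-contained. The `SKLearnWrapper` shown here is a simplified, hypothetical stand-in for the actual wrapper, and `CenteringEstimator` is invented to record which methods get called.

```python
class SKLearnWrapper:
    """Hypothetical wrapper; returns (self, outputs)."""

    def __init__(self, wrapped_sklearn_predictor):
        self.wrapped_sklearn_predictor = wrapped_sklearn_predictor

    def fit_transform(self, data_inputs, expected_outputs=None):
        wrapped = self.wrapped_sklearn_predictor
        if hasattr(wrapped, "fit_transform"):
            # Single call: the estimator can share work between fit and transform.
            outputs = wrapped.fit_transform(data_inputs, expected_outputs)
        else:
            outputs = wrapped.fit(data_inputs, expected_outputs).transform(data_inputs)
        return self, outputs


class CenteringEstimator:
    """Toy estimator that records which methods were called."""

    def __init__(self):
        self.calls = []

    def fit(self, X, y=None):
        self.calls.append("fit")
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        self.calls.append("transform")
        return [x - self.mean_ for x in X]

    def fit_transform(self, X, y=None):
        self.calls.append("fit_transform")
        self.mean_ = sum(X) / len(X)
        return [x - self.mean_ for x in X]


estimator = CenteringEstimator()
_, centered = SKLearnWrapper(estimator).fit_transform([1.0, 2.0, 3.0])
```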

Pipeline checkpoints need to hash data and hyperparams

A pipeline can be run on many datasets and it can be re-trained with many different hyperparameters on all of those datasets. Thus, we need a way to distinguish between checkpoints. This will allow hyperparameter tuning when the hyperparameters of the steps change, and it will prevent reusing the same checkpoints between train data and test data (and other data) if checkpoints are enabled.

Dynamically create subfolders of pipeline cache according to:

  • data_hash: The hash of the input data to the pipeline. We don't want to load checkpoints for new data by mistake.
  • hyperparams_hash: The hyperparameter samples of the pipeline, e.g.: hash(p.get_hyperparams()).

This means that for each pipeline, the subfolders tree could look like that for example:

./cache
    {data_hash}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        subpipeline_c_{hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
        subpipeline_c_{another hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
        subpipeline_c_{also another hyperparams_hash for subpipeline_c}/
            step_d_{hyperparams_hash for step_d}.pickle
            step_e_{hyperparams_hash for step_e}.pickle
            step_f_{hyperparams_hash for step_f}.pickle
    {data_hash for another dataset}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        ...
    {data_hash for still another dataset}/
        step_a_{hyperparams_hash for step_a}.pickle
        step_b_{hyperparams_hash for step_b}.pickle
        step_b_{another hyperparams_hash for step_b}.pickle
        ...

Interesting facts and discussion points, assuming each step or most steps are checkpointed:

  • Sometimes, the hyperparameters of the first few pipeline steps will be the same, and only the final pipeline step will change. This means it's possible to reuse the same checkpoints for those first steps given a dataset, and only the last step will need two different checkpoints.
  • If a pipeline step has a hyperparameter that changes, but is executed on the same data, the checkpoint name (suffix past a final underscore delimiter "__") will be different. (Or, if the hash is fast to compute, check whether the new checkpoint is the same as the old one, and if so it's possible to avoid re-writing to disk?)
  • Hashing huge numpy arrays can be a lengthy process, so perhaps we could add cheaper hashers, such as hashing just the shape of the input array when the input is an np array, and so forth.
  • The hasher could be sent as an argument of the PipelineRunner or Pipeline, and could be deactivated by sending a hasher that always returns the same value such that every checkpoint just always triggers (?). In fact, there should also be a way to deactivate checkpoints completely (e.g.: for sending models in production).
  • For AutoML, we need to reuse the same checkpoints if the hyperparameters of the previous steps AND the current step are unchanged (hashes need to take into account the range of steps before the checkpoint, not just the hyperparams of the checkpoint itself).
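The last point, hashing the range of steps before a checkpoint, could be sketched as a chained hash. This is a sketch under assumptions: the function names, the use of MD5 over JSON-encoded hyperparameters, and the flat file naming are illustrative choices, and the data hash here is computed naively from `repr`.

```python
import hashlib
import json
import os


def hash_hyperparams(hyperparams: dict) -> str:
    # Sorted keys make the hash independent of dict insertion order.
    encoded = json.dumps(hyperparams, sort_keys=True).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()


def checkpoint_paths(cache_dir, data_hash, steps):
    """steps: list of (step_name, hyperparams) in pipeline order."""
    paths, running = [], hashlib.md5()
    for name, hyperparams in steps:
        # Chain the hashes so a checkpoint also depends on all previous steps.
        running.update(hash_hyperparams(hyperparams).encode("utf-8"))
        file_name = f"{name}_{running.hexdigest()}.pickle"
        paths.append(os.path.join(cache_dir, data_hash, file_name))
    return paths


data_hash = hashlib.md5(repr([[1, 2], [3, 4]]).encode("utf-8")).hexdigest()
run_1 = checkpoint_paths("cache", data_hash, [("step_a", {"n": 1}), ("step_b", {"lr": 0.1})])
run_2 = checkpoint_paths("cache", data_hash, [("step_a", {"n": 1}), ("step_b", {"lr": 0.5})])
```

Here both runs share the `step_a` checkpoint path, and only the `step_b` path differs. Conversely, changing the hyperparameters of `step_a` would change every later path, which is the invalidation behaviour AutoML needs.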

Errors related to the `HyperparameterSamples` and to `HyperparameterSpace` types.

A few things to fix:

  • The constructor of the BaseStep should parse the provided hyperparameters to the HyperparameterSamples and HyperparameterSpace types by using the setter methods, to ensure that their types are converted if a simple dict was provided by mistake.
  • The get_hyperparams_space and get_hyperparams of the truncable steps should not return a dict, but should instead return something of the right type (HyperparameterSamples or HyperparameterSpace).
  • The HyperparameterSpace.rvs() method should perhaps return a HyperparameterSamples object instead of a HyperparameterSpace object, since the distribution collapses from a space to a sample of the space upon calling rvs (random variable sample).

Optional points:

  • See if by default we want the spaces and the samples to be flat or nested. After usage, it seems to me that it might be good to have them flat by default instead of nested by default.

AutoMLSequentialWrapper

Do something like this for meta_fit:

from copy import copy


class AutoMLSequentialWrapper:

    def __init__(self, wrapped_pipeline, auto_ml_strategy, validation_technique, score_function, hyperparams_repository, n_iters):
        # store every argument as an attribute.
        self.wrapped_pipeline = wrapped_pipeline
        self.auto_ml_strategy = auto_ml_strategy
        self.validation_technique = validation_technique
        self.score_function = score_function
        self.hyperparams_repository = hyperparams_repository
        self.n_iters = n_iters

    def fit(self, di, eo):
        for i in range(self.n_iters):
            # hps: List[HyperparameterSamples], scores: List[float]
            hps, scores = self.hyperparams_repository.load_all()

            self.auto_ml_strategy = self.auto_ml_strategy.fit(hps, scores)

            next_model_to_try_hps = self.auto_ml_strategy.guess_next_best_params(i, self.n_iters, self.wrapped_pipeline.get_hyperparams_space())
            self.hyperparams_repository.register_new_untrained_trial(next_model_to_try_hps)

            validation_wrapper = self.validation_technique(copy(self.wrapped_pipeline).set_hyperparams(next_model_to_try_hps))
            validation_wrapper, predicted_eo = validation_wrapper.fit_transform(di, eo)

            score = self.score_function(predicted_eo, eo)  # TODO: review order of arguments here.

            self.hyperparams_repository.set_score_for_trial(next_model_to_try_hps, score)

        return self

I'd like to validate the OOP object structure. For instance, what will we do when we run trials in parallel? This for loop is not enough; it would be more like a pool of workers that try the N next best samples.

We also need a way to indicate that the trial crashed so that the auto_ml_strategy doesn't try that point again.
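One possible shape for both concerns (a pool of workers, and marking crashed trials) is sketched below. Everything here is an assumption for discussion: the in-memory repository, the `trial_id` returned on registration, the `set_failed` method and the `status` field do not exist in the snippet above, which keys trials by their hyperparameters.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemoryHyperparamsRepository:
    """Hypothetical repository that records a status for each trial."""

    def __init__(self):
        self.trials = []

    def register_new_untrained_trial(self, hps):
        self.trials.append({"hps": hps, "score": None, "status": "running"})
        return len(self.trials) - 1  # trial id

    def set_score_for_trial(self, trial_id, score):
        self.trials[trial_id].update(score=score, status="success")

    def set_failed(self, trial_id, error):
        self.trials[trial_id].update(status="failed", error=repr(error))


def run_trial(repo, trial_id, hps, train_and_score):
    try:
        repo.set_score_for_trial(trial_id, train_and_score(hps))
    except Exception as error:  # a crashed trial must not stop the others
        repo.set_failed(trial_id, error)


def toy_score(hps):
    # Stands in for fitting the validation wrapper and scoring its predictions.
    if hps["lr"] <= 0:
        raise ValueError("lr must be positive")
    return 1.0 - hps["lr"]


repo = InMemoryHyperparamsRepository()
candidates = [{"lr": 0.1}, {"lr": -1.0}, {"lr": 0.5}]
trial_ids = [repo.register_new_untrained_trial(hps) for hps in candidates]
with ThreadPoolExecutor(max_workers=2) as pool:
    for trial_id, hps in zip(trial_ids, candidates):
        pool.submit(run_trial, repo, trial_id, hps, toy_score)
statuses = [trial["status"] for trial in repo.trials]
```

A strategy reading this repository could then skip, or penalize, the points whose status is "failed".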

Any comments/suggestions on that @mlevesquedion @alexbrillant @Eric2Hamel?

Broken Pipeline Runner causes infinite recursion.

The default pipeline runner, an argument of the Pipeline class' constructor, is reused across different Pipeline instances because the default argument is created only once. This sometimes causes a recursion error when the shared pipeline runner gets its steps set by several pipelines at once with set_steps and loops on itself.

Quick fix: do a copy of the pipeline runner in the constructor of the Pipeline as such:
self.pipeline_runner: BasePipelineRunner = copy(pipeline_runner)

Better ways to fix that may be possible.
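The underlying cause is Python's evaluation of default arguments once, at function definition time. The sketch below reproduces the sharing with toy classes and shows one alternative fix, a `None` sentinel; `Runner`, `SharedDefaultPipeline` and `FixedPipeline` are hypothetical names, not the real classes.

```python
from copy import copy


class Runner:
    """Hypothetical stand-in for a pipeline runner that stores its steps."""

    def __init__(self):
        self.steps = None

    def set_steps(self, steps):
        self.steps = steps


class SharedDefaultPipeline:
    # Bug: Runner() is evaluated once, when the function is defined.
    def __init__(self, steps, pipeline_runner=Runner()):
        self.pipeline_runner = pipeline_runner
        self.pipeline_runner.set_steps(steps)


class FixedPipeline:
    # Fix: use None as the sentinel and copy any runner that is passed in.
    def __init__(self, steps, pipeline_runner=None):
        self.pipeline_runner = Runner() if pipeline_runner is None else copy(pipeline_runner)
        self.pipeline_runner.set_steps(steps)


a, b = SharedDefaultPipeline(["a"]), SharedDefaultPipeline(["b"])
shared = a.pipeline_runner is b.pipeline_runner  # both hold the same runner

c, d = FixedPipeline(["c"]), FixedPipeline(["d"])
independent = c.pipeline_runner is not d.pipeline_runner
```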

StepClonerForEachDataInput doesn't propagate hyperparams.

This is the same problem as described in #28, however here it is about the StepClonerForEachDataInput: it propagates neither get_hyperparams nor set_hyperparams to its contained pipelines. It would be good here not to return many duplicates of the same params: upon a get, return only the hyperparams of one instance, and upon a set, set the same hyperparams for each instance.

Same goes for spaces with set_hyperparams_space.
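The get-from-one, set-to-all behaviour could look like the following sketch. The `Step` and `StepCloner` classes are simplified, hypothetical stand-ins; in particular the cloner here is given a fixed number of clones up front.

```python
from copy import deepcopy


class Step:
    def __init__(self, hyperparams=None):
        self.hyperparams = dict(hyperparams or {})

    def get_hyperparams(self):
        return self.hyperparams

    def set_hyperparams(self, hyperparams):
        self.hyperparams = dict(hyperparams)
        return self


class StepCloner:
    """Hypothetical cloner: one copy of the wrapped step per data input."""

    def __init__(self, wrapped, n_clones):
        self.steps = [deepcopy(wrapped) for _ in range(n_clones)]

    def get_hyperparams(self):
        # All clones share the same hyperparams, so the first one is enough.
        return self.steps[0].get_hyperparams()

    def set_hyperparams(self, hyperparams):
        # Propagate the same values to every clone.
        for step in self.steps:
            step.set_hyperparams(hyperparams)
        return self


cloner = StepCloner(Step({"lr": 0.1}), n_clones=3).set_hyperparams({"lr": 0.01})
```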

Need a better common base class for meta steps (that handles `get_hyperparams` and `set_hyperparams`)

Problem:

  • MetaStep and MetaSteps don't implement get_hyperparams nor set_hyperparams.
  • TruncableSteps does.

What should be done about it:

  • Move some logic from TruncableSteps to MetaSteps
  • Have MetaStep act the same, probably by inheriting MetaSteps but setting only one such meta step.

Better solutions are probably possible. Basically, we need not only nested (recursive) pipelines to be able to return their hyperparams, but also nested objects that are MetaStep(s). MetaStep(s) do contain other step(s) and should be able to get and set their hyperparams recursively as done in TruncableSteps.

@alexbrillant I'd like your thoughts on this.

Add BaseStep.config, BaseStep.get_config() and BaseStep.set_config()

The concept is the same as for the Hyperparameter Samples and the Hyperparameter Spaces. But a config shouldn't change what happens to the data, just how it is treated (e.g.: number of cores).

It's interesting to move some parameters to a config for when those parameters are system-related or misc. We don't want some of those parameters to alter the hashes (e.g.: n_jobs in FeatureUnion shouldn't change the outcome and should be modified to be such a config parameter).
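A small sketch of how a config could be kept out of the hashes. The `Step` class and its `hyperparams_hash` method are hypothetical; only the `get_config` and `set_config` names come from the title of this issue.

```python
import hashlib
import json


class Step:
    """Hypothetical step that keeps hyperparams and config apart."""

    def __init__(self, hyperparams=None, config=None):
        self.hyperparams = dict(hyperparams or {})
        self.config = dict(config or {})

    def get_config(self):
        return self.config

    def set_config(self, config):
        self.config = dict(config)
        return self

    def hyperparams_hash(self):
        # Only hyperparams enter the hash; config is deliberately left out.
        encoded = json.dumps(self.hyperparams, sort_keys=True).encode("utf-8")
        return hashlib.md5(encoded).hexdigest()


single_core = Step({"n_components": 2}, config={"n_jobs": 1})
multi_core = Step({"n_components": 2}).set_config({"n_jobs": 8})
```

Both steps hash identically, so a checkpoint computed on one core could be reused on eight.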

Implement Pipeline Setup And Teardown Methods

Add setup and teardown methods to the base step.
Setup: recursively call the setup method of each pipeline step at the beginning of the execution.
Teardown: will be used to close sessions, connections, etc. at the end of the execution.
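A minimal sketch of the recursion, with teardown guaranteed by `try`/`finally`. The `Step` class, the `run` helper and the `log` list are hypothetical and only serve to show the call order.

```python
class Step:
    """Hypothetical step that may contain child steps."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def setup(self, log):
        log.append("setup " + self.name)
        for child in self.children:
            child.setup(log)

    def teardown(self, log):
        # Tear down children first, in reverse order of setup.
        for child in reversed(self.children):
            child.teardown(log)
        log.append("teardown " + self.name)


def run(pipeline, work, log):
    pipeline.setup(log)
    try:
        return work()
    finally:
        # Runs even if `work` raises, so sessions and connections get closed.
        pipeline.teardown(log)


log = []
pipeline = Step("pipeline", [Step("a"), Step("b")])
result = run(pipeline, lambda: "done", log)
```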
