mlbazaar / mlblocks

A library for composing end-to-end tunable machine learning pipelines.

Home Page: https://mlbazaar.github.io/MLBlocks
License: MIT License
Add a multitable dataset
The directories `mlblocks/components/pipelines` and `examples` should be removed from this repository and moved into the MLBlocks-Demos one, which is still to be created.
Introduce a way to extract intermediate outputs from a Pipeline without having to rebuild it and refit it.
The functionality should be something along these lines:

```python
primitives = ['a', 'b', 'c', 'd']
pipeline = MLPipeline(primitives)

pipeline.fit(train_X, train_y)

# get the output of block 'b' instead of the final pipeline output
output_of_b = pipeline.predict(train_X, output_='b')
```
Some elements in the tox config are not required and can be cleaned up.
Add more datasets to cover a broader scope of data modalities and task types:
| data_modality | task_type | done | dataset |
|---|---|---|---|
| audio | classification | no | ESC-50 |
| graph | communityDetection | no | 6_70_com_amazon |
| graph | graphMatching | no | LL1_DIC28_net |
| graph | linkPrediction | yes | UMLS |
| graph | vertexNomination | no | LL1_net_nomination_seed |
| image | classification | yes | USPS |
| image | regression | yes | HandGeometry |
| single_table | classification | yes | Iris |
| single_table | collaborativeFiltering | no | 60_jester |
| single_table | regression | yes | Boston |
| text | classification | no | Personae |
| timeseries | classification | no | LL1_Trace |
Add a JSON keyword to specify dependencies or requirements, and validate them during MLBlock instantiation.
If a dependency is missing or an incompatible version is installed, a user-friendly exception should be raised when the MLBlock is created, asking the user to install the required dependency.
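A minimal validation sketch, assuming a hypothetical `requirements` list in the primitive JSON and using `pkg_resources` to check the installed versions:

```python
import pkg_resources


def validate_requirements(block_name, requirements):
    # ``requirements`` would come from the hypothetical JSON keyword,
    # e.g. "requirements": ["numpy>=1.15", "Keras<3"]
    for requirement in requirements:
        try:
            pkg_resources.require(requirement)
        except pkg_resources.DistributionNotFound:
            raise ImportError(
                '{} requires {}. Please install it.'.format(
                    block_name, requirement))
        except pkg_resources.VersionConflict as vc:
            raise ImportError(
                '{} requires {}, but {} is installed. Please install '
                'a compatible version.'.format(
                    block_name, requirement, vc.dist))
```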
Implement a way to save and load an MLPipeline to and from a JSON file.
The JSON file structure would contain the `__init__` arguments of the MLPipeline, as well as the `hyperparameters` and `tunable_hyperparameters`:
```json
{
    "primitives": [
        "a_primitive",
        "another_primitive"
    ],
    "init_params": {
        "a_primitive": {
            "an_argument": "a_value"
        }
    },
    "hyperparameters": {
        "a_primitive#1": {
            "an_argument": "a_value",
            "another_argument": "another_value"
        },
        "another_primitive#1": {
            "yet_another_argument": "yet_another_value"
        }
    },
    "tunable_hyperparameters": {
        "another_primitive#1": {
            "yet_another_argument": {
                "type": "str",
                "default": "a_default_value",
                "values": [
                    "a_default_value",
                    "yet_another_value"
                ]
            }
        }
    }
}
```
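A sketch of what the save and load methods could look like; `to_dict` exists per the new-API description elsewhere in this document, while `from_dict` is an assumed counterpart:

```python
import json


class MLPipeline(object):

    def save(self, path):
        # dump the full pipeline specification shown above to a JSON file
        with open(path, 'w') as json_file:
            json.dump(self.to_dict(), json_file, indent=4)

    @classmethod
    def load(cls, path):
        # rebuild the pipeline from a previously saved JSON file
        with open(path, 'r') as json_file:
            return cls.from_dict(json.load(json_file))
```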
This will allow two things:
Make MLPipeline accept a list of block names in the `__init__` call.
Also, make all the MLPipeline subclasses work with `__init__` instead of `__new__`.
Currently, the block type is its name, so there cannot be two instances of the same block in a single pipeline, because the second one would overwrite the first one.
To avoid this, the block names should automatically get a unique identifier, such as a counter, appended to their name when they are added to the pipeline.
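A sketch of how the unique names could be generated, reusing the `#counter` suffix that already appears in the hyperparameter examples above:

```python
from collections import Counter


def _get_block_names(primitives):
    # count how many times each primitive appears and append the
    # counter to its name: ['a', 'b', 'a'] -> ['a#1', 'b#1', 'a#2']
    counter = Counter()
    block_names = []
    for primitive in primitives:
        counter[primitive] += 1
        block_names.append('{}#{}'.format(primitive, counter[primitive]))

    return block_names
```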
All the pipeline parameters, `init_params`, `fit_params`, etc., are expected as input in a dict with the format `{(block_name, param_name): param_value}`.
This format is not JSON serializable because of the tuple, and needs to be converted to a different format inside the MLPipeline in order to interact with the MLBlocks.
I suggest using nested dictionaries, as in `{block_name: {param_name: param_value}}`, which is JSON serializable (as long as the values also are) and greatly simplifies the internal MLPipeline implementation.
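A sketch of that conversion:

```python
def flat_to_nested(flat_params):
    # {(block_name, param_name): value} -> {block_name: {param_name: value}}
    # e.g. {('a_primitive#1', 'an_argument'): 'a_value'}
    #   -> {'a_primitive#1': {'an_argument': 'a_value'}}
    nested = {}
    for (block_name, param_name), value in flat_params.items():
        nested.setdefault(block_name, {})[param_name] = value

    return nested
```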
Add a command to the Makefile that finds and runs all example modules.
This will allow us to run the examples as an interim health check while we do not have proper unit tests in place.
Currently it is not possible to run the examples as a tox stage, because their behavior is not deterministic and they sometimes crash.
It looks like the usage of ml_json_parser can be simplified to depend on fewer methods and to avoid assigning methods as attributes on instance creation.
All the code should be covered with unit tests.
This issue should be used to reach an initial 100% coverage before we openly request external contributions to the project.
The data type of all the datasets is numpy arrays or pandas DataFrames, except for News Groups, which is a list.
Return the data as a numpy array to make the dataset interface uniform.
In the current version of MLBlocks, the keyword arguments for the MLBlock are prepared inside the MLPipeline, including keyword replacement and the lookup of a default value when an argument is missing.
As a consequence, if a primitive defines an argument name different from the keyword expected by the actual primitive method, the MLBlock instance expects to be called with the actual primitive keyword instead of the one defined in the primitive JSON.
An example of this can be seen in the `numpy.argmax` primitive.
The JSON contains:
"produce": {
"args": [
{
"name": "y",
"keyword": "a",
"type": "ndarray"
}
],
...
}
As a consequence of this, the block exposes the argument name as `y`:
>>> block = MLBlock('numpy.argmax')
>>> block.produce_args
[{'name': 'y', 'keyword': 'a', 'type': 'ndarray'}]
But, since the `keyword` entry is not parsed by the MLBlock class itself, calling it with the `y` keyword fails:
>>> block.produce(y=[[1, 2, 3]])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xals/Projects/MIT/MLBlocks/mlblocks/mlblock.py", line 310, in produce
return self.primitive(**produce_kwargs)
TypeError: argmax() got an unexpected keyword argument 'y'
And the same would happen if there was a default value defined.
To fix this, we can move the parsing logic from the MLPipeline to the MLBlock, so that the block methods can also be called with the exposed argument names and also have default values.
There should be the possibility to re-fit only a part of the pipeline, keeping the rest of the blocks as they are.
This is a discussion issue, as a follow-up to #71.
Should we natively support returning a class attribute value (or several) instead of calling a method of the class during the produce phase?
This could be achieved at the JSON level by simply not specifying the `method` key.
Then, we would have:
"produce": {
"output": [
{
"name": "feature_weights",
"attribute": "feature_importances_" # This would be optional, to be used only if the used name does not match the attribute name.
"type": "list"
}
]
}
I'm trying to use MLBlock instead of MLPipeline, and everything was fine, but whenever I try to fit my model the code crashes.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from mlblocks import MLBlock

block = MLBlock('sklearn.svm.SVC')

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7)

block.fit(X_train, y_train)
```

The traceback:

```
Traceback (most recent call last):
  File "<ipython-input-22-7c1505067a70>", line 1, in <module>
    runfile('/Users/najat/Documents/GitHub/Cardea/cardea/modeling/modeling.py', wdir='/Users/najat/Documents/GitHub/Cardea/cardea/modeling')
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/Users/najat/Documents/GitHub/Cardea/cardea/modeling/modeling.py", line 30, in <module>
    block.fit(X_train, y_train)
TypeError: fit() takes 1 positional argument but 3 were given
```
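Judging from the traceback, `fit` seems to accept keyword arguments only, so a call along these lines would presumably work (the `X`/`y` argument names follow the usual convention and may differ in the actual annotation):

```python
# pass the data as keyword arguments instead of positional ones
block.fit(X=X_train, y=y_train)
```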
One approach is to organize the primitives by problem type (image, audio, text, etc.).
Another approach, which can be combined with the first, is organizing them by stage: pre-processing, predicting, etc.
The SimpleCnn and LstmText examples crash from time to time.
The crashes seem to be triggered by some random values used internally.
I'm running a Random Forest Classifier algorithm and I'm trying to tune its hyperparameters. However, after inspecting the code, it seems that the algorithm uses only the default values and no automatic tuning is happening. Can you please tell me what exactly I should do in order to invoke and tune the hyperparameters?
```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score as score
from sklearn.model_selection import train_test_split

from mlblocks import MLBlock
from mlblocks import MLPipeline

primitives = [
    'sklearn.ensemble.RandomForestClassifier',
]
pipeline_for_one = MLPipeline(primitives)

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7)

pipeline_for_one.fit(X_train, y_train)
y_pred = pipeline_for_one.predict(X_test)

x = pipeline_for_one.blocks
xx = x.popitem()
b = xx[1]
b.instance
```

The output was (all default values):

```
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=10, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=0.1, min_samples_split=0.1,
            min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
```
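As far as I know, MLPipeline only exposes the tunable hyperparameters; it does not tune them by itself. A sketch of driving the tuning manually, using methods described elsewhere in this document (an external tuner such as BTB would normally propose the values; the `#1` suffix follows the block naming convention used above):

```python
# inspect which hyperparameters can be tuned
tunables = pipeline_for_one.get_tunable_hyperparameters()
print(tunables)

# set new values using the hierarchical format and refit; a tuner
# such as BTB would normally generate these proposals in a loop
pipeline_for_one.set_hyperparameters({
    'sklearn.ensemble.RandomForestClassifier#1': {
        'n_estimators': 100,
        'max_depth': 20,
    }
})
pipeline_for_one.fit(X_train, y_train)
print(score(y_test, pipeline_for_one.predict(X_test)))
```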
Pipeline and Primitive JSON Schema should be added to the project along with a tool that allows validating whether a primitive or pipeline annotation is valid to be used with MLBlocks.
When an MLBlock instance is created, if there is a conditional hyperparameter whose condition hyperparameter has been given as an init param, the conditional hyperparameter should be resolved and exposed as a regular tunable hyperparameter.
An example of this would be a primitive with the following hyperparameters:
"hyperparameters": {
"foo": {
"type": "str",
"values": ["a", "b"]
},
"bar": {
"type": "conditional",
"condition": "foo",
"values": {
"a": {
"type": "int",
"range": [1, 10]
"b": {
"type": "float",
"range": [0.0, 0.1]
}
}
}
}
That is instantiated as:

```python
pipeline = MLPipeline(['a_primitive'], {'a_primitive': {'foo': 'a'}})
```

In this case, calling `get_tunable_hyperparameters` should return:
```json
{
    "bar": {
        "type": "int",
        "range": [1, 10]
    }
}
```
Update documentation pages.
A good reference is the BTB documentation.
I was walking through the tutorial and everything was working just fine; however, after I called the `pipeline.fit()` function, a TypeError was raised.
```python
# The commands I ran before the crash
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from mlblocks import MLPipeline

import pandas as pd

primitives = [
    'sklearn.preprocessing.StandardScaler',
    'sklearn.ensemble.RandomForestClassifier'
]
pipeline = MLPipeline(primitives)

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7)

pipeline.fit(X_train, y_train)
```

```
Traceback (most recent call last):
  File "<ipython-input-269-94584d36b1c1>", line 1, in <module>
    runfile('/Users/najat/Documents/MLBlocks/mlblocks_test.py', wdir='/Users/najat/Documents/MLBlocks')
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/Users/najat/Documents/MLBlocks/mlblocks_test.py", line 23, in <module>
    pipeline.fit(X_train, y_train)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/mlblocks/mlpipeline.py", line 139, in fit
    block.fit(**fit_args)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/mlblocks/mlblock.py", line 120, in fit
    getattr(self.instance, self.fit_method)(**fit_args)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 590, in fit
    return self.partial_fit(X, y)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 612, in partial_fit
    warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 41, in _assert_all_finite
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py", line 35, in _sum
    return umr_sum(a, axis, dtype, out, keepdims, initial)
TypeError: reduce() takes at most 5 arguments (6 given)
```
Implement pipeline loading by name and use entry points for pipeline discovery.
The API will work as follows:

- The `primitives` argument will be made a keyword argument instead of a positional argument.
- A new `pipeline` argument will be added to the `MLPipeline` class, in the first place.
- The `pipeline` argument will accept four different types:
  - If it is a `str`, it will be considered to be the pipeline name, and it will be searched for.
  - If it is a `list`, it will be considered to be the primitives list.
  - If it is a `dict`, it will be considered to be a complete pipeline specification.
  - If it is an `MLPipeline` instance, it will be converted to a `dict` using its `to_dict` method.

For pipeline discovery, the `mlpipelines` entry_point name will be used like the `mlprimitives` one is currently being used.
For this, the module `primitives` will be renamed to `discovery`, and the methods will be remade to work for both pipelines and primitives.
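A discovery sketch under these assumptions (the `mlpipelines` entry_point group comes from the description above; the one-JSON-per-pipeline folder layout is an illustrative assumption):

```python
import json
import os

import pkg_resources


def find_pipeline(name):
    # look for ``{name}.json`` inside every folder published under
    # the ``mlpipelines`` entry_point group
    for entry_point in pkg_resources.iter_entry_points('mlpipelines'):
        path = os.path.join(entry_point.load(), name + '.json')
        if os.path.isfile(path):
            with open(path, 'r') as json_file:
                return json.load(json_file)

    raise ValueError('Unknown pipeline: {}'.format(name))
```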
Add logging statements to the code.
Also, find a way to show the name of the failing block when an exception is raised within a pipeline.
Add functions to explore the available primitives and pipelines filtering by their metadata attributes.
Re-implement the Block JSON parsing and Pipeline creation to make them work with as little custom Python code per primitive as possible; if possible, none at all.
Currently, the hyperparameters of each primitive are specified individually.
However, there are some scenarios where hyperparameters from two different blocks should match each other.
An example of this are the `keras.preprocessing.sequence.pad_sequences` and `keras.Sequential.LSTMTextClassifier` primitives: the `pad_sequences` primitive has a `maxlen` hyperparameter that should be exactly the same value that is given to the `LSTMTextClassifier` as the `input_length` hyperparameter.
Define a standard data type for MLBlocks (pandas or numpy), and provide a way to specify whether a primitive works with this data type or needs a data conversion before or after it is used.
Also, implement conversion functions between the most common data types.
During runtime, the system should use the helper functions to convert the data types back and forth when using primitives that do not work with the standard type.
Also, the system should be smart enough to skip unnecessary conversions when passing data between blocks that support the same data type.
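A sketch of such conversion helpers, assuming pandas DataFrames and numpy arrays are the two types involved:

```python
import numpy as np
import pandas as pd


def to_dataframe(X):
    # skip the conversion if the data already is a DataFrame
    return X if isinstance(X, pd.DataFrame) else pd.DataFrame(X)


def to_ndarray(X):
    # DataFrames expose .values; anything else goes through asarray,
    # which is also a no-op for data that already is an ndarray
    return X.values if isinstance(X, pd.DataFrame) else np.asarray(X)
```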
The primitive JSON files should be looked for in at least two places:

- `./mlblocks_primitives`, in the current working directory
- an `mlblocks_primitives` folder inside `sys.prefix` (see the related issue below)

There should also be a method to add a custom folder from which primitives should be loaded, and a dict somewhere in the package where the loaded primitives are kept "in memory" to avoid having to load them from disk more than once.
Add a way to indicate that a primitive does not need to be applied to the whole input, but only to a part of it.
The columns to which the primitive is applied should be selectable by either column name or index, and it should be possible to pass the selection in the live variables, allowing the implementation of detector primitives that compute the selection in a previous step.
Travis was removed some days ago because the repository was private and Travis had no visibility over it, but now that it's public we can set it up and use it to create the gh-pages documentation.
The work to be done is:
Some primitives work column by column or row by row, so it should be possible to indicate that in the JSON and have this done without any additional custom code for each primitive.
The current implementation searches for the `mlblocks_primitives` folders in the `sys.prefix` folder (the virtualenv or Python installation root folder) and in the current working directory.
Instead of that, or in addition to it, `entry_points` should be used to allow multiple primitives folders to be installed by different libraries.
This is a good explanation of how to achieve this: https://amir.rachum.com/blog/2017/07/28/python-entry-points/
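A setup.py sketch of how a third-party library could publish its own primitives folder, following the approach from the linked article; the entry_point group name and the `jsons_path` target are illustrative assumptions:

```python
from setuptools import setup

setup(
    name='my-primitives-library',
    version='0.1.0',
    packages=['my_library'],
    entry_points={
        'mlblocks_primitives': [
            # my_library/__init__.py would define something like
            # MY_PRIMITIVES_PATH = os.path.join(
            #     os.path.dirname(__file__), 'jsons')
            'jsons_path=my_library:MY_PRIMITIVES_PATH',
        ],
    },
)
```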
The README example should show:
A new API for MLBlocks will be implemented.
These are the most relevant changes:
If the result of a `produce` method is an empty dataset, the next primitive should not be called.
Since each JSON file being parsed represents an MLBlock, the parsing methods should be part of the MLBlock itself, or at least called from within its code.
I have been trying to produce the feature importances from a classifier (RandomForest). The `feature_importances_` attribute does not take arguments, so the MLPipeline does not execute (TypeError: 'numpy.ndarray' object is not callable), presumably because declaring an attribute under the `method` key makes MLBlocks try to call a numpy array.
I left the arguments empty in the produce part of the primitive definition file:
```json
{
"name": "sklearn.ensemble.RandomForestClassifier1",
"documentation": "http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html",
"description": "Scikit-learn RandomForestClassifier.",
"classifiers": {
"type": "estimator",
"subtype": "classifier"
},
"modalities": [],
"primitive": "sklearn.ensemble.RandomForestClassifier",
"fit": {
"method": "fit",
"args": [
{
"name": "X",
"type": "DataFrame"
},
{
"name": "y",
"type": "Series"
}
]
},
"produce": {
"method": "feature_importances_",
"args": [],
"output": [
{
"name": "y",
"type": "Series"
}
]
},
"hyperparameters": {
"fixed": {
"n_jobs": {
"type": "int",
"default": -1
}
},
"tunable": {
"criterion": {
"type": "str",
"default": "entropy",
"values": ["entropy", "gini"]
},
"max_features": {
"type": "str",
"default": null,
"range": [null, "auto", "log2"]
},
"max_depth": {
"type": "int",
"default": 10,
"range": [1, 30]
},
"min_samples_split": {
"type": "float",
"default": 0.1,
"range": [0.0001, 0.5]
},
"min_samples_leaf": {
"type": "float",
"default": 0.1,
"range": [0.0001, 0.5]
},
"n_estimators": {
"type": "int",
"default": 30,
"values": [2, 500]
},
"class_weight": {
"type": "str",
"default": null,
"range": [null, "balanced"]
}
}
}
}
```
The current implementation does not isolate the MLBlock hyperparameters dictionary from the underlying primitive, which allows the primitive to modify its contents, leading to unexpected behaviors and bugs.
This isolation should be enforced by using deepcopy instead of a simple copy when returning the hyperparameters from the `get_hyperparameters` method, and by always accessing the hyperparameters through this method, instead of directly, when passing them to the underlying primitive.
Add two new methods to the Dataset objects:

- `describe()`: Return a text description of the dataset contents.
- `get_splits(splits={number of splits})`: Split the dataset for cross validation. If the number of splits is 1, use `sklearn.model_selection.train_test_split`.

The MLBlocks hyperparameter specification format consists of a hierarchical tree of dictionaries where each key is a block name and each value is a dictionary with the complete hyperparameter specification for that block.
BTB, instead, specifies the hyperparameters as a flat dictionary whose keys are two-element tuples containing the block name first and the hyperparameter name second, and whose values are the corresponding hyperparameter values.
We should add support for that format in the following ways:

- `MLPipeline.set_hyperparameters` will accept both flat and hierarchical formats as input. If a flat dictionary is passed, it will be converted to a hierarchical one.
- `MLPipeline.get_hyperparameters` will accept a new argument, `flat=False`. If set to `True`, a flat dictionary will be returned; otherwise (default), the hierarchical one will be returned.
- `MLPipeline.get_tunable_hyperparameters` will accept a new argument, `flat=False`. If set to `True`, a flat dictionary will be returned; otherwise (default), the hierarchical one will be returned.
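A sketch of the flattening that the `flat=True` variants could use (the inverse of the nesting conversion shown earlier in this document):

```python
def nested_to_flat(nested):
    # {block_name: {param_name: value}} -> {(block_name, param_name): value}
    return {
        (block_name, param_name): value
        for block_name, params in nested.items()
        for param_name, value in params.items()
    }
```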
MLPrimitives has been updated with new primitives since the documentation was created.
A new multitable dataset has been included, which can be used to showcase multitable pipelines.
A section should be added as well to show how to save and load a pipeline as a JSON.
Sometimes it is necessary to append or merge the output of a primitive with its own input, or with some other variable.
Two possibilities exist for this:

```python
X = pd.concat([X1, X2], axis=1)
```
A method should be included with functions to download data for testing from an S3 bucket.