mlbazaar / mlblocks

A library for composing end-to-end tunable machine learning pipelines.

Home Page: https://mlbazaar.github.io/MLBlocks
License: MIT License
Add a multitable dataset
The directories `mlblocks/components/pipelines` and `examples` should be removed from this repository and moved into the MLBlocks-Demos one, which is still to be created.
Introduce a way to extract intermediate outputs from a Pipeline without having to rebuild it and refit it.
The functionality should be something along these lines:

```python
primitives = ['a', 'b', 'c', 'd']
pipeline = MLPipeline(primitives)

pipeline.fit(train_X, train_y)

# get the output of block 'b' instead of the final pipeline output
output_of_b = pipeline.predict(train_X, output_='b')
```
Some elements in the tox config are not required and can be cleaned up.
Add more datasets to cover a broader scope of data modalities and task types:
| data_modality | task_type | done | dataset |
|---|---|---|---|
| audio | classification | no | ESC-50 |
| graph | communityDetection | no | 6_70_com_amazon |
| graph | graphMatching | no | LL1_DIC28_net |
| graph | linkPrediction | yes | UMLS |
| graph | vertexNomination | no | LL1_net_nomination_seed |
| image | classification | yes | USPS |
| image | regression | yes | HandGeometry |
| single_table | classification | yes | Iris |
| single_table | collaborativeFiltering | no | 60_jester |
| single_table | regression | yes | Boston |
| text | classification | no | Personae |
| timeseries | classification | no | LL1_Trace |
Add a JSON keyword to specify dependencies or requirements, and validate them during MLBlock instantiation.
If a dependency is missing or an incompatible version is installed, a user-friendly exception should be raised when the MLBlock is created, asking the user to install the required dependency.
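A minimal validation sketch, assuming a hypothetical `requirements` list in the primitive JSON and using `pkg_resources` to check the installed versions:

```python
import pkg_resources


def validate_requirements(block_name, requirements):
    # ``requirements`` would come from the hypothetical JSON keyword,
    # e.g. "requirements": ["numpy>=1.15", "Keras<3"]
    for requirement in requirements:
        try:
            pkg_resources.require(requirement)
        except pkg_resources.DistributionNotFound:
            raise ImportError(
                '{} requires {}. Please install it.'.format(
                    block_name, requirement))
        except pkg_resources.VersionConflict as vc:
            raise ImportError(
                '{} requires {}, but {} is installed. Please install '
                'a compatible version.'.format(
                    block_name, requirement, vc.dist))
```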
Implement a way to save and load an MLPipeline to and from a JSON file.
The JSON file structure would contain the `__init__` arguments of the MLPipeline, as well as the `hyperparameters` and `tunable_hyperparameters`:
```json
{
    "primitives": [
        "a_primitive",
        "another_primitive"
    ],
    "init_params": {
        "a_primitive": {
            "an_argument": "a_value"
        }
    },
    "hyperparameters": {
        "a_primitive#1": {
            "an_argument": "a_value",
            "another_argument": "another_value"
        },
        "another_primitive#1": {
            "yet_another_argument": "yet_another_value"
        }
    },
    "tunable_hyperparameters": {
        "another_primitive#1": {
            "yet_another_argument": {
                "type": "str",
                "default": "a_default_value",
                "values": [
                    "a_default_value",
                    "yet_another_value"
                ]
            }
        }
    }
}
```
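A sketch of what the save and load methods could look like; `to_dict` exists per the new-API description elsewhere in this document, while `from_dict` is an assumed counterpart:

```python
import json


class MLPipeline(object):

    def save(self, path):
        # dump the full pipeline specification shown above to a JSON file
        with open(path, 'w') as json_file:
            json.dump(self.to_dict(), json_file, indent=4)

    @classmethod
    def load(cls, path):
        # rebuild the pipeline from a previously saved JSON file
        with open(path, 'r') as json_file:
            return cls.from_dict(json.load(json_file))
```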
This will allow two things:
Make MLPipeline accept a list of block names in the `__init__` call.
Also, make all the MLPipeline subclasses work with `__init__` instead of `__new__`.
Currently, the block type is its name, so there cannot be two instances of the same block in a single pipeline, because the second one would overwrite the first one.
To avoid this, the block names should automatically get a unique identifier, such as a counter, appended to their name when they are added to the pipeline.
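A sketch of how the unique names could be generated, reusing the `#counter` suffix that already appears in the hyperparameter examples above:

```python
from collections import Counter


def _get_block_names(primitives):
    # count how many times each primitive appears and append the
    # counter to its name: ['a', 'b', 'a'] -> ['a#1', 'b#1', 'a#2']
    counter = Counter()
    block_names = []
    for primitive in primitives:
        counter[primitive] += 1
        block_names.append('{}#{}'.format(primitive, counter[primitive]))

    return block_names
```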
All the pipeline parameters, `init_params`, `fit_params`, etc., are expected as input in a dict with the format `{(block_name, param_name): param_value}`.
This format is not JSON serializable because of the tuple, and needs to be converted to a different format inside the MLPipeline in order to interact with the MLBlocks.
I suggest using nested dictionaries, as in `{block_name: {param_name: param_value}}`, which is JSON serializable (as long as the values also are) and greatly simplifies the internal MLPipeline implementation.
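A sketch of that conversion:

```python
def flat_to_nested(flat_params):
    # {(block_name, param_name): value} -> {block_name: {param_name: value}}
    # e.g. {('a_primitive#1', 'an_argument'): 'a_value'}
    #   -> {'a_primitive#1': {'an_argument': 'a_value'}}
    nested = {}
    for (block_name, param_name), value in flat_params.items():
        nested.setdefault(block_name, {})[param_name] = value

    return nested
```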
Add a command to the Makefile that finds and runs all example modules.
This will allow us to run the examples as an interim health check while we do not have proper unit tests in place.
Currently it is not possible to run the examples as a tox stage, because their behavior is not deterministic and they sometimes crash.
It looks like the usage of ml_json_parser can be simplified to depend on fewer methods and to avoid assigning methods as attributes on instance creation.
All the code should be covered with unit tests.
This issue should be used to reach an initial 100% coverage before we openly request external contributions to the project.
The data type of all the datasets is numpy arrays or pandas DataFrames, except for News Groups, which is a list.
Return the data as a numpy array to make the dataset interface uniform.
In the current version of MLBlocks, the keyword arguments for the MLBlock are prepared inside the MLPipeline, including keyword replacement and the lookup of a default value when an argument is missing.
As a consequence, if a primitive defines an argument name different from the keyword expected by the actual primitive method, the MLBlock instance expects to be called with the actual primitive keyword instead of the one defined in the primitive JSON.
An example of this can be seen in the `numpy.argmax` primitive.
The JSON contains:
"produce": {
"args": [
{
"name": "y",
"keyword": "a",
"type": "ndarray"
}
],
...
}
As a consequence of this, the block exposes the argument name as `y`:
>>> block = MLBlock('numpy.argmax')
>>> block.produce_args
[{'name': 'y', 'keyword': 'a', 'type': 'ndarray'}]
But, since the `keyword` entry is not parsed by the MLBlock class itself, calling it with the `y` keyword fails:
>>> block.produce(y=[[1, 2, 3]])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xals/Projects/MIT/MLBlocks/mlblocks/mlblock.py", line 310, in produce
return self.primitive(**produce_kwargs)
TypeError: argmax() got an unexpected keyword argument 'y'
And the same would happen if there was a default value defined.
To fix this, we can move the parsing logic from the MLPipeline to the MLBlock, so that the block methods can also be called with the exposed argument names and also have default values.
There should be the possibility to re-fit only a part of the pipeline, keeping the rest of the blocks as they are.
This is a discussion issue, as a follow-up to #71.
Should we natively support returning a class attribute value (or several) instead of calling a method of the class during the produce phase?
This could be achieved at the JSON level by simply not specifying the `method` key.
Then, we would have:
"produce": {
"output": [
{
"name": "feature_weights",
"attribute": "feature_importances_" # This would be optional, to be used only if the used name does not match the attribute name.
"type": "list"
}
]
}
I'm trying to use MLBlock instead of MLPipeline, and everything was fine, but whenever I try to fit my model the code crashes.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from mlblocks import MLBlock

block = MLBlock('sklearn.svm.SVC')

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7)

block.fit(X_train, y_train)
```

The traceback:

```
Traceback (most recent call last):
  File "<ipython-input-22-7c1505067a70>", line 1, in <module>
    runfile('/Users/najat/Documents/GitHub/Cardea/cardea/modeling/modeling.py', wdir='/Users/najat/Documents/GitHub/Cardea/cardea/modeling')
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/Users/najat/Documents/GitHub/Cardea/cardea/modeling/modeling.py", line 30, in <module>
    block.fit(X_train, y_train)
TypeError: fit() takes 1 positional argument but 3 were given
```
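Judging from the traceback, `fit` seems to accept keyword arguments only, so a call along these lines would presumably work (the `X`/`y` argument names follow the usual convention and may differ in the actual annotation):

```python
# pass the data as keyword arguments instead of positional ones
block.fit(X=X_train, y=y_train)
```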
One approach is to organize the primitives by problem type (image, audio, text, etc.).
Another approach, which can be combined with the first, is organizing them by stage: pre-processing, predicting, etc.
The SimpleCnn and LstmText examples crash from time to time.
The crashes seem to be triggered by some random values used internally.
I'm running a Random Forest Classifier algorithm and I'm trying to tune its hyperparameters. However, after inspecting the code, it seems that the algorithm uses only the default values and no automatic tuning is happening. Can you please tell me what exactly I should do in order to invoke and tune the hyperparameters?
```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score as score
from sklearn.model_selection import train_test_split

from mlblocks import MLBlock
from mlblocks import MLPipeline

primitives = [
    'sklearn.ensemble.RandomForestClassifier',
]
pipeline_for_one = MLPipeline(primitives)

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7)

pipeline_for_one.fit(X_train, y_train)
y_pred = pipeline_for_one.predict(X_test)

x = pipeline_for_one.blocks
xx = x.popitem()
b = xx[1]
b.instance
```

The output was (all default values):

```
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=10, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=0.1, min_samples_split=0.1,
            min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
```
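As far as I know, MLPipeline only exposes the tunable hyperparameters; it does not tune them by itself. A sketch of driving the tuning manually, using methods described elsewhere in this document (an external tuner such as BTB would normally propose the values; the `#1` suffix follows the block naming convention used above):

```python
# inspect which hyperparameters can be tuned
tunables = pipeline_for_one.get_tunable_hyperparameters()
print(tunables)

# set new values using the hierarchical format and refit; a tuner
# such as BTB would normally generate these proposals in a loop
pipeline_for_one.set_hyperparameters({
    'sklearn.ensemble.RandomForestClassifier#1': {
        'n_estimators': 100,
        'max_depth': 20,
    }
})
pipeline_for_one.fit(X_train, y_train)
print(score(y_test, pipeline_for_one.predict(X_test)))
```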
Pipeline and Primitive JSON Schema should be added to the project along with a tool that allows validating whether a primitive or pipeline annotation is valid to be used with MLBlocks.
When an MLBlock instance is created, if there is a conditional hyperparameter whose condition hyperparameter has been given as an init param, the conditional hyperparameter should be resolved and exposed as a regular tunable hyperparameter.
An example of this would be a primitive with the following hyperparameters:
"hyperparameters": {
"foo": {
"type": "str",
"values": ["a", "b"]
},
"bar": {
"type": "conditional",
"condition": "foo",
"values": {
"a": {
"type": "int",
"range": [1, 10]
"b": {
"type": "float",
"range": [0.0, 0.1]
}
}
}
}
That is instantiated as:

```python
pipeline = MLPipeline(['a_primitive'], {'a_primitive': {'foo': 'a'}})
```

In this case, calling `get_tunable_hyperparameters` should return:
```json
{
    "bar": {
        "type": "int",
        "range": [1, 10]
    }
}
```
Update documentation pages.
A good reference is the BTB documentation.
I was walking through the tutorial and everything was working just fine; however, after I called the `pipeline.fit()` function, a TypeError was raised.
```python
# The commands I ran before the crash
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from mlblocks import MLPipeline

import pandas as pd

primitives = [
    'sklearn.preprocessing.StandardScaler',
    'sklearn.ensemble.RandomForestClassifier'
]
pipeline = MLPipeline(primitives)

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=7)

pipeline.fit(X_train, y_train)
```

```
Traceback (most recent call last):
  File "<ipython-input-269-94584d36b1c1>", line 1, in <module>
    runfile('/Users/najat/Documents/MLBlocks/mlblocks_test.py', wdir='/Users/najat/Documents/MLBlocks')
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/Users/najat/Documents/MLBlocks/mlblocks_test.py", line 23, in <module>
    pipeline.fit(X_train, y_train)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/mlblocks/mlpipeline.py", line 139, in fit
    block.fit(**fit_args)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/mlblocks/mlblock.py", line 120, in fit
    getattr(self.instance, self.fit_method)(**fit_args)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 590, in fit
    return self.partial_fit(X, y)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 612, in partial_fit
    warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 41, in _assert_all_finite
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
  File "/Users/najat/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py", line 35, in _sum
    return umr_sum(a, axis, dtype, out, keepdims, initial)
TypeError: reduce() takes at most 5 arguments (6 given)
```
Implement pipeline loading by name and use entry points for pipeline discovery.
The API will work as follows:

- The `primitives` argument will be made a keyword argument instead of a positional argument.
- A new `pipeline` argument will be added to the `MLPipeline` class, in the first place.
- The `pipeline` argument will accept four different types:
  - If it is a `str`, it will be considered to be the pipeline name, and it will be searched for.
  - If it is a `list`, it will be considered to be the primitives list.
  - If it is a `dict`, it will be considered to be a complete pipeline specification.
  - If it is an `MLPipeline` instance, it will be converted to a `dict` using its `to_dict` method.

For pipeline discovery, the `mlpipelines` entry_point name will be used like the `mlprimitives` one is currently being used.
For this, the module `primitives` will be renamed to `discovery`, and the methods will be remade to work for both pipelines and primitives.
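A discovery sketch under these assumptions (the `mlpipelines` entry_point group comes from the description above; the one-JSON-per-pipeline folder layout is an illustrative assumption):

```python
import json
import os

import pkg_resources


def find_pipeline(name):
    # look for ``{name}.json`` inside every folder published under
    # the ``mlpipelines`` entry_point group
    for entry_point in pkg_resources.iter_entry_points('mlpipelines'):
        path = os.path.join(entry_point.load(), name + '.json')
        if os.path.isfile(path):
            with open(path, 'r') as json_file:
                return json.load(json_file)

    raise ValueError('Unknown pipeline: {}'.format(name))
```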
Add logging statements to the code.
Also, find a way to show the name of the failing block when an exception is raised within a pipeline.
Add functions to explore the available primitives and pipelines filtering by their metadata attributes.
Re-implement the Block JSON parsing and Pipeline creation to make them work with as little custom Python code per primitive as possible; if possible, none at all.
Currently, the hyperparameters of each primitive are specified individually.
However, there are some scenarios where hyperparameters from two different blocks should match each other.
An example of this are the `keras.preprocessing.sequence.pad_sequences` and `keras.Sequential.LSTMTextClassifier` primitives: the `pad_sequences` primitive has a `maxlen` hyperparameter that should be exactly the same value that is given to the `LSTMTextClassifier` as the `input_length` hyperparameter.
Define a standard data type for MLBlocks (pandas or numpy), and provide a way to specify whether a primitive works with this data type or needs a data conversion before or after it is used.
Also, implement conversion functions between the most common data types.
During runtime, the system should use the helper functions to convert the data types back and forth when using primitives that do not work with the standard type.
Also, the system should be smart enough to skip unnecessary conversions when passing data between blocks that support the same data type.
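A sketch of such conversion helpers, assuming pandas DataFrames and numpy arrays are the two types involved:

```python
import numpy as np
import pandas as pd


def to_dataframe(X):
    # skip the conversion if the data already is a DataFrame
    return X if isinstance(X, pd.DataFrame) else pd.DataFrame(X)


def to_ndarray(X):
    # DataFrames expose .values; anything else goes through asarray,
    # which is also a no-op for data that already is an ndarray
    return X.values if isinstance(X, pd.DataFrame) else np.asarray(X)
```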
The primitive JSON files should be looked for in at least two places:

- `./mlblocks_primitives`, in the current working directory
- an `mlblocks_primitives` folder inside `sys.prefix` (see the related issue below)

There should also be a method to add a custom folder from which primitives should be loaded, and a dict somewhere in the package where the loaded primitives are kept "in memory" to avoid having to load them from disk more than once.
Add a way to indicate that a primitive does not need to be applied to the whole input, but only to a part of it.
The columns to which the primitive is applied should be selectable by either column name or index, and it should be possible to pass the selection in the live variables, allowing the implementation of detector primitives that compute the selection in a previous step.
Travis was removed some days ago because the repository was private and Travis had no visibility over it, but now that it's public we can set it up and use it to create the gh-pages documentation.
The work to be done is:
Some primitives work column by column or row by row, so it should be possible to indicate that in the JSON and have this done without any additional custom code for each primitive.
The current implementation searches for the `mlblocks_primitives` folders in the `sys.prefix` folder (the virtualenv or Python installation root folder) and in the current working directory.
Instead of that, or in addition to it, `entry_points` should be used to allow multiple primitives folders to be installed by different libraries.
This is a good explanation of how to achieve this: https://amir.rachum.com/blog/2017/07/28/python-entry-points/
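A setup.py sketch of how a third-party library could publish its own primitives folder, following the approach from the linked article; the entry_point group name and the `jsons_path` target are illustrative assumptions:

```python
from setuptools import setup

setup(
    name='my-primitives-library',
    version='0.1.0',
    packages=['my_library'],
    entry_points={
        'mlblocks_primitives': [
            # my_library/__init__.py would define something like
            # MY_PRIMITIVES_PATH = os.path.join(
            #     os.path.dirname(__file__), 'jsons')
            'jsons_path=my_library:MY_PRIMITIVES_PATH',
        ],
    },
)
```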
The README example should show:
A new API for MLBlocks will be implemented.
These are the most relevant changes:
If the result of a `produce` method is an empty dataset, the next primitive should not be called.
Since each JSON file being parsed represents an MLBlock, the parsing methods should be part of the MLBlock itself, or at least called from within its code.
I have been trying to produce the feature importances from a classifier (RandomForest). The `feature_importances_` attribute does not take arguments, so the MLPipeline does not execute (TypeError: 'numpy.ndarray' object is not callable), presumably because declaring an attribute under the `method` key makes MLBlocks try to call a numpy array.
I left the arguments empty in the produce part of the primitive definition file:
```json
{
"name": "sklearn.ensemble.RandomForestClassifier1",
"documentation": "http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html",
"description": "Scikit-learn RandomForestClassifier.",
"classifiers": {
"type": "estimator",
"subtype": "classifier"
},
"modalities": [],
"primitive": "sklearn.ensemble.RandomForestClassifier",
"fit": {
"method": "fit",
"args": [
{
"name": "X",
"type": "DataFrame"
},
{
"name": "y",
"type": "Series"
}
]
},
"produce": {
"method": "feature_importances_",
"args": [],
"output": [
{
"name": "y",
"type": "Series"
}
]
},
"hyperparameters": {
"fixed": {
"n_jobs": {
"type": "int",
"default": -1
}
},
"tunable": {
"criterion": {
"type": "str",
"default": "entropy",
"values": ["entropy", "gini"]
},
"max_features": {
"type": "str",
"default": null,
"range": [null, "auto", "log2"]
},
"max_depth": {
"type": "int",
"default": 10,
"range": [1, 30]
},
"min_samples_split": {
"type": "float",
"default": 0.1,
"range": [0.0001, 0.5]
},
"min_samples_leaf": {
"type": "float",
"default": 0.1,
"range": [0.0001, 0.5]
},
"n_estimators": {
"type": "int",
"default": 30,
"values": [2, 500]
},
"class_weight": {
"type": "str",
"default": null,
"range": [null, "balanced"]
}
}
}
}
```
The current implementation does not isolate the MLBlock hyperparameters dictionary from the underlying primitive, which allows the primitive to modify its contents, leading to unexpected behaviors and bugs.
This isolation should be enforced by using deepcopy instead of a simple copy when returning the hyperparameters from the `get_hyperparameters` method, and by always accessing the hyperparameters through this method, instead of directly, when passing them to the underlying primitive.
Add two new methods to the Dataset objects:

- `describe()`: Return a text description of the dataset contents.
- `get_splits(splits={number of splits})`: Split the dataset for cross validation. If the number of splits is 1, use `sklearn.model_selection.train_test_split`.

The MLBlocks hyperparameter specification format consists of a hierarchical tree of dictionaries where each key is a block name and each value is a dictionary with the complete hyperparameter specification for that block.
BTB, instead, specifies the hyperparameters as a flat dictionary whose keys are two-element tuples containing the block name first and the hyperparameter name second, and whose values are the corresponding hyperparameter values.
We should add support for that format in the following ways:

- `MLPipeline.set_hyperparameters` will accept both flat and hierarchical formats as input. If a flat dictionary is passed, it will be converted to a hierarchical one.
- `MLPipeline.get_hyperparameters` will accept a new argument, `flat=False`. If set to `True`, a flat dictionary will be returned; otherwise (default), the hierarchical one will be returned.
- `MLPipeline.get_tunable_hyperparameters` will accept a new argument, `flat=False`. If set to `True`, a flat dictionary will be returned; otherwise (default), the hierarchical one will be returned.
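A sketch of the flattening that the `flat=True` variants could use (the inverse of the nesting conversion shown earlier in this document):

```python
def nested_to_flat(nested):
    # {block_name: {param_name: value}} -> {(block_name, param_name): value}
    return {
        (block_name, param_name): value
        for block_name, params in nested.items()
        for param_name, value in params.items()
    }
```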
MLPrimitives has been updated with new primitives since the documentation was created.
A new multitable dataset has been included, which can be used to showcase multitable pipelines.
A section should be added as well to show how to save and load a pipeline as a JSON.
Sometimes it is necessary to append or merge the output of a primitive with its own input, or with some other variable.
Two possibilities exist for this:

```python
X = pd.concat([X1, X2], axis=1)
```
A method should be included with functions to download data for testing from an S3 bucket.