
A library for composing end-to-end tunable machine learning pipelines.

Home Page: https://mlbazaar.github.io/MLBlocks

License: MIT License

Languages: Python 95.09%, Makefile 4.91%
Topics: machine-learning, hyperparameters, pipelines, primitives

mlblocks's Introduction

An Open Source Project from the Data to AI Lab, at MIT


Pipelines and Primitives for Machine Learning and Data Science.



MLBlocks

Overview

MLBlocks is a simple framework for composing end-to-end tunable Machine Learning Pipelines by seamlessly combining tools from any Python library with a simple, common and uniform interface.

Features include:

  • Build Machine Learning Pipelines combining any Machine Learning library in Python.
  • Access a repository with hundreds of primitives and pipelines, carefully curated by Machine Learning and Domain experts, ready to be used with little to no Python code to write.
  • Extract machine-readable information about which hyperparameters can be tuned and within which ranges, allowing automated integration with Hyperparameter Optimization tools like BTB.
  • Build complex multi-branch pipelines and DAG configurations, with an unlimited number of inputs and outputs per primitive.
  • Save and load pipelines easily using JSON annotations.

Install

Requirements

MLBlocks has been developed and tested on Python 3.6, 3.7, 3.8, 3.9, and 3.10.

Install with pip

The easiest and recommended way to install MLBlocks is using pip:

pip install mlblocks

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project please read the Contributing Guide.

MLPrimitives

In order to be usable, MLBlocks requires a compatible primitives library.

The official library, required to follow the MLBlocks tutorials, is MLPrimitives, which you can install with this command:

pip install mlprimitives

Quickstart

Below is a short example of how to use MLBlocks to solve the Adult Census dataset classification problem using a pipeline that combines primitives from MLPrimitives, scikit-learn and xgboost.

import pandas as pd
from mlblocks import MLPipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('http://mlblocks.s3.amazonaws.com/census.csv')
label = dataset.pop('label')

X_train, X_test, y_train, y_test = train_test_split(dataset, label, stratify=label)

primitives = [
    'mlprimitives.custom.preprocessing.ClassEncoder',
    'mlprimitives.custom.feature_extraction.CategoricalEncoder',
    'sklearn.impute.SimpleImputer',
    'xgboost.XGBClassifier',
    'mlprimitives.custom.preprocessing.ClassDecoder'
]
pipeline = MLPipeline(primitives)

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

accuracy_score(y_test, predictions)

What's Next?

If you want to learn more about how to tune the pipeline hyperparameters, save and load the pipelines using JSON annotations or build complex multi-branched pipelines, please check our documentation site.
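As a quick preview, here is a minimal sketch of what tuning and JSON round-tripping look like. It assumes the pipeline object from the Quickstart above and the documented MLPipeline methods get_tunable_hyperparameters, set_hyperparameters, to_dict and from_dict; check the documentation for the exact API of your installed version.

import json

from mlblocks import MLPipeline

# Inspect which hyperparameters can be tuned and within which ranges.
tunable = pipeline.get_tunable_hyperparameters()

# Set new values, nested by block name (blocks are suffixed with a counter).
pipeline.set_hyperparameters({
    'xgboost.XGBClassifier#1': {
        'n_estimators': 200    # assumed tunable hyperparameter, for illustration
    }
})

# Save the pipeline specification as a JSON annotation and load it back.
with open('pipeline.json', 'w') as json_file:
    json.dump(pipeline.to_dict(), json_file, indent=4)

with open('pipeline.json') as json_file:
    pipeline = MLPipeline.from_dict(json.load(json_file))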

Also do not forget to have a look at the notebook tutorials!

Citing MLBlocks

If you use MLBlocks for your research, please consider citing our related papers.

For the current design of MLBlocks and its usage within the larger Machine Learning Bazaar project at the MIT Data To AI Lab, please see:

Micah J. Smith, Carles Sala, James Max Kanter, and Kalyan Veeramachaneni. "The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development." arXiv Preprint 1905.08942. 2019.

@article{smith2019mlbazaar,
  author = {Smith, Micah J. and Sala, Carles and Kanter, James Max and Veeramachaneni, Kalyan},
  title = {The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development},
  journal = {arXiv e-prints},
  year = {2019},
  eid = {arXiv:1905.08942},
  pages = {arXiv:1905.08942},
  archivePrefix = {arXiv},
  eprint = {1905.08942},
}

For the first MLBlocks version from 2015, designed only for multi-table, multi-entity temporal data, please refer to Bryan Collazo’s thesis.

With the recent availability of a multitude of libraries and tools, we decided it was time to integrate them and expand the library to address other data types (images, text, graphs, time series) and to integrate with deep learning libraries.

mlblocks's People

Contributors

csala, erica-chiu, jdtheripperpc, kveerama, lauragustafson, manuelalvarezc, pvk-developer, sarahmish


mlblocks's Issues

MLBlock fit function

  • MLBlocks version: 0.2.0
  • Python version: 3.6
  • Operating System: macOS

Description

I'm trying to use MLBlock directly instead of MLPipeline. Everything was fine, but whenever I try to fit my model, the code crashes.

What I Did

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from mlblocks import MLBlock

block = MLBlock('sklearn.svm.SVC')
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=7)

block.fit(X_train, y_train)

#Traceback:

Traceback (most recent call last):

  File "<ipython-input-22-7c1505067a70>", line 1, in <module>
    runfile('/Users/najat/Documents/GitHub/Cardea/cardea/modeling/modeling.py', wdir='/Users/najat/Documents/GitHub/Cardea/cardea/modeling')

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/najat/Documents/GitHub/Cardea/cardea/modeling/modeling.py", line 30, in <module>
    block.fit(X_train, y_train)

TypeError: fit() takes 1 positional argument but 3 were given

New MLBlocks API

A new API for MLBlocks will be implemented.

These are the most relevant changes:

  • Primitives JSONs and Python code have been moved to a different repository, called MLPrimitives.
  • Optional usage of multiple JSON primitive folders.
  • JSON format has been changed to allow more flexibility and features:
    • input and output arguments, as well as argument types, can be specified for each method
    • both classes and functions are supported as primitives
    • multitype and conditional hyperparameters fully supported
    • data modalities and primitive classifiers introduced
    • metadata such as documentation, description and author fields added
  • Parsers are removed, and now the MLBlock class is responsible for loading and reading the JSON primitive.
  • Multiple blocks of the same primitive are supported within the same pipeline.
  • Arbitrary inputs and outputs for both pipelines and blocks are allowed.
  • Shared variables during pipeline execution, usable by multiple blocks.

Extracting feature importances

  • MLBlocks version: '0.2.0'
  • Python version: Python 3.6.3
  • Operating System: MacOS High Sierra

Description

I have been trying to produce the feature importances from a classifier (RandomForest). The feature_importances_ attribute is not a callable method and takes no arguments, so the MLPipeline doesn't execute (TypeError: 'numpy.ndarray' object is not callable).

What I Did

I left the args list empty in the produce section of the primitive annotation file.


{
    "name": "sklearn.ensemble.RandomForestClassifier1",
    "documentation": "http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html",
    "description": "Scikit-learn RandomForestClassifier.",
    "classifiers": {
        "type": "estimator",
        "subtype": "classifier"
    },
    "modalities": [],
    "primitive": "sklearn.ensemble.RandomForestClassifier",
    "fit": {
        "method": "fit",
        "args": [
            {
                "name": "X",
                "type": "DataFrame"
            },
            {
                "name": "y",
                "type": "Series"
            }
        ]
    },
    "produce": {
        "method": "feature_importances_",
        "args": [],
        "output": [
            {
                "name": "y",
                "type": "Series"
            }
        ]
    },
    "hyperparameters": {
        "fixed": {
            "n_jobs": {
                "type": "int",
                "default": -1
            }
        },
        "tunable": {
            "criterion": {
                "type": "str",
                "default": "entropy",
                "values": ["entropy", "gini"]
            },
            "max_features": {
                "type": "str",
                "default": null,
                "range": [null, "auto", "log2"]
            },
            "max_depth": {
                "type": "int",
                "default": 10,
                "range": [1, 30]
            },
            "min_samples_split": {
                "type": "float",
                "default": 0.1,
                "range": [0.0001, 0.5]
            },
            "min_samples_leaf": {
                "type": "float",
                "default": 0.1,
                "range": [0.0001, 0.5]
            },
            "n_estimators": {
                "type": "int",
                "default": 30,
                "values": [2, 500]
            },
            "class_weight": {
                "type": "str",
                "default": null,
                "range": [null, "balanced"]
            }
        }
    }
}

Load pipelines by name and register them as entry_points

Implement pipeline loading by name and use entry points for pipeline discovery.

The API will work as follows:

  • The primitives argument will be made a keyword argument instead of a positional argument.
  • A new keyword argument, pipeline, will be added to the MLPipeline class in the first position.
  • The pipeline argument will accept four different types (sketched below):
    • If it is a str, it will be considered to be the pipeline name and will be searched for.
    • If it is a list, it will be considered to be the primitives list.
    • If it is a dict, it will be considered to be a complete pipeline specification.
    • If it is an MLPipeline instance, it will be converted to a dict using its to_dict method.
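A minimal sketch of how this dispatch could look, assuming a hypothetical helper load_pipeline_by_name for the name-based lookup; this is a proposal, not the current MLPipeline API:

from mlblocks import MLPipeline

def _resolve_pipeline(pipeline):
    if isinstance(pipeline, str):
        # Pipeline name: search for it, e.g. through the entry points.
        return load_pipeline_by_name(pipeline)    # hypothetical helper
    elif isinstance(pipeline, list):
        # List of primitive names.
        return {'primitives': pipeline}
    elif isinstance(pipeline, dict):
        # Complete pipeline specification.
        return pipeline
    elif isinstance(pipeline, MLPipeline):
        # Existing instance: convert it using its to_dict method.
        return pipeline.to_dict()

    raise TypeError('Invalid pipeline type: {}'.format(type(pipeline)))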

For pipeline discovery, the mlpipelines entry_point name will be used in the same way the mlprimitives one is currently used.
For this, the module primitives will be renamed to discovery and its methods will be reworked to handle both pipelines and primitives.

Add description and CV methods to Datasets

Add two new methods to the Dataset objects (usage sketched below):

  • describe(): Return a text description of the dataset contents.
  • get_splits(splits={number of splits}): Split the dataset for cross validation. If the number of splits is 1, use sklearn.model_selection.train_test_split.
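A hypothetical usage sketch of the two proposed methods; the load_dataset loader name is an assumption used only for illustration:

# Hypothetical loader returning a Dataset object.
dataset = load_dataset('iris')

# Return a text description of the dataset contents.
print(dataset.describe())

# Split the dataset for cross validation; with splits=1 this would fall
# back to sklearn.model_selection.train_test_split.
X_train, X_test, y_train, y_test = dataset.get_splits(splits=1)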

Implement shared hyperparameters

Currently, the hyperparameters of each primitive are specified individually.
However, there are some scenarios where hyperparameters from two different blocks should match each other.

An example of this are the keras.preprocessing.sequence.pad_sequences and keras.Sequential.LSTMTextClassifier primitives: the pad_sequences primitive has a maxlen hyperparameter that must have exactly the same value as the input_length hyperparameter of the LSTMTextClassifier.
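Until shared hyperparameters exist, a minimal workaround sketch is to keep both values in sync manually through init_params; the primitive names come from this issue, and the '#1' block suffixes are assumed here:

from mlblocks import MLPipeline

maxlen = 100

pipeline = MLPipeline(
    [
        'keras.preprocessing.sequence.pad_sequences',
        'keras.Sequential.LSTMTextClassifier'
    ],
    init_params={
        # Both values must match, so they are set from the same variable.
        'keras.preprocessing.sequence.pad_sequences#1': {'maxlen': maxlen},
        'keras.Sequential.LSTMTextClassifier#1': {'input_length': maxlen}
    }
)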

Simplify ml_json_parser

It looks like the usage of ml_json_parser can be simplified so that it depends on fewer methods and does not have to assign methods as attributes on instance creation.

Refactor MLPipeline

Make MLPipeline accept a list of block names in the __init__ call.

Also, make all the MLPipeline subclasses work with __init__ instead of __new__.

Auto hyper-parameter tuning

  • MLBlocks version:
  • Python version:
  • Operating System:

Description

I'm running a Random Forest Classifier algorithm and I'm trying to tune its hyper-parameters. However, after inspecting the code it seems that the algorithm only uses the default values and no automatic tuning is happening. Can you please tell me what exactly I should do in order to invoke and tune the hyper-parameters?

What I Did

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from mlblocks import MLPipeline
from mlblocks import MLBlock
from sklearn.metrics import accuracy_score as score


primitive = [
    'sklearn.ensemble.RandomForestClassifier',
]

pipeline_for_one = MLPipeline(primitive)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=7)

pipeline_for_one.fit(X_train, y_train)
y_pred = pipeline_for_one.predict(X_test)

x = pipeline_for_one.blocks
xx = x.popitem()
b = xx[1]
b.instance
'''
the output was: (all default values)
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=10, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=0.1, min_samples_split=0.1,
            min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)'''

Implement partial re-fit

There should be the possibility to re-fit only a part of the pipeline, keeping the rest of the blocks as they are.

Add data type specification in JSON primitives and conversion functions

Define a standard data type for MLBlocks (pandas or numpy), and provide a way to specify whether a primitive works with this data type or needs a data conversion before or after it is used.

Also, implement conversion functions between the most common data types.

During runtime, the system should use the helper functions to convert the data types back and forth when using primitives that do not work with the standard type.
Also, the system should be smart enough to skip unnecessary conversions when passing data between blocks that support the same data type.
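A minimal sketch of what such conversion helpers could look like; names and behavior are assumptions, not part of the current MLBlocks API:

import numpy as np
import pandas as pd

def to_dataframe(data):
    # Skip the conversion if the data is already a DataFrame.
    if isinstance(data, pd.DataFrame):
        return data
    return pd.DataFrame(data)

def to_ndarray(data):
    # Skip the conversion if the data is already an ndarray.
    if isinstance(data, np.ndarray):
        return data
    return np.asarray(data)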

Implement a native way to selectively apply a primitive to only some columns

Add a way to indicate that a primitive does not need to be applied to the whole input, but only to a part of it.

The selection of which columns to apply the primitive to should be done by either column name or index, and it should be possible to pass the selection in the live variables, allowing the implementation of detector primitives that do the selection in a previous step.

Make parsers part of MLBlock

Since each JSON file being parsed represents an MLBlock, the parsing methods should be part of the MLBlock itself, or at least called from within its code.

Allow multiple blocks with the same name

Currently, the block type is its name, so there cannot be two instances of the same block in a single pipeline, because the second one would overwrite the first one.

To avoid this, the block names should automatically get some unique identifier, like a counter, in their name when they are added to the pipeline.
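A minimal sketch of the proposed counter-based naming; the exact 'name#N' format is an assumption here, although it also appears in the save/load issue below:

from collections import Counter

def make_block_names(primitives):
    counter = Counter()
    names = []
    for primitive in primitives:
        counter[primitive] += 1
        names.append('{}#{}'.format(primitive, counter[primitive]))
    return names

# make_block_names(['a_primitive', 'a_primitive', 'another_primitive'])
# -> ['a_primitive#1', 'a_primitive#2', 'another_primitive#1']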

Update documentation with latest mlprimitives and datasets

MLPrimitives has been updated with new primitives since the documentation was created.

A new multitable dataset has been included, which can be used to showcase multitable pipelines.

A section should be added as well to show how to save and load a pipeline as a JSON.

Implement save/load

Implement a way to save and load an MLPipeline to and from a JSON file.

The JSON file structure would contain the __init__ arguments of the MLPipeline, as well as the hyperparameters and tunable_hyperparameters:

{
    "primitives": [
        "a_primitive",
        "another_primitive"
    ],
    "init_params": {
        "a_primitive": {
            "an_argument": "a_value"
        }
    },
    "hyperparameters": {
        "a_primitive#1": {
            "an_argument": "a_value",
            "another_argument": "another_value",
        },
        "another_primitive#1": {
            "yet_another_argument": "yet_another_value"
         }
    },
    "tunable_hyperparameters": {
        "another_primitive#1": {
            "yet_another_argument": {
                "type": "str",
                "default": "a_default_value",
                "values": [
                    "a_default_value",
                    "yet_another_value"
                ]
            }
        }
    }
}

This will allow two things (a sketch of the save/load cycle follows the list):

  • Store and load the state of a tuned value without having to store the fitted instance.
  • Define a pipeline with a custom set of tunable hyperparameters, which might be different than the ones defined in the used primitives.
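A minimal sketch of how the save/load cycle could look; the save and load method names, as well as the placeholder primitive names, are assumptions taken from this proposal rather than a released API:

from mlblocks import MLPipeline

pipeline = MLPipeline(['a_primitive', 'another_primitive'])
pipeline.set_hyperparameters({
    'a_primitive#1': {'an_argument': 'a_value'}
})

# Store the specification (not the fitted instance) as a JSON file.
pipeline.save('my_pipeline.json')

# Later, restore the pipeline with the stored hyperparameters and refit it.
pipeline = MLPipeline.load('my_pipeline.json')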

Add the possibility to append data to a variable

Sometimes the output of a primitive needs to be appended to or merged with its own input, or with some other variable.

Two possibilities exist for this:

  • Add a way to indicate that the output needs to be appended to the input. This way, for example, a primitive can add columns to X instead of replacing it.
  • Add a native primitive that allows the concatenation of multiple variables: X = pd.concat([X1, X2], axis=1)

Support reading an attribute from a class instead of calling a method?

This is a discussion issue as a follow-up to #71.

Should we natively support returning a class attribute value (or multiple values) instead of calling a method of the class during the produce phase?

This could be simply achieved at the JSON level by not specifying the method key.

Then, we would have:

    "produce": {
        "output": [
            {
                "name": "feature_weights",
                "attribute": "feature_importances_"   # This would be optional, to be used only if the used name does not match the attribute name.
                "type": "list"
            }
        ]
    }

Keras examples crash randomly

The SimpleCnn and LstmText examples crash from time to time.

It seems to be triggered by some random values used internally.

Add command to run all examples to Makefile

Add a command to the Makefile that finds and runs all example modules.

This will allow us to run the examples as an interim health check while we do not have proper unit tests in place.

Currently it is not possible to run the examples as a tox stage, because the examples' behavior is not deterministic and sometimes they crash.

Move argument parsing to MLBlock

The current version of MLBlocks prepares the keyword arguments for each MLBlock inside the MLPipeline, including keyword replacement and the lookup of a default value if an argument is missing.

The consequence of this is that if a primitive JSON defines an argument name different from the keyword expected by the actual primitive method, the MLBlock instance expects to be called with the actual primitive keyword instead of the name defined in the JSON.

An example of this can be seen in the numpy.argmax primitive:

The JSON contains:

    "produce": {
        "args": [
            {
                "name": "y",
                "keyword": "a",
                "type": "ndarray"
            }
        ],
        ...
    }

As a consequence of this, the block is exposing the argument name as y:

>>> block = MLBlock('numpy.argmax')
>>> block.produce_args
[{'name': 'y', 'keyword': 'a', 'type': 'ndarray'}]

But, since the keyword argument is not being parsed by the MLBlock class itself, if we call it using the y keyword it fails:

>>> block.produce(y=[[1, 2, 3]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xals/Projects/MIT/MLBlocks/mlblocks/mlblock.py", line 310, in produce
    return self.primitive(**produce_kwargs)
TypeError: argmax() got an unexpected keyword argument 'y'

And the same would happen if there was a default value defined.

To fix this, we can move the parsing logic from the MLPipeline to the MLBlock, so that the block methods can also be called with the exposed argument names and also have default values.
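A minimal sketch of the remapping that would move into MLBlock: translate the exposed argument names into the primitive's real keywords and fall back to defaults when an argument is missing; the helper name and details are assumptions:

def build_kwargs(args_spec, given_kwargs):
    kwargs = {}
    for arg in args_spec:
        name = arg['name']                   # name exposed by the block, e.g. 'y'
        keyword = arg.get('keyword', name)   # keyword of the real method, e.g. 'a'
        if name in given_kwargs:
            kwargs[keyword] = given_kwargs[name]
        elif 'default' in arg:
            kwargs[keyword] = arg['default']
    return kwargs

# build_kwargs([{'name': 'y', 'keyword': 'a', 'type': 'ndarray'}], {'y': [[1, 2, 3]]})
# -> {'a': [[1, 2, 3]]}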

Isolate primitives from their hyperparameters dictionary

The current implementation does not isolate the MLBlock hyperparameters dictionary from the underlying primitive, allowing the primitive to modify its content and leading to unexpected behaviors and bugs.

This isolation should be enforced by using deepcopy instead of a simple copy when returning the hyperparameters in the get_hyperparameters method, and by always accessing the hyperparameters through this method instead of directly when passing them to the underlying primitive.
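A minimal sketch of the proposed fix, assuming a simplified MLBlock-like class; attribute and method names are illustrative only:

import copy

class IsolatedBlock:

    def __init__(self, hyperparameters):
        self._hyperparameters = hyperparameters

    def get_hyperparameters(self):
        # deepcopy prevents callers and the primitive from mutating
        # the block's internal dictionary.
        return copy.deepcopy(self._hyperparameters)

    def build_instance(self, primitive_class):
        # Always go through get_hyperparameters so the primitive
        # receives an isolated copy.
        return primitive_class(**self.get_hyperparameters())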

Improve Logging

Add logging statements in the code.

Also, find a way to show the name of the failing block when there is an exception within a pipeline.

Add more datasets

Add more datasets to cover a broader scope of data modalities and task types:

data_modality  task_type               done  dataset
audio          classification          no    ESC-50
graph          communityDetection      no    6_70_com_amazon
graph          graphMatching           no    LL1_DIC28_net
graph          linkPrediction          yes   UMLS
graph          vertexNomination        no    LL1_net_nomination_seed
image          classification          yes   USPS
image          regression              yes   HandGeometry
single_table   classification          yes   Iris
single_table   collaborativeFiltering  no    60_jester
single_table   regression              yes   Boston
text           classification          no    Personae
timeseries     classification          no    LL1_Trace

Change the pipeline `*_params` format

All the pipeline parameters (init_params, fit_params, etc.) are expected to be dicts with the format {(block_name, param_name): param_value} as input.

This format is not JSON serializable because of the tuple keys, and needs to be converted to a different format inside the MLPipeline in order to interact with the MLBlocks.

I suggest using nested dictionaries, as in {block_name: {param_name: param_value}}, as input, which is JSON serializable (as long as the values also are), and will greatly simplify the internal MLPipeline implementation.

Add a demo data module

A demo data module should be included, with functions to download testing data from an S3 bucket.

Integrate Travis and set the Gh-pages

Travis was removed some days ago because the repository was private and Travis had no visibility over it, but now that it's public we can set it up again and use it to build the gh-pages documentation.

The work to be done is:

  • Add the Travis configuration to the repo and set the project up in Travis.
  • Configure the repository to have the documentation served from the gh-pages branch.

Add primitive requirements parsing and validation

Add a JSON keyword to specify dependencies or requirements, and add a validation of those during the MLBlock instantiation.

If a dependency is missing or an incompatible version is installed, a user-friendly exception should be raised when the MLBlock is created, asking the user to install the required dependency.
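A minimal sketch of how such a validation could work at MLBlock creation time, using pkg_resources; the requirements keyword and the error message are assumptions based on this proposal:

import pkg_resources

def validate_requirements(requirements):
    for requirement in requirements:    # e.g. ['numpy>=1.15', 'xgboost<1.0']
        try:
            pkg_resources.require(requirement)
        except (pkg_resources.DistributionNotFound,
                pkg_resources.VersionConflict) as error:
            raise RuntimeError(
                'Primitive requirement not satisfied: {}. '
                'Please install it before using this primitive.'.format(error)
            )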

Add primitive caching

The primitive JSON files should be looked for in at least two places:

  • the package primitives directory
  • a custom user directory, such as ./mlblocks_primitives

There should also be a method to add a custom folder where primitives should be taken from.

There should also be a dict somewhere in the package where the loaded primitives would be stored "in memory" to avoid having to load them from disk more than once.
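A minimal sketch of the lookup paths plus in-memory cache described above; function and variable names are assumptions, not the actual MLBlocks implementation:

import json
import os

_PRIMITIVES_PATHS = [
    os.path.join(os.path.dirname(__file__), 'primitives'),   # package directory
    os.path.join(os.getcwd(), 'mlblocks_primitives')          # custom user directory
]
_PRIMITIVES_CACHE = {}

def add_primitives_path(path):
    """Register a custom folder where primitives should be looked for first."""
    _PRIMITIVES_PATHS.insert(0, path)

def load_primitive(name):
    """Load a primitive JSON, reading it from disk only once."""
    if name not in _PRIMITIVES_CACHE:
        for base_path in _PRIMITIVES_PATHS:
            json_path = os.path.join(base_path, name + '.json')
            if os.path.isfile(json_path):
                with open(json_path) as json_file:
                    _PRIMITIVES_CACHE[name] = json.load(json_file)
                break
        else:
            raise ValueError('Unknown primitive: {}'.format(name))

    return _PRIMITIVES_CACHE[name]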

Support flat hyperparameter dictionaries

The MLBlocks hyperparameter specification format consists of a hierarchical tree of dictionaries where each key is a block name and each value is a dictionary with the complete hyperparameter specification for that block.

BTB, instead, specifies the hyperparameters as a flat dictionary where each key is a two-element tuple containing the block name in the first place and the hyperparameter name in the second place, and where the values are the corresponding hyperparameter values.

We should add support for that format in the following ways (a conversion sketch follows the list):

  • The internal and main format will continue to be the hierarchical one.
  • MLPipeline.set_hyperparameters will accept both flat and hierarchical formats as input. If a flat dictionary is passed, it will be converted to a hierarchical one.
  • MLPipeline.get_hyperparameters will accept a new argument, flat=False. If set to True, a flat dictionary will be returned. Otherwise (default), the hierarchical one will be returned.
  • MLPipeline.get_tunable_hyperparameters will accept a new argument, flat=False. If set to True, a flat dictionary will be returned. Otherwise (default), the hierarchical one will be returned.
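A minimal sketch of the flat/hierarchical conversion that the proposed arguments would rely on; the helper names are assumptions:

def flatten(hyperparameters):
    return {
        (block_name, name): value
        for block_name, block_params in hyperparameters.items()
        for name, value in block_params.items()
    }

def unflatten(flat_hyperparameters):
    hierarchical = {}
    for (block_name, name), value in flat_hyperparameters.items():
        hierarchical.setdefault(block_name, {})[name] = value
    return hierarchical

# flatten({'xgboost.XGBClassifier#1': {'n_estimators': 100}})
# -> {('xgboost.XGBClassifier#1', 'n_estimators'): 100}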

Filter conditional hyperparameters during MLBlock initialization

When an MLBlock instance is created, if there is a conditional hyperparameter and its condition hyperparameter has been given as an init param, the conditional hyperparameter should be resolved into a regular tunable hyperparameter.

An example of this would be a primitive with the following hyperparameters:

"hyperparameters": {
    "foo": {
        "type": "str",
        "values": ["a", "b"]
    },
    "bar": {
        "type": "conditional",
        "condition": "foo",
        "values": {
            "a": {
                "type": "int",
                "range": [1, 10]
            "b": {
                "type": "float",
                "range": [0.0, 0.1]
             }
        }
    }
}

That is instantiated as:

pipeline = MLPipeline(['a_primitive'], {'a_primitive': {'foo': 'a'}})

In this case, calling the get_tunable_hyperparameters should return:

{
    "bar": {
        "type": "int",
        "range": [1, 10]
    }
}

Problem with pipeline.fit()

  • MLBlocks version: 0.2.0
  • Python version: 3.6.5
  • Operating System: macOS

Description

I was walking through the tutorial and everything was working just fine; however, after I called the pipeline.fit() function a TypeError was raised.

What I Did

#The commands I ran, before the crash.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from mlblocks import MLPipeline
import pandas as pd

primitives = [
    'sklearn.preprocessing.StandardScaler',
    'sklearn.ensemble.RandomForestClassifier'
]
pipeline = MLPipeline(primitives)

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=7)

pipeline.fit(X_train, y_train)


Traceback (most recent call last):

  File "<ipython-input-269-94584d36b1c1>", line 1, in <module>
    runfile('/Users/najat/Documents/MLBlocks/mlblocks_test.py', wdir='/Users/najat/Documents/MLBlocks')

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/najat/Documents/MLBlocks/mlblocks_test.py", line 23, in <module>
    pipeline.fit(X_train, y_train)

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/mlblocks/mlpipeline.py", line 139, in fit
    block.fit(**fit_args)

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/mlblocks/mlblock.py", line 120, in fit
    getattr(self.instance, self.fit_method)(**fit_args)

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 590, in fit
    return self.partial_fit(X, y)

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 612, in partial_fit
    warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES)

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 41, in _assert_all_finite
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())

  File "/Users/najat/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py", line 35, in _sum
    return umr_sum(a, axis, dtype, out, keepdims, initial)

TypeError: reduce() takes at most 5 arguments (6 given)

Improve README example

The README example should show:

  • Multiple libraries
  • A pipeline save and load cycle
  • The usage of tunable hyperparameters

Add Unit Tests

All the code should be covered with unit tests.

This issue should be used to reach an initial 100% coverage before we openly request external contributions to the project.

Allow getting intermediate outputs

Introduce a way to extract intermediate outputs from a Pipeline without having to rebuild it and refit it.

The functionality should be something along these lines:

primitives = ['a', 'b', 'c', 'd']
pipeline = MLPipeline(primitives)
pipeline.fit(train_X, train_y)
output_of_b = pipeline.predict(train_X, output_='b')
