AutoFlow : Automatic machine learning workflow modeling platform

Home Page: https://auto-flow.github.io/auto-flow/

License: Other

AutoFlow

Introduction

In data mining and machine learning on tabular data, data scientists usually group the features, construct a directed acyclic graph (DAG) over those groups, and thereby form a machine learning workflow.

For each directed edge of this graph, the tail node represents a feature group before preprocessing, and the head node represents the feature group after preprocessing. Each edge represents a data processing or feature engineering algorithm, and algorithm selection and hyper-parameter optimization are performed on every edge.
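To make the graph structure concrete, here is a minimal sketch in plain Python. It is not AutoFlow's internal representation: the feature-group names (`num`, `cat`, `scaled`, `encoded`, `final`), the algorithm names, and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical toy workflow DAG; NOT AutoFlow's internal format.
# Each key is a directed edge (tail feature group, head feature group);
# each value lists the candidate algorithms that could run on that edge.
workflow_edges = {
    ("num", "scaled"): ["standardize", "minmax"],
    ("cat", "encoded"): ["one_hot", "ordinal"],
    ("scaled", "final"): ["passthrough"],
    ("encoded", "final"): ["passthrough"],
}


def topological_order(edges):
    """Return the feature groups so that every tail precedes its head."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    for _tail, head in edges:
        indegree[head] += 1
    ready = sorted(node for node in nodes if indegree[node] == 0)
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for tail, head in edges:
            if tail == node:
                indegree[head] -= 1
                if indegree[head] == 0:
                    ready.append(head)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


def count_workflows(edges):
    """Number of distinct workflows when one algorithm is chosen per edge."""
    total = 1
    for candidates in edges.values():
        total *= len(candidates)
    return total
```

Even in this tiny graph the number of candidate workflows is the product of the choices on each edge, which is why the search space grows quickly as edges and algorithms are added.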

Unfortunately, manually selecting algorithms and hyper-parameters for such a workflow is very tedious. To solve this problem, we developed AutoFlow, which automatically selects algorithms and optimizes the hyper-parameters of a machine learning workflow. In other words, it implements AutoML for tabular data.
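As a rough illustration of what "automatically selecting algorithms and hyper-parameters" means, the sketch below runs a plain random search over a joint space of algorithm choice and hyper-parameter values. This is a simpler stand-in, not AutoFlow's search procedure. The search space, the algorithm names (`algo_a`, `algo_b`), and `toy_loss` are all hypothetical; in practice the loss would come from training and validating a real model.

```python
import random

# Hypothetical search space: algorithm name -> hyper-parameter ranges (low, high).
SEARCH_SPACE = {
    "algo_a": {"learning_rate": (0.01, 0.3)},
    "algo_b": {"depth": (2.0, 10.0)},
}


def sample_config(rng):
    """Pick an algorithm, then sample each of its hyper-parameters uniformly."""
    algo = rng.choice(sorted(SEARCH_SPACE))
    params = {
        name: rng.uniform(low, high)
        for name, (low, high) in SEARCH_SPACE[algo].items()
    }
    return algo, params


def random_search(evaluate, n_runs, seed=0):
    """Return the (loss, algo, params) triple with the lowest loss seen."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_runs):
        algo, params = sample_config(rng)
        loss = evaluate(algo, params)
        if best is None or loss < best[0]:
            best = (loss, algo, params)
    return best


def toy_loss(algo, params):
    """Stand-in for 'train a model and measure validation loss'."""
    if algo == "algo_a":
        return abs(params["learning_rate"] - 0.1)
    return abs(params["depth"] - 6.0) / 10.0
```

Calling `random_search(toy_loss, n_runs=20)` returns the best configuration among 20 random samples. A model-based optimizer replaces the uniform sampling with proposals informed by earlier runs.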

Documentation

The documentation can be found at https://auto-flow.github.io/auto-flow/.

Installation

Requirements

This project is built and tested on Linux, so a Linux platform is required. If you are using Windows, WSL is worth considering.

Besides the listed requirements (see requirements.txt), the random forest used in SMAC3 requires SWIG (>= 3.0, < 4.0) as a build dependency. If you are using Ubuntu or another Debian-based Linux distribution, you can run the following command:

apt-get install swig

On Arch Linux (or any distribution with swig4 as default implementation):

pacman -Syu swig3
ln -s /usr/bin/swig-3 /usr/bin/swig

AutoFlow requires Python 3.6 or higher.
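The requirements above can be checked before installing. The following is a minimal sketch, not part of AutoFlow: the helper names are hypothetical, and it assumes that `swig -version` prints a line of the form `SWIG Version X.Y.Z`.

```python
import re
import shutil
import subprocess
import sys


def parse_swig_version(text):
    """Extract (major, minor, patch) from `swig -version` output, or None."""
    match = re.search(r"SWIG Version (\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def swig_version_ok(version):
    """True if the version satisfies >= 3.0 and < 4.0."""
    return version is not None and (3, 0) <= version[:2] < (4, 0)


def check_environment():
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or higher is required")
    swig = shutil.which("swig")
    if swig is None:
        problems.append("swig executable not found on PATH")
    else:
        # stdout/stderr are merged so the version line is found either way.
        output = subprocess.run(
            [swig, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        ).stdout
        if not swig_version_ok(parse_swig_version(output)):
            problems.append("SWIG >= 3.0, < 4.0 is required")
    return problems
```

Printing the result of `check_environment()` before running `pip install auto-flow` gives an early hint if the SWIG build dependency is missing or too new.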

Installation via pip

pip install auto-flow

Manual Installation

git clone https://github.com/auto-flow/autoflow.git && cd autoflow
python setup.py install

Quick Start

Titanic is perhaps the most familiar machine learning task for data scientists. For tutorial purposes, the Titanic dataset is provided in examples/data/train_classification.csv and examples/data/test_classification.csv. You can use AutoFlow to complete this ML task instead of manually exploring all the features of the dataset. Let's do it!

$ cd examples/classification

import os

import joblib
import pandas as pd
from sklearn.model_selection import KFold

from autoflow import AutoFlowClassifier

# load data from csv file
train_df = pd.read_csv("../data/train_classification.csv")
test_df = pd.read_csv("../data/test_classification.csv")
# initial_runs  -- the initial runs are pure random search; they provide experience for the SMAC algorithm.
# run_limit     -- the maximum number of runs.
# n_jobs        -- how many search processes are started.
# included_classifiers -- restricts the search space; here lightgbm is the only classifier considered.
# per_run_time_limit -- limits the run time of each trial; a trial that runs longer than 60 seconds is killed.
trained_pipeline = AutoFlowClassifier(initial_runs=5, run_limit=10, n_jobs=1, included_classifiers=["lightgbm"],
                                    per_run_time_limit=60)
# describe the meaning of the columns; `id`, `target` and `ignore` each have a specific meaning:
# `id` is the column that uniquely identifies each row,
# `target` is the column that the model will learn to predict,
# `ignore` names columns that contain irrelevant information
column_descriptions = {
    "id": "PassengerId",
    "target": "Survived",
    "ignore": "Name"
}
if not os.path.exists("autoflow_classification.bz2"):
    # pass `train_df`, `test_df` and `column_descriptions` to the classifier;
    # if `fit_ensemble_params` is set to "auto", a stacking ensemble will be used
    # `splitter` is the train-validation splitter; here it is 3-fold cross-validation
    trained_pipeline.fit(
        X_train=train_df, X_test=test_df, column_descriptions=column_descriptions,
        fit_ensemble_params=False,
        splitter=KFold(n_splits=3, shuffle=True, random_state=42),
    )
    # finally, the best model is serialized and stored in the local file system for later use
    joblib.dump(trained_pipeline, "autoflow_classification.bz2")
    # to see the workflow space that AutoFlow searches, use `draw_workflow_space` to visualize it
    hdl_constructor = trained_pipeline.hdl_constructors[0]
    hdl_constructor.draw_workflow_space()
# for prediction, first load the serialized model from the file system
predict_pipeline = joblib.load("autoflow_classification.bz2")
# then use the loaded model to predict
result = predict_pipeline.predict(test_df)
print(result)
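Instead of only printing the predictions, you may want to store them next to the row identifiers. The helper below is a minimal sketch using the standard library. `write_predictions` is a hypothetical name, not an AutoFlow function, and it assumes `result` is a sequence of predicted labels in the same row order as `test_df`.

```python
import csv


def write_predictions(ids, predictions, path,
                      id_name="PassengerId", target_name="Survived"):
    """Write one (id, prediction) row per sample, preceded by a header row."""
    ids, predictions = list(ids), list(predictions)
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([id_name, target_name])
        writer.writerows(zip(ids, predictions))
```

Under that assumption, `write_predictions(test_df["PassengerId"], result, "predictions.csv")` would save the output of the script above.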
