reiinakano / xcessiv

A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling in Python.

Home Page: http://xcessiv.readthedocs.io

License: Apache License 2.0

Python 59.54% HTML 0.44% CSS 0.90% JavaScript 39.12%
machine-learning ensemble-learning stacked-ensembles scikit-learn data-science hyperparameter-optimization automated-machine-learning

xcessiv's Introduction

Xcessiv


Xcessiv is a tool to help you create the biggest, craziest, and most excessive stacked ensembles you can think of.

Stacked ensembles are simple in theory. You combine the predictions of smaller models and feed those into another model. However, in practice, implementing them can be a major headache.

Xcessiv holds your hand through all the implementation details of creating and optimizing stacked ensembles so you're free to fully define only the things you care about.
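
The idea in code, as a minimal sketch using scikit-learn directly (the dataset, the choice of base learners, and the use of cross_val_predict here are illustrative assumptions, not Xcessiv's internals):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

base_learners = [RandomForestClassifier(n_estimators=100, random_state=0),
                 LogisticRegression(max_iter=1000)]

# Out-of-fold probability predictions become the meta-features, so the
# secondary learner never sees predictions made on a model's own training data.
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]
    for clf in base_learners
])

# The secondary (stacking) learner is trained on the meta-features.
print(cross_val_score(LogisticRegression(), meta_features, y, cv=5).mean())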

The Xcessiv process

Define your base learners and performance metrics

(screenshot: define_base_learner)

Keep track of hundreds of different model-hyperparameter combinations

(screenshot: list_base_learner)

Effortlessly choose your base learners and create an ensemble with the click of a button

(screenshot: ensemble)

Features

  • Fully define your data source, cross-validation process, relevant metrics, and base learners with Python code
  • Any model following the scikit-learn API can be used as a base learner (see the sketch after this list)
  • Task queue based architecture lets you take full advantage of multiple cores and embarrassingly parallel hyperparameter searches
  • Direct integration with TPOT for automated pipeline construction
  • Automated hyperparameter search through Bayesian optimization
  • Easy management and comparison of hundreds of different model-hyperparameter combinations
  • Automatic saving of generated secondary meta-features
  • Stacked ensemble creation in a few clicks
  • Automated ensemble construction through greedy forward model selection
  • Export your stacked ensemble as a standalone Python file to support multiple levels of stacking
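
Because base learners only need to follow the scikit-learn estimator API, custom models can be plugged in as well. A minimal sketch of such an estimator (the class and its behavior are illustrative assumptions, not part of Xcessiv):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MajorityClassClassifier(BaseEstimator, ClassifierMixin):
    """Toy classifier that always predicts the most frequent training class.
    Anything exposing get_params/set_params, fit, and predict (or predict_proba)
    can serve as a base learner."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)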

Installation and Documentation

You can find installation instructions and detailed documentation hosted here.

FAQ

Where does Xcessiv fit in the machine learning process?

Xcessiv fits in the model building part of the process after data preparation and feature engineering. At this point, there is no universally acknowledged way of determining which algorithm will work best for a particular dataset (see No Free Lunch Theorem), and while heuristic optimization methods do exist, things often break down into trial and error as you try to find the best model-hyperparameter combinations.

Stacking is an almost surefire way to improve performance beyond that of any single model. In practice, however, the complexity of a proper implementation often makes stacking impractical outside of Kaggle competitions. Xcessiv aims to make the construction of stacked ensembles as painless as possible and to lower the barrier to entry.

I don't care about fancy stacked ensembles and whatnot. Should I still use Xcessiv?

Absolutely! Even without the ensembling functionality, being able to keep track of the performance of hundreds or even thousands of model-hyperparameter combinations is a huge boon.

How does Xcessiv generate meta-features for stacking?

You can choose whether to generate meta-features through cross-validation (stacked generalization) or with a holdout set (blending). You can read about these two methods and a lot more about stacked ensembles in the Kaggle Ensembling Guide. It's a great article and provides most of the inspiration for this project.
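
The difference between the two strategies, roughly sketched with plain scikit-learn (the function names here are illustrative, not Xcessiv's API):

from sklearn.model_selection import cross_val_predict, train_test_split

def meta_features_cv(estimator, X, y, cv=5):
    # Stacked generalization: every training sample gets an out-of-fold prediction.
    return cross_val_predict(estimator, X, y, cv=cv, method='predict_proba')

def meta_features_holdout(estimator, X, y, holdout_size=0.2, seed=0):
    # Blending: fit on the main split, generate meta-features only for the holdout.
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X, y, test_size=holdout_size, random_state=seed)
    estimator.fit(X_fit, y_fit)
    return estimator.predict_proba(X_hold), y_hold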

Contributing

Xcessiv is in its very early stages and needs the open-source community to guide it along.

There are many ways to contribute to Xcessiv: you could report a bug, suggest a feature, submit a pull request, improve the documentation, and more.

If you would like to contribute something, please visit our Contributor Guidelines.

Project Status

Xcessiv is currently in alpha and is unstable. Future versions are not guaranteed to be backwards-compatible with current project files.

xcessiv's People

Contributors

enisnazif, jef5ez, marcelmaatkamp, menglewis, reiinakano, ryanliwag


xcessiv's Issues

pause or abort auto run?

Is there a way to pause or abort an autorun?

I have been getting weird errors on a recent run, and it would have been helpful to have a way to abort it so that I can investigate without deleting the job and losing the results.

Add new preset estimators, metrics, and cv iterators

Sklearn has a TON of estimators, metrics, and CV iterators that could trivially be added to the xcessiv.presets package. I'm too focused on other issues to add them all myself.

Anyone who wants to help expand the list can easily do so.

Adding preset estimators/metrics/CVs is very easy. There's literally no need to understand how the rest of Xcessiv works; just take a look at and copy the patterns in the xcessiv.presets package. Also add corresponding tests for your addition.

Please keep PRs limited to one feature addition each for easy debugging and reformatting if needed. Of course, you can submit as many PRs as you like :)

no export button?

I just built a stacked ensemble but there doesn't seem to be an export button?

Automated ensembling techniques

After working with Xcessiv for a while, I feel there's a need for some way to automate the selection of base learners in an ensemble. I'm unaware of existing techniques for this, so if anyone has suggestions or can point me toward relevant literature, it would be greatly appreciated.

Error with tuple hyperparameters

Hello @reiinakano

I have an issue with tuple hyperparameters. I think the brackets are omitted during training.

from gplearn.genetic import SymbolicRegressor
base_learner = SymbolicRegressor(random_state=8, n_jobs=6)

params = {'function_set': ('sqrt', 'add', 'sub', 'mul', 'div', 'max', 'min', 'log'),
          'const_range': (-30, 280),
          'init_depth': (6, 13),
          'population_size': 150,
          'generations': 600,
          'stopping_criteria': 0.001,
          'p_crossover': 0.45,
          'p_subtree_mutation': 0.15,
          'p_hoist_mutation': 0.15,
          'p_point_mutation': 0.25,
          'max_samples': 0.85,
          'metric': 'rmse',
          'parsimony_coefficient': 0.0002}

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/xcessiv/rqtasks.py", line 135, in generate_meta_features
    est = est.fit(X_train, y_train)
  File "/root/anaconda3/lib/python3.6/site-packages/gplearn/genetic.py", line 366, in fit
    raise ValueError('const_range should be a tuple with length two.')
ValueError: const_range should be a tuple with length two.

xcessiv server

Hi
This project looks very cool, but I'm having some problems with the setup.
I am running this in a container (my own), and I can't get the server to show up. From inside the container I can see the server running: ps shows xcessiv running and curl localhost:1994 gives me some HTML from Xcessiv. From outside the container, however, there's nothing.

I suppose that's down to the server.py file which I have now changed to this:

from __future__ import absolute_import, print_function, division, unicode_literals
from gevent.wsgi import WSGIServer
# import webbrowser


def launch(app):
    http_server = WSGIServer(('0.0.0.0', app.config['XCESSIV_PORT']), app)
    # webbrowser.open_new('http://localhost:' + str(app.config['XCESSIV_PORT']))
    http_server.serve_forever()

I have changed the WSGIServer setup to be open to outside connections (I suppose that's what I changed), but it's still not showing up.

Feedback appreciated. I'd like to try this out.
Thanks!

How to deal with large amounts of data?

I am trying to use Xcessiv for an image classification project (together with Keras or PyTorch). Reading the walkthrough, I found that I have to pass the entire dataset as an (X, y) tuple. This is unfeasible for a large image dataset. How can I overcome this problem?

One solution that I thought about was to pass image paths as the X, and let fit load the data lazily. Is this the best approach?

PS: Thanks for creating and maintaining Xcessiv!
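
The path-based idea can work if the base learner does the loading itself. A rough sketch of a scikit-learn-compatible wrapper that takes file paths as X and loads images lazily inside fit and predict (the class, the helper, and the image size are hypothetical, and Pillow is assumed to be installed):

import numpy as np
from PIL import Image
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression

def _load_batch(paths, size=(64, 64)):
    # Load, grayscale, resize, and flatten each image only when it is needed.
    return np.stack([np.asarray(Image.open(p).convert('L').resize(size)).ravel()
                     for p in paths])

class LazyImageClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, size=(64, 64)):
        self.size = size

    def fit(self, X, y):
        # X is an array of image paths; extract_main_dataset only has to return paths.
        self.model_ = LogisticRegression(max_iter=1000)
        self.model_.fit(_load_batch(X, self.size), y)
        return self

    def predict(self, X):
        return self.model_.predict(_load_batch(X, self.size))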

'dict_keys' object does not support indexing

On lines 306 and 309 of views.py, trying to index a dictionary keys object will fail on Python 3 and result in a server error. The fix is simple: change all occurrences of

base_learner_origin.validation_results.keys()[0]

to

list(base_learner_origin.validation_results.keys())[0]

Cannot install on Windows ..

Environment = Anaconda Python 3.6

Os= Windows 10

Collecting xcessiv
  Using cached xcessiv-0.1.1.tar.gz
  Complete output from command python setup.py egg_info:
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Arun\AppData\Local\Temp\pip-build-enuakn0v\xcessiv\setup.py", line 8, in <module>
      import xcessiv
    File "C:\Users\Arun\AppData\Local\Temp\pip-build-enuakn0v\xcessiv\xcessiv\__init__.py", line 12, in <module>
      import xcessiv.views
    File "C:\Users\Arun\AppData\Local\Temp\pip-build-enuakn0v\xcessiv\xcessiv\views.py", line 5, in <module>
      from rq import Connection
    File "C:\Anaconda3\envs\python36\lib\site-packages\rq\__init__.py", line 11, in <module>
      from .worker import SimpleWorker, Worker
    File "C:\Anaconda3\envs\python36\lib\site-packages\rq\worker.py", line 88, in <module>
      class Worker(object):
    File "C:\Anaconda3\envs\python36\lib\site-packages\rq\worker.py", line 361, in Worker
      def kill_horse(self, sig=signal.SIGKILL):
  AttributeError: module 'signal' has no attribute 'SIGKILL'

ValueError: There exists an active worker named [Docker]

I'm using Xcessiv inside Docker, with docker-compose up.
When I issue a docker restart (because jobs sometimes get stuck; maybe that's another issue), I get the following:

xcessiv_1  | Process Process-2:
xcessiv_1  | Traceback (most recent call last):
xcessiv_1  |   File "/usr/local/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
xcessiv_1  |     self.run()
xcessiv_1  |   File "/usr/local/lib/python2.7/multiprocessing/process.py", line 114, in run
xcessiv_1  |     self._target(*self._args, **self._kwargs)
xcessiv_1  |   File "/usr/local/lib/python2.7/site-packages/xcessiv/scripts/runworker.py", line 17, in runworker
xcessiv_1  |     w.work()
xcessiv_1  |   File "/usr/local/lib/python2.7/site-packages/rq/worker.py", line 446, in work
xcessiv_1  |     self.register_birth()
xcessiv_1  |   File "/usr/local/lib/python2.7/site-packages/rq/worker.py", line 259, in register_birth
xcessiv_1  |     raise ValueError(msg.format(self.name))
xcessiv_1  | ValueError: There exists an active worker named u'4781ae60cd1b.26' already

I read that

By default, the name of a worker is equal to the concatenation of the current hostname and the current PID

and this is probably the problem with Docker, which does not change either of those two things after a restart.

Does anybody have a solution, apart from restarting the whole machine?
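
One workaround that seems plausible is to register each worker under a unique name, so a restarted container never collides with the stale registration left behind in Redis. A minimal sketch, assuming the worker is started roughly the way xcessiv's runworker script starts it (the host name and queue are assumptions):

# Start an RQ worker with a unique name (uuid suffix) so a docker restart does
# not trip "There exists an active worker named ..." left over in Redis.
import uuid
from redis import Redis
from rq import Queue, Worker

conn = Redis(host='redis', port=6379)
worker = Worker([Queue(connection=conn)],
                connection=conn,
                name='xcessiv-%s' % uuid.uuid4().hex[:8])
worker.work()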

ImportError: No module named wsgi

File "/usr/local/Cellar/python/2.7.14/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xcessiv/server.py", line 2, in
from gevent.wsgi import WSGIServer
ImportError: No module named wsgi
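
As the redis.exceptions.DataError issue further down this page notes, newer gevent releases expose the WSGI server under gevent.pywsgi, so a patch along these lines in xcessiv/server.py resolves the ImportError (the try/except fallback is only a defensive suggestion):

# Import WSGIServer from its new location on recent gevent, falling back to
# the old module path on older installs.
try:
    from gevent.pywsgi import WSGIServer   # gevent >= 1.3
except ImportError:
    from gevent.wsgi import WSGIServer     # older gevent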

Issues with TfidfVectorizer

Hey, great tool.

I have a problem, though, when I try to use a TfidfVectorizer for text classification. When I create a single base learner I get the error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly.

The type of the X variable is a numpy.ndarray, but if I don't convert X to an array I get the error message:

TypeError: Singleton array array(<92820x194 sparse matrix of type '<class 'numpy.float64'>' with 92820 stored elements in Compressed Sparse Row format>, dtype=object) cannot be considered a valid collection.

I chose the preset learner setting scikit-learn Random Forest as the Base Learner Type.

import os
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

def extract_main_dataset():
    # pandas data frame with the columns Classification, FeatureVector
    # ie:
    # 0, 'This is the feature vector'
    # 1, 'This is another feature vector' 
    # 2, 'This is yet another feature vector' 
    # 1, 'This is the last feature vector example' 
    with open('feature_vector.pik', 'rb') as rf:
        feature_vector = pickle.load(rf)

    y = np.array(feature_vector.Classification.values)
    title_rf_vectorizer = TfidfVectorizer(ngram_range=(2, 9),
                                          sublinear_tf=True,
                                          use_idf=True,
                                          strip_accents='ascii')

    title_rf_classifier = RandomForestClassifier(n_estimators=100, n_jobs=8)
    X = title_rf_vectorizer.fit_transform(feature_vector["Classification"]).toarray()
    return X, y

Base Learner Correlation Matrix

First of all, big props for this project! It's a big help in constructing big stacking models.

It would be interesting to get some visualizations in the tool, e.g. a correlation matrix between the meta-features.

If I ever get some time to spare, I'll start reading up on the code base and see if I can integrate it.

integration with TensorFlow

While the current interface provided by Xcessiv is, well, excessively nice, I am hoping it will be possible to integrate TensorFlow models/estimators with it. Has this been done, or would it be possible to add documentation with a basic example?

networking on docker

Hi,

I found the README for the docker container a bit confusing...

What I did to run the 2 containers (Redis and xcessiv) was simply:

docker run --name some-redis -d redis
docker run -P --name='xcessiv' --link some-redis:redis reiinakano/xcessiv xcessiv -H redis -P 6379

without needing to figure out the IP of the Redis container.

Cancel/clear queued base learners

I accidentally executed a huge grid search. Even starting and stopping Redis and Xcessiv won't clear the queue, and there's no way to cancel all pending tasks. Please advise.
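
Since Xcessiv's jobs go through an RQ queue in Redis, one possible workaround (a sketch, assuming Redis on localhost:6379 and RQ's default queue name) is to empty the queue directly:

# Drop every queued (not yet started) RQ job so no further base learners run.
from redis import Redis
from rq import Queue

conn = Redis(host='localhost', port=6379)
q = Queue(connection=conn)       # RQ's default queue
print('pending jobs:', len(q))
q.empty()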

Leave a python terminal/notebook for debugging

Hi, I think the potential of this tool is enormous, but I have a lot of problems even just loading the data and debugging. If only I had a terminal in the web interface to execute custom code and retrieve stdout/stderr, maybe I could understand what is going on under the hood and debug. Instead, I only get a bare exception printed on the screen. :(

Memory management

First of all, thanks! I find this project fascinating. My question/issue is about how you handle memory for multiple processes. By default, Python creates a copy of the data per process, which is prohibitive for large datasets.

How did you manage this problem?
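
One common workaround (a sketch, not a description of Xcessiv's actual behavior; the file names are assumptions) is to memory-map the feature matrix so that worker processes share pages instead of each copying the full array:

# Save the arrays once, then have extract_main_dataset return a read-only
# memory map that multiple worker processes can share.
import numpy as np

# One-time preparation, outside Xcessiv:
# np.save('X.npy', X); np.save('y.npy', y)

def extract_main_dataset():
    X = np.load('X.npy', mmap_mode='r')   # pages are shared, not copied
    y = np.load('y.npy')
    return X, y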

Can't Use

Sorry for what is no doubt a stupid question:

I've started Redis via redis-server and it says it's running on port 6379. Then I run xcessiv, but it takes me to a page that's not found: "The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again." Any idea what I can do? I'm really eager to use Xcessiv.

Add support for multiple datasets

Hi there!

This is an AWESOME library! I simply cannot express my happiness with this super helpful library enough.

Anyway, I think it would be really nice if Xcessiv supported multiple datasets, similar to its support for multiple estimators, ensembles, etc.

This way, we could define and import multiple datasets, and then configure each individual base estimator instance to take input from a user-specified dataset that they have defined. This would be important for importing heterogeneous data into a subset of the estimators, and even import different versions of the same dataset (classification, regression, etc.) so that some estimators can be classifiers, some can be regressors, and regressors can take input from classifier predict_proba results, etc.

Will this implementation break the existing one? In other words, is it possible to extend the current version to support multiple datasets? If you have any tips for developing this, please share them and I will try to implement it myself (I'm fairly new to contributing code on GitHub, so if you can give me a brief pointer to the files/packages I would need to look into, I'll figure out the rest).

Thank you so much for your hard work! This library is a God-send for all machine learning developers, newbies, and Kagglers out there!

Presets for choosing base_learners to create stacked ensembles

Let me preface this feature request by saying that I absolutely love this project, and would be happy to contribute however I can.

As implied by the title, I think it would be great if there were some preset options (other than Automated Ensemble Search) for creating stacked ensembles from the existing list of base_learners. Some suggestions (a rough sketch of the first one follows this list):

  • Create an ensemble from the k best base_learners (as ranked by any of the chosen metrics)
  • Create an ensemble from the k best base_learners per base_learner type, e.g. the 3 best XGBoost classifiers, the 3 best RF classifiers, etc.
  • Random ensemble creation
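
A rough sketch of the first suggestion, selecting the k best base learners by a chosen metric (the list-of-dicts structure is an assumption for illustration, not Xcessiv's internal model):

# Pick the k base learners with the best score for a given metric.
def k_best_base_learners(base_learners, metric='Accuracy', k=3):
    """base_learners: list of dicts like {'id': ..., 'scores': {'Accuracy': 0.91}}."""
    ranked = sorted(base_learners,
                    key=lambda bl: bl['scores'][metric],
                    reverse=True)
    return ranked[:k]

learners = [
    {'id': 'xgb-1', 'scores': {'Accuracy': 0.91}},
    {'id': 'rf-7',  'scores': {'Accuracy': 0.89}},
    {'id': 'lr-2',  'scores': {'Accuracy': 0.86}},
]
print([bl['id'] for bl in k_best_base_learners(learners, k=2)])  # ['xgb-1', 'rf-7']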

How to import homemade modules in Xcessiv?

I'm trying to import a homemade module named preprocessing_115v (filename preprocessing_115v.py) in the main data extraction source code, but it can't seem to be found:

#############
import preprocessing_115v  # <-- where do I store the preprocessing_115v.py file for it to load here?

def extract_main_dataset():
    import pandas as pd
    df = pd.read_csv('./data.csv', sep=',', header=None)
    X = df.values
    labels = pd.read_csv('./labelsnum.csv', sep=',', header=None)
    y = labels.values
    y = y[:, 0]
    return X, y
##############

Amazing program by the way :-)
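
One approach that should work regardless of where the Xcessiv server was started is to put the module's directory on sys.path inside the extraction code itself (the path and the transform helper below are placeholders):

# Make a homemade module importable from Xcessiv's code blocks by adding its
# directory to sys.path before importing it.
import sys
sys.path.append('/path/to/folder/containing/preprocessing_115v')  # placeholder path

import preprocessing_115v

def extract_main_dataset():
    import pandas as pd
    df = pd.read_csv('./data.csv', sep=',', header=None)
    X = preprocessing_115v.transform(df.values)  # hypothetical helper in your module
    labels = pd.read_csv('./labelsnum.csv', sep=',', header=None)
    y = labels.values[:, 0]
    return X, y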

Error when fitting exported ensemble

After exporting my ensemble and trying to fit the new base_learner to my data, I encountered the following error:

FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]

I don't fully understand the error, but after a bit of digging I found that the offending line was

base_learner.fit(X[train_idx], y[train_idx]) under the fit method of the XcessivStackedEnsemble class. I fixed this by using y.iloc[train_idx] instead of y[train_idx].

I don't know if this is just a case of human error, but for anyone else having the same issue, this fixed it for me.
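
An alternative that avoids editing the exported file (a sketch; whether it fits depends on how your data is loaded) is to hand the exported ensemble plain NumPy arrays instead of pandas objects, so positional indexing like y[train_idx] behaves as intended:

import numpy as np

X_arr = np.asarray(X)
y_arr = np.asarray(y).ravel()
# "exported_ensemble" stands for whatever name the exported Python file gives
# the stacked ensemble object (an assumption here).
exported_ensemble.fit(X_arr, y_arr)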

Valid values for metric to optimise in bayesian optimisation?

Is there a list of valid metric_to_optimize values for Bayesian optimization?

I am using sklearn mean_squared_regression for my base learner, but when I enter that into the Bayesian optimization menu under metric_to_optimize I get:

assert module.metric_to_optimize in automated_run.base_learner_origin.metric_generators
AssertionError

Add a docker-compose to simplify Xcessiv startup with Redis

The wiki about how to start Xcessiv implicitly assumes an already running Redis instance. This is something docker-compose can handle for the end-user whereby starting Xcessiv would also automatically start Redis.

I have added a docker-compose file in PR #47 which starts Redis out of the box and defines a shared data directory. By simply issuing docker-compose up, Redis and Xcessiv will both start together.

Unit tests

Now that Xcessiv is in a (sort of) stable state, it's time to correct my bad practice and start writing comprehensive unit tests. Untested code is broken code.

If anybody wants to help out, that would be very awesome! If you're a React.js dev willing to write tests for the JS part of Xcessiv, that's extra awesome! This was my first ever project using JavaScript, and I have a lot to learn in that direction.

XGBRegressor model stuck in queued status

I tried to build a regression model on the Zillow data from Kaggle, available here: https://www.kaggle.com/c/zillow-prize-1/data
Here is a gist of my dataset extractor, the XGBRegressor setup, and the exception that was the last thing left in the console:
https://gist.github.com/jef5ez/a9b0650293f343682a58b0f0500f3332
I selected shuffle split for both cross-validation settings and added MSE as the learner metric.
The base learner seems to verify fine on the Boston housing data.
After hitting finalize and selecting a single base learner, a row shows up below but is stuck in the Queued status.

python 3.5.2
xcessiv (0.2.2)
xgboost (0.6a2)

Feature Request - Backup .db file

I got an error to the effect of "Error with JSON "N" at position 8345", presumably caused by my manually editing the code for one of the base learners. Once I got this error, however, none of the base learners in my project would load. I resolved it by manually deleting the base learner I had been editing from the .db file. I'll post the specifics if I can recreate it, but I'm wondering if it might be prudent to have some kind of db backup / "Last Known Good Configuration"?

Add train and test error

Hi, it would be nice to display train and test error so we can more easily evaluate whether a model is overfitting.
Very nice work so far.
Thanks

redis.exceptions.DataError at xcessiv launch

Hello,
When I try to launch xcessiv I get an error:

Traceback (most recent call last):
  File "/PATH_TO/anaconda3/bin/xcessiv", line 10, in <module>
    sys.exit(main())
  File "/PATH_TO/anaconda3/lib/python3.7/site-packages/xcessiv/scripts/runapp.py", line 51, in main
    redis_conn.get(None)  # will throw exception if Redis is unavailable
  File "/PATH_TO/anaconda3/lib/python3.7/site-packages/redis/client.py", line 1264, in get
    return self.execute_command('GET', name)
  File "/PATH_TO/anaconda3/lib/python3.7/site-packages/redis/client.py", line 774, in execute_command
    connection.send_command(*args)
  File "/PATH_TO/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 620, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/PATH_TO/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 663, in pack_command
    for arg in imap(self.encoder.encode, args):
  File "/PATH_TO/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 125, in encode
    "byte, string or number first." % typename)
redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first.

Previously I had to change
from gevent.wsgi import WSGIServer
to
from gevent.pywsgi import WSGIServer
as indicated in this issue

My Redis server is responding when I do redis-cli ping

I am on Ubuntu 18.04, with python 3.7.3 and redis 5.0.5

Do you have an idea to fix this?
Thanks!
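
The traceback shows the startup check calling redis_conn.get(None), which newer redis-py releases reject. A possible patch (a sketch, not an official fix; pinning an older redis-py would be another route) is to use a plain ping for the connectivity check:

# Check Redis availability without passing None as a key, which redis-py 3.x
# refuses with a DataError.
import redis

redis_conn = redis.StrictRedis(host='localhost', port=6379)
redis_conn.ping()   # raises redis.exceptions.ConnectionError if Redis is unreachable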
