jubakit's Introduction

Jubatus

Jubatus is an online machine learning framework that runs in a distributed environment.

See http://jubat.us/ for details.

Quick Start

We officially support Red Hat Enterprise Linux (RHEL) 6.2 or later (64-bit) and Ubuntu Server 14.04 LTS / 16.04 LTS / 18.04 LTS (64-bit). On supported systems, you can install all components of Jubatus using binary packages.

See QuickStart for a detailed description.

Red Hat Enterprise Linux 6.2 or later (64-bit)

Run the following command to register the Jubatus Yum repository with the system.

// For RHEL 6
$ sudo rpm -Uvh http://download.jubat.us/yum/rhel/6/stable/x86_64/jubatus-release-6-2.el6.x86_64.rpm

// For RHEL 7
$ sudo rpm -Uvh http://download.jubat.us/yum/rhel/7/stable/x86_64/jubatus-release-7-2.el7.x86_64.rpm

Then install the jubatus and jubatus-client packages.

$ sudo yum install jubatus jubatus-client

Now Jubatus is installed in /usr/bin/juba*.

$ jubaclassifier -f /usr/share/jubatus/example/config/classifier/pa.json

Ubuntu Server (64-bit)

Write the following line to /etc/apt/sources.list.d/jubatus.list to register the Jubatus Apt repository with the system.

// For Ubuntu 12.04 (Precise) - Deprecated (unsupported)
deb http://download.jubat.us/apt/ubuntu/precise binary/

// For Ubuntu 14.04 (Trusty)
deb http://download.jubat.us/apt/ubuntu/trusty binary/

// For Ubuntu 16.04 (Xenial)
deb http://download.jubat.us/apt/ubuntu/xenial binary/

// For Ubuntu 18.04 (Bionic)
deb [trusted=yes] http://download.jubat.us/apt/ubuntu/bionic/binary /

Now install the jubatus package.

$ sudo apt-get update
$ sudo apt-get install jubatus

Now Jubatus is installed in /opt/jubatus/bin/juba*.

$ source /opt/jubatus/profile
$ jubaclassifier -f /opt/jubatus/share/jubatus/example/config/classifier/pa.json

Other Platforms

For other platforms, refer to the documentation.

License

LGPL 2.1

Third-party libraries included in Jubatus

The Jubatus source tree includes the following third-party library.

  • cmdline (under BSD 3-Clause License)

Jubatus requires jubatus_core library. jubatus_core contains Eigen and fork of pficommon. Eigen is licensed under MPL2 (partially in LGPL 2.1 or 2.1+). The fork of pficommon is licensed under New BSD License.

Update history

Update history can be found in ChangeLog or WikiPage.

Contributors

Patches contributed by those people.

jubakit's People

Contributors

kmaehashi, komainu8, rimms, sakuraikaito, shiodat, tkrudagawa, torash, yukimori

jubakit's Issues

Fix to work on OS X (El Capitan or later)

Due to jubatus/jubatus#892, jubakit cannot be used on macOS (>= El Capitan) on non-standard installation path (added on 2016-08-24). In the meantime I want to add workaround code.
I'm thinking of defining DYLD_LIBRARY_PATH before executing Jubatus server process based on PATH.
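
The workaround idea above could be sketched roughly as follows. This is only an illustration, not jubakit's actual code: the helper name build_server_env and the assumption that libraries live in <prefix>/lib when the binary is in <prefix>/bin are hypothetical.

```python
import os
import shutil

def build_server_env(command, environ=None):
    """Return a copy of the environment with DYLD_LIBRARY_PATH derived from PATH."""
    env = dict(os.environ if environ is None else environ)
    # Locate the server binary the same way the shell would, using PATH.
    executable = shutil.which(command, path=env.get('PATH'))
    if executable is None:
        return env  # command not found; leave the environment untouched
    # Assumption: <prefix>/bin/<command> implies libraries in <prefix>/lib.
    prefix = os.path.dirname(os.path.dirname(os.path.realpath(executable)))
    libdir = os.path.join(prefix, 'lib')
    existing = env.get('DYLD_LIBRARY_PATH')
    # Prepend the derived directory, keeping any value that was already set.
    env['DYLD_LIBRARY_PATH'] = libdir if not existing else libdir + os.pathsep + existing
    return env
```

The returned dictionary could then be passed as the env argument of subprocess.Popen when launching the server process.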

Integrate model file manipulation tools

We often need to directly work with saved model files (call save RPC, move the file from /tmp, modify the model file, copy over to another server, call load RPC, etc.)
It is very convenient if these features are built-in to one jubakit package.

We already have CLI tools to do this (but they only work with Python 2).

https://github.com/kmaehashi/jubatus-utils

I'd like to integrate jubamodel (that can manually modify the model contents and fix corrupt CRC headers) and jubafetch (that can seamlessly save / load model files) commands into jubakit.

Non-static Datasets do not allow automatic Schema prediction

Non-static Datasets do not allow automatic Schema prediction. This is inconvenient and should be fixed.

>>> from jubakit.loader.array import ZipArrayLoader
>>> from jubakit.anomaly import Dataset
>>> 
>>> loader = ZipArrayLoader(k1=[1,4], k2=[2,5], k3=[3,6])
>>> ds = Dataset(loader, static=False)
>>> for x in ds: print(x)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kenichi/Development/jubakit/jubakit/base.py", line 393, in __iter__
    yield (self._index, self._schema.transform(row))
AttributeError: 'NoneType' object has no attribute 'transform'

Tornado 4.5 doesn't support Python 2.6

Tornado 4.5, which was released a few days ago, doesn't support Python 2.6.
As a result, jubakit can no longer be built with Python 2.6 (Travis CI error).

Note:
The msgpack module depends on Tornado.

jubamodel: add config replacer

For advanced users it is convenient if config in model files can be replaced.
For example, local_sensitivity option of NN-based classifier is an estimation-time option, i.e., it is okay to change it after training.

I propose adding --replace-config option that replaces the config in the model file.

$ jubamodel --replace-config new_config.json -o binary -O new_model.jubatus old_model.jubatus

Typo in variable name in error message on port number conflict

Exception ignored in: <bound method _ServiceBackend.__del__ of <jubakit.base._ServiceBackend object at 0x7f4ff8e4b6d8>>
Traceback (most recent call last):
  File "/home/jubatus/Development/jubakit/jubakit/base.py", line 576, in __del__
    self._unassign_port(self.port, proc.pid)
  File "/home/jubatus/Development/jubakit/jubakit/base.py", line 659, in _unassign_port
    raise RuntimeError('port {0} is used by PID {1}, not PID {2}'.format(port, m[port], pcid))
NameError: name 'pcid' is not defined
659       raise RuntimeError('port {0} is used by PID {1}, not PID {2}'.format(port, m[port], pcid))

pcid should be pid.

Add integration tests

Currently, jubakit integration testing is done only through the example scripts, which do not provide good coverage, especially of error cases (Jubatus process invocation failure, etc.). We need to add such tests.

Discuss about what service to implement

Currently the following services are not implemented and do not have a specific issue ticket.

  • Stat
  • Graph
  • Bandit

I think it is better to implement them upon user request.

Add RDBMS loader

Loading training/test dataset from RDBMS (PostgreSQL, MySQL, ODBC, etc.) is one of the common needs.
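
A minimal sketch of what the row-reading part of such a loader might look like, using only the Python DB-API (PEP 249) so that it is not tied to one database. The function name iter_rows is hypothetical, and sqlite3 is used here only so the example is self-contained; how this would plug into jubakit's loader classes is not shown.

```python
import sqlite3

def iter_rows(connection, query, params=()):
    """Yield each result row as a dict keyed by column name."""
    cursor = connection.cursor()
    try:
        cursor.execute(query, params)
        # cursor.description holds one tuple per column; item 0 is the name.
        columns = [desc[0] for desc in cursor.description]
        for record in cursor:
            yield dict(zip(columns, record))
    finally:
        cursor.close()

# Demonstration with an in-memory SQLite database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE iris (sepal_length REAL, species TEXT)')
conn.executemany('INSERT INTO iris VALUES (?, ?)',
                 [(5.1, 'setosa'), (7.0, 'versicolor')])
rows = list(iter_rows(conn, 'SELECT sepal_length, species FROM iris ORDER BY sepal_length'))
```

Other DB-API drivers (for PostgreSQL, MySQL, or ODBC) expose the same cursor interface, although their parameter placeholder style may differ from the `?` used by sqlite3.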

Missing "clear_row" method in jubakit.Recommender

Problem Summary

The official documentation describes a clear_row method in jubakit.Recommender, but the actual implementation has no such method.

Versions

  • jubakit==0.6.1

Remarks

  • I'm not sure whether jubakit supports all methods of the client API, but this could be an essential function for providing a "valid" top-N recommendation in actual services.

Thank you in advance. : )

Classifier service does not allow schema without LABEL

Schema of Classifier service does not allow creating schema without LABEL column.

>>> from jubakit.classifier import Schema
>>> Schema({"name": Schema.STRING})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kenichi/Development/jubakit/jubakit/classifier.py", line 21, in __init__
    self._label_key = self._get_unique_mapping(mapping, fallback, self.LABEL, 'LABEL')
  File "/Users/kenichi/Development/jubakit/jubakit/base.py", line 123, in _get_unique_mapping
    raise RuntimeError('{0} key must be specified in schema'.format(name))
RuntimeError: LABEL key must be specified in schema

It was an intended spec at design time, but I think this restriction was unneeded.
It is rather inconvenient when using different Datasets for training and classification; Schema of classification Dataset does not require LABEL column.

remove Python 2.6 test in coverall.

Coverall doesn't support Python 2.6.

$ pip install coverall
DEPRECATION: Python 2.6 is no longer supported by the Python core team, please upgrade your Python. A future version of pip will drop support for Python 2.6
Collecting coverall
/home/daats/.pyenv/versions/2.6.9/lib/python2.6/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:318: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/home/daats/.pyenv/versions/2.6.9/lib/python2.6/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Could not find a version that satisfies the requirement coverall (from versions: )
No matching distribution found for coverall

Exception in service destructor

When a running Service instance is destroyed while the Python process is terminating, the following message is sometimes displayed. We should avoid seeking an already-closed log buffer.

Exception ignored in: <bound method BaseService.__del__ of <jubakit.classifier.Classifier object at 0x7f219fbd8320>>
Traceback (most recent call last):
  File "/home/jubatus/Development/jubakit/jubakit/base.py", line 422, in __del__
    backend.stop()
  File "/home/jubatus/Development/jubakit/jubakit/base.py", line 595, in stop
    self._logbuf.seek(0)
  File "/home/jubatus/.pyenv/versions/3.5.1/lib/python3.5/tempfile.py", line 483, in func_wrapper
    return func(*args, **kwargs)
ValueError: seek of closed file
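
One way to guard against this could look like the sketch below. It is not the actual fix applied in jubakit; the helper name read_log is hypothetical, and it only assumes that the log buffer is a file-like object such as one created by the tempfile module.

```python
import tempfile

def read_log(logbuf):
    """Return the buffered log as bytes, or b'' if the buffer is unavailable."""
    # Treat a missing buffer, or one without a usable 'closed' flag, as closed.
    if logbuf is None or getattr(logbuf, 'closed', True):
        return b''
    try:
        logbuf.seek(0)
        return logbuf.read()
    except ValueError:
        # The file was closed between the check and the seek.
        return b''
```

Checking the closed flag first avoids the exception in the common case, while the except clause covers the window in which interpreter shutdown closes the file after the check.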

jubamodel: add model transformation

While investigating or debugging models, sometimes I want to see raw results from its backend engine. For example, in NN-based classifier, it is hard to tell how scores are calculated, because we cannot see actual distances between registered points.

In such cases, it is convenient if the backend NN model can be extracted from an NN-based classifier model like this:

$ jubamodel --transform nearest_neighbor -o binary -O nn_model.jubatus classifier_model.jubatus

Then we can load the nn_model.jubatus file into jubanearest_neighbor and issue queries such as neighbor_row_from_datum to compute distances between points.

So, I propose adding --transform <service_name> option, which transforms the input model into desired backend model. Possible transformations may be:

  • classifier, regression, recommender, anomaly, clustering -> weight
  • classifier(NN), regression(NN), anomaly(light_lof) -> NN
  • anomaly(lof) -> recommender

Prototype: https://gist.github.com/kmaehashi/d85662929c3e2698daa9845fc516d1c1

Improve support of bool type

Currently, when a bool-typed entry is mapped to Schema.NUMBER, False is converted into 0. However, Jubatus ignores features whose value is 0. We should consider how to handle the bool type.
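
As one possible direction (an illustration only, not a decided design), bool values could be rewritten before they reach the schema, either as categorical strings or as a nonzero numeric pair. The function name normalize_bools and the chosen encodings are hypothetical.

```python
def normalize_bools(row, as_string=True):
    """Return a copy of row with bool values replaced by a non-zero encoding."""
    out = {}
    for key, value in row.items():
        if isinstance(value, bool):
            if as_string:
                # Categorical encoding: suitable for a string-typed column.
                out[key] = 'true' if value else 'false'
            else:
                # Numeric encoding that avoids 0 for False.
                out[key] = 1.0 if value else -1.0
        else:
            out[key] = value
    return out
```

The string form keeps True and False as two distinct features, while the numeric form keeps a single feature whose sign carries the value.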

Add visualization example

Visualization of the training process is useful for model evaluation, e.g. validation accuracy curves.

Support parallel classification

As in scikit-learn (predict(..., n_jobs = 3)), it would be better to support threaded classification so that experiments can be done faster.
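
A rough sketch of the threading idea using the standard library follows. The function classify_parallel and its classify argument are hypothetical stand-ins, not jubakit API; whether a single client connection can be shared safely between threads is an open assumption, and each worker may need its own client.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_parallel(classify, records, n_jobs=3):
    """Apply classify() to every record using n_jobs worker threads.

    Results are returned in the same order as the input records.
    """
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        # executor.map preserves input order even if tasks finish out of order.
        return list(executor.map(classify, records))
```

Threads fit here because the work is dominated by waiting on RPC responses rather than by Python-level computation.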

Support persisting Dataset class

When evaluating an ML model, we often need to train and test on the same dataset multiple times, with different variations of feature extraction and hyperparameters.
It would be better if the Dataset class could be serialized, so that we don't have to re-create the Dataset instance when repeatedly running experiments.
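
The general idea can be illustrated with pickle on plain records. This is a sketch under assumptions: save_records and load_records are hypothetical helpers, and whether jubakit's Dataset objects themselves can be pickled is not stated here.

```python
import pickle

def save_records(records, path):
    """Materialize an iterable of records and write it to path."""
    with open(path, 'wb') as f:
        pickle.dump(list(records), f, protocol=pickle.HIGHEST_PROTOCOL)

def load_records(path):
    """Read back the list of records written by save_records()."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.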

Integrate shell feature to jubakit

We decided to add a command line feature to jubakit.
The feature is currently provided in a separate third-party module: https://github.com/kmaehashi/jubash

TODOs are:

  • Port things to Python 3
  • Integrate with jubakit module
  • Write some tests
  • Write some docs

I'm thinking of integrating jubash into jubakit so that it can even be used as an "on-the-fly model debugging" feature. Actually this is what I want sometimes.

import math

for (idx, row_id, flag, score) in anomaly.calc_score(dataset):
  if math.isnan(score):  # NaN never compares equal, so `score == float('nan')` is always False
    anomaly.shell()  # drop into interactive shell; the process continues when shell exits.

Why is the RPC error 'server config mismatched' raised?

Hello,
I'm trying out jubakit.
I want to back up and restore training data,
but jubakit raises the RPC error 'server config mismatched'.
I always use 'jubakit.anomaly.Config()'.
I cannot resolve it.

Please advise.

Add command line interface

I want something like this, so that no coding is required.

jubakit classifier {--train | --classify | --cross-validate 3} {--csv "iris.csv" | --db "psql://user:password@localhost/dbname/tablename"} --label "Species" [--schema ...]
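
The option layout above could be prototyped with argparse roughly as follows. This is only a sketch of the parsing step: build_parser is a hypothetical name, the --schema option is left out because its format is not specified above, and nothing here invokes jubakit.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog='jubakit')
    subparsers = parser.add_subparsers(dest='service')
    subparsers.required = True  # a service name must be given

    classifier = subparsers.add_parser('classifier')
    # Exactly one of --train / --classify / --cross-validate K.
    mode = classifier.add_mutually_exclusive_group(required=True)
    mode.add_argument('--train', action='store_true')
    mode.add_argument('--classify', action='store_true')
    mode.add_argument('--cross-validate', type=int, metavar='K')
    # Exactly one data source: a CSV file or a database URL.
    source = classifier.add_mutually_exclusive_group(required=True)
    source.add_argument('--csv')
    source.add_argument('--db')
    classifier.add_argument('--label', required=True)
    return parser

args = build_parser().parse_args(
    ['classifier', '--cross-validate', '3', '--csv', 'iris.csv', '--label', 'Species'])
```

Mutually exclusive groups map directly onto the `{a | b | c}` notation in the proposal, so invalid combinations are rejected by the parser itself.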

Transforming the label key in a Schema definition causes RuntimeError when extracting training results with the 'in' operator.

Problem Summary

  • If we transform the label key in a Schema definition, it causes a RuntimeError in the subsequent training process.
  • The RuntimeError occurs when we extract training results via the 'in' operator.

How to reproduce the error

  • Install jubakit==0.6.1

Define a schema with a label for a classifier

EmailSchema = Schema(
    {
        'category': (Schema.LABEL, 'category'), # This transform causes the error.
        'subject': (Schema.STRING, 'subject'),
        'body': (Schema.STRING, 'text'),
        'attachment_names': (Schema.STRING, 'attachments'),
    }, fallback=Schema.IGNORE
)

Then extract the training results in main.py.

...
        loader = CustomLoader(data=training_data)
        dataset = Dataset(loader=loader, schema=EmailSchema).shuffle()
        for (idx, label) in self.service.train(dataset):
            # Raises RuntimeError.
...

Crash traceback

Traceback (most recent call last):
...
  File "/Users/user/MyAppDir/venv/lib/python3.6/site-packages/jubakit/classifier.py", line 140, in train
    for (idx, (label, d)) in dataset:
  File "/Users/user/MyAppDir/venv/lib/python3.6/site-packages/jubakit/base.py", line 405, in __iter__
    yield (self._index, self._schema.transform(row))
  File "/Users/user/MyAppDir/venv/lib/python3.6/site-packages/jubakit/classifier.py", line 32, in transform
    d = self._transform_as_datum(row, None, [self._label_key])
  File "/Users/user/MyAppDir/venv/lib/python3.6/site-packages/jubakit/base.py", line 198, in _transform_as_datum
    self._add_to_datum(d, key_type, key_name, value)
  File "/Users/user/MyAppDir/venv/lib/python3.6/site-packages/jubakit/base.py", line 180, in _add_to_datum
    raise RuntimeError('invalid type {0} for key {1}'.format(t, k))
RuntimeError: invalid type l for key category

Remarks

  • If we don't extract the training result, the error won't happen.
...
        loader = CustomLoader(data=training_data)
        dataset = Dataset(loader=loader, schema=EmailSchema).shuffle()
        self.service.train(dataset)  # This works.
...

I'm not sure whether this behavior is a bug or intended, so I opened this issue.
Thank you in advance.
