cesium's People

Contributors

acrellin, bfhealy, bnaul, filmor, gitter-badger, hatsy, jiesuncal, jrkagumba, maldil, milesial, profjsb, stefanv, yanncabanes

cesium's Issues

Handle NaN or infinite feature values

Currently the user sees a generic "An error occurred while processing your request. Please try again at a later time." when trying to build a model using features that have invalid values (i.e., values that scikit-learn doesn't accept). Should either fail more verbosely or, preferably, retrain without those features and warn the user that they were omitted.
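
Something like the following (a hypothetical helper; the name and signature are illustrative, not existing code) could filter the offending columns before training and warn the user which features were omitted:

```python
import warnings
import numpy as np

def drop_invalid_features(X, feature_names):
    """Drop feature columns containing NaN or infinite values.

    Returns the filtered array, the names kept, and the names dropped,
    so the caller can warn the user instead of failing opaquely.
    (Hypothetical helper, not current API.)
    """
    valid = np.isfinite(X).all(axis=0)
    dropped = [n for n, ok in zip(feature_names, valid) if not ok]
    if dropped:
        warnings.warn("Omitted features with invalid values: %s"
                      % ", ".join(dropped))
    kept = [n for n, ok in zip(feature_names, valid) if ok]
    return X[:, valid], kept, dropped
```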

Cryptic error when missing docker images

Running the custom feature tests without first pulling or building the required docker images gives an error like: APIError: 500 Server Error: Internal Server Error ("json: cannot unmarshal string into Go value of type struct {}"). Should catch this somewhere higher up and present an error that relates to docker somehow.
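
One option, sketched here (the image name is hypothetical), is to check up front that the required images are present and raise an error that actually mentions docker:

```python
def check_required_images(available, required=("mltsp/base",)):
    """Raise a docker-specific error if required images are missing.

    `available` is a list of image tags (e.g. collected from `docker images`);
    `required` defaults to a hypothetical image name for illustration.
    """
    missing = [img for img in required
               if not any(tag.startswith(img) for tag in available)]
    if missing:
        raise RuntimeError(
            "Missing docker image(s): %s. Pull or build them before "
            "running the custom feature tests." % ", ".join(missing))
```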

Python3 Compatibility

Seems like we're pretty close at this point, should we go ahead and try to fix the remaining few problems? @stefanv, did you want to take care of this or should @acrellin or I?

Consider these features for inclusion

Model selection changes in `sklearn`

Something to be aware of: starting in 0.18.0, sklearn.grid_search is deprecated:

In [6]: from sklearn import grid_search
sklearn/cross_validation.py:43: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
sklearn/grid_search.py:43: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Not worth holding up #128, but it's worth keeping in mind; we'll probably want to switch to the new interface as soon as it's available, before we accumulate too much code that uses the old modules.
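
For reference, the new-style usage looks like this (the estimator and parameter grid are purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # replaces sklearn.grid_search

X, y = load_iris(return_X_y=True)
params = {"n_estimators": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
search.fit(X, y)
print(search.best_params_)
```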

Improve exception logging in Flask app

Currently the error messages written to the log file are very uninformative, along the lines of "An error has occurred." The actual text of the exception is typically printed only to stdout, where it is lost forever. This is problematic for testing and will eventually be a problem in production as well.
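
A minimal stdlib sketch of the fix (the logger name and wrapper function are hypothetical): `logging.exception` records the exception text and full traceback through the configured handlers, so it lands in the log file rather than vanishing on stdout.

```python
import logging

logger = logging.getLogger("flask_app")  # hypothetical logger name

def handle_request(task):
    """Run a task, logging the full traceback instead of a generic message."""
    try:
        return task()
    except Exception:
        # logging.exception writes the exception text and traceback
        # to the configured log handlers at ERROR level.
        logger.exception("Error while processing request")
        return "An error has occurred."
```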

Add feature plots to library

Right now, the plots are generated on the front end; as we move towards a new plotting framework, it would be nice to have pure Python functions that generate feature summary plots from the backend. They'd make for nice additions to example notebooks.

Use LombScargleFast from gatspy?

I did a quick benchmark of our Lomb-Scargle implementation versus the ones in gatspy.

Benchmark

import numpy as np
import gatspy.periodic

N = 2001
times = np.sort(np.random.random(size=N))
values = 8 * np.sin(2*np.pi*times*5.3 + 0.1) + np.random.normal(scale=0.1, size=N)  # sinusoidal-ish
errors = np.random.exponential(scale=0.1, size=N)

# Use parameters comparable to our LS implementation
opt_args = {'period_range': (1./33., times.max()), 'quiet': True}
# 8-harmonic model with regularization (our default)
%timeit gatspy.periodic.LombScargle(Nterms=8, regularization=1., regularize_by_trace=False, fit_period=True, optimizer_kwds=opt_args).fit(times, values, errors)
# 1-harmonic model without regularization (Fast version)
%timeit gatspy.periodic.LombScargleFast(fit_period=True, optimizer_kwds=opt_args).fit(times, values, errors)

# Use roughly the same number of grid points as gatspy chooses
# (fit_lomb_scargle is our C implementation)
numf = 1500
df = (33 - 1./times.max()) / numf
# 8-harmonic model with regularization (our default)
%timeit fit_lomb_scargle(times, values, errors, 1/times.max(), df, numf)
# 1-harmonic model without regularization (Fast version)
%timeit fit_lomb_scargle(times, values, errors, 1/times.max(), df, numf, lambda0=0., nharm=1)

Output

1 loops, best of 3:     962 ms per loop        # gatspy, full model
100 loops, best of 3:    22 ms per loop        # gatspy, simple model
1 loops, best of 3:     888 ms per loop        # ours, full model
1 loops, best of 3:     239 ms per loop        # ours, simple model

Conclusion

The LombScargleFast function from gatspy, which fits a much simpler model to the data, is much faster than our C implementation. I think it would be good to add some features based on LombScargleFast, since they would be faster to compute and easier for non-expert users to understand. We could keep the existing implementation, since it computes some additional statistical quantities that the gatspy functions do not, but those seem like features that only expert users would be interested in.
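
For anyone unfamiliar with what the Fast version computes, here's a pure-NumPy sketch of the classic single-harmonic Lomb-Scargle periodogram. Gatspy's actual implementation uses a much faster O(N log N) algorithm; this direct version is only to illustrate the model being fit.

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classic single-harmonic Lomb-Scargle periodogram (direct evaluation).

    One harmonic, no regularization -- the model LombScargleFast fits.
    For illustration only; not how gatspy computes it internally.
    """
    y = y - y.mean()
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2 * np.pi * f
        # Phase offset tau makes the cosine and sine terms orthogonal.
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * (np.sum(y * c) ** 2 / np.sum(c ** 2) +
                          np.sum(y * s) ** 2 / np.sum(s ** 2))
    return power
```

On a signal like the benchmark's (frequency 5.3), the argmax of this periodogram over a frequency grid recovers the true frequency.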

Thoughts?

Docker in OS X

I was troubleshooting my Docker installation, but it looks like everything is set up correctly and the issue is a bit more complicated: on OS X, Docker actually runs inside a Linux VM, so in particular there is no local /var/run/docker.sock file like the one referenced in util.py.

Happy to look into this if you like, I could stand to get a bit more familiar with Docker...
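
One possible direction, as a sketch (a hypothetical helper, not current util.py code): fall back to the DOCKER_HOST environment variable, which docker-machine/boot2docker export on OS X, instead of hard-coding the Unix socket path.

```python
import os

def docker_socket_url():
    """Return a docker daemon URL that works on both Linux and OS X.

    On OS X the daemon runs inside a VM, so there is no local
    /var/run/docker.sock; the DOCKER_HOST variable (a tcp:// URL)
    points at the VM's socket instead. Hypothetical helper.
    """
    host = os.environ.get("DOCKER_HOST")
    if host:
        return host
    return "unix://var/run/docker.sock"
```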

Improve API for featurization/model building/prediction

Right now everything is decomposed in a way that makes sense for computation driven by requests from the front end. But calling our code directly to analyze a dataset is kind of a nightmare: the featurization, model-building, and prediction functions all operate on file paths where either data or some output from another task is stored, so any task that wants to use just part of our pipeline (without conforming entirely to our way of handling data) is basically impossible.

It would be much better if our back end were composed of more elementary functions that were usable in their own right (I don't know exactly what that would look like yet). As it stands, our API is useless for anything except an exact reproduction of a front-end workflow.
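
To make the proposal concrete, a sketch of what an elementary, array-based function might look like (the signature and feature names are made up, not an existing API):

```python
import numpy as np

def featurize(times, values, features=("mean", "std", "amplitude")):
    """Compute features directly from in-memory arrays, not file paths.

    Returns a dict mapping feature name to value, so callers can use
    just this piece of the pipeline without our file-handling layer.
    """
    computers = {
        "mean": lambda t, v: np.mean(v),
        "std": lambda t, v: np.std(v),
        "amplitude": lambda t, v: 0.5 * (np.max(v) - np.min(v)),
    }
    return {name: computers[name](times, values) for name in features}
```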

Fix exceptions raised during Casper tests

Of the form:

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 51132)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 657, in __init__
    self.finish()
  File "/usr/lib/python2.7/SocketServer.py", line 716, in finish
    self.wfile.close()
  File "/usr/lib/python2.7/socket.py", line 283, in close
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
----------------------------------------

See also: http://stackoverflow.com/a/22560461/214686

Sample datasets module

It would be nice to have a module (mltsp.datasets?) that allows for easy fetching of a few sample datasets; any tutorials we write up could then use these built-in functions rather than requiring the user to manually download the data.

  1. Is it OK to download data from public URLs outside our control, or should we host the files ourselves?
  2. asas_training_set.tar.gz in mltsp/data/sample_data isn't used by any tests; we could migrate it out of the repo and make it downloadable through mltsp.datasets instead. EDIT: we could also just keep the relevant data in the repo; asas_training_set.tar.gz is ~3MB and my EEG dataset is ~6MB, not sure whether that's too large.
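
Whichever hosting option we pick, the fetching side could be a small cached-download helper along these lines (module, function, and cache-directory names are hypothetical):

```python
import os

def fetch_dataset(name, url, data_dir=None, downloader=None):
    """Download `name` from `url` into a local cache directory, once.

    Subsequent calls return the cached path without re-downloading.
    `downloader` is injectable for testing; it defaults to urlretrieve.
    """
    data_dir = data_dir or os.path.join(os.path.expanduser("~"),
                                        ".mltsp", "data")
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, name)
    if not os.path.exists(path):
        if downloader is None:
            from urllib.request import urlretrieve
            downloader = urlretrieve
        downloader(url, path)
    return path
```

Tutorials could then call `fetch_dataset("asas_training_set.tar.gz", <hosted URL>)` instead of asking users to download data by hand.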

Improve file handling / add tools for data management

Currently we take in a data archive, extract and featurize it, and then immediately delete everything. It would be nice to:

  • Save datasets so they can be featurized again without reuploading
  • Provide some sort of interface to see / manipulate the datasets that have been uploaded
  • Separate logic of file handling from the backend computation libraries
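
A rough sketch of what the dataset-management piece could look like (the interface is entirely hypothetical), keeping file handling separate from the computation code:

```python
import os
import shutil

class DatasetStore(object):
    """Keep uploaded archives around instead of deleting them.

    Save, list, and locate datasets by name, independent of the
    featurization/model-building code. Names are illustrative.
    """
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, name, src_path):
        """Copy an uploaded archive into the store under `name`."""
        dst = os.path.join(self.root, name)
        shutil.copy(src_path, dst)
        return dst

    def list(self):
        """Return the names of all stored datasets."""
        return sorted(os.listdir(self.root))

    def path(self, name):
        """Return the on-disk path of a stored dataset."""
        return os.path.join(self.root, name)
```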
