cesium-ml / cesium
Machine Learning Time-Series Platform
License: Other
Currently the user sees a generic "An error occurred while processing your request. Please try again at a later time." message when trying to build a model using features that have invalid values (e.g., NaN or infinite values, which scikit-learn doesn't accept). We should either fail more verbosely or, preferably, retrain without those features and warn the user that they were omitted.
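A minimal sketch of the "retrain without those features" option, assuming the computed features arrive as a numeric pandas DataFrame (the actual data layout in the pipeline may differ):

```python
import numpy as np

def drop_invalid_features(feature_df):
    """Drop feature columns containing values scikit-learn rejects (NaN/inf).

    Sketch only: assumes numeric columns. Returns the cleaned DataFrame and
    the omitted column names so the caller can warn the user about them.
    """
    invalid = [col for col in feature_df.columns
               if not np.all(np.isfinite(feature_df[col].values))]
    return feature_df.drop(invalid, axis=1), invalid
```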
Running the custom feature tests without first pulling or building the required docker images gives an error like: APIError: 500 Server Error: Internal Server Error ("json: cannot unmarshal string into Go value of type struct {}"). We should catch this somewhere higher up and present an error message that points to Docker as the cause.
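One possible shape for that catch, assuming the pre-2.0 docker-py Client API (the wrapped call and the message wording are illustrative, not the actual call site):

```python
import docker
import docker.errors

def create_container_or_explain(client, *args, **kwargs):
    """Wrap a docker-py call so Docker failures surface as a Docker-related
    message instead of a bare APIError. Sketch only."""
    try:
        return client.create_container(*args, **kwargs)
    except docker.errors.APIError as e:
        raise RuntimeError("Docker request failed ({0}). Make sure the "
                           "required images have been pulled or built before "
                           "running the custom feature tests.".format(e))
```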
E.g., "/foo".
From an e-mail Josh wrote earlier:
Tables 4 and 5 of http://iopscience.iop.org/0004-637X/733/1/10/pdf/apj_733_1_10.pdf (with a few links to other papers for those features, including the Stetson indices). A good extension can be found in papers referencing that one (http://adsabs.harvard.edu/cgi-bin/nph-ref_query?bibcode=2011ApJ...733...10R&refs=CITATIONS&db_key=AST); e.g., this paper looks interesting: http://mnras.oxfordjournals.org/content/437/1/147.full.pdf
Something to be aware of: starting in 0.18.0, sklearn.grid_search is deprecated:

```
In [6]: from sklearn import grid_search
sklearn/cross_validation.py:43: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
sklearn/grid_search.py:43: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
```
Not worth holding up #128, but it's worth keeping in mind; we will probably want to switch to the new interface soon, before we have too much code that uses the old one.
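For reference, the migration is mostly a matter of swapping import paths; the main behavioral change is in the CV-iterator interface:

```python
# Before (deprecated in 0.18, removed in 0.20):
#     from sklearn.grid_search import GridSearchCV
#     from sklearn.cross_validation import KFold; cv = KFold(n, n_folds=5)
# After:
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier

# The new CV iterators no longer take the dataset size up front.
cv = KFold(n_splits=5)
model = GridSearchCV(RandomForestClassifier(),
                     param_grid={'n_estimators': [10, 50, 100]}, cv=cv)
```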
Currently the error messages written to the log file are very uninformative, along the lines of "An error has occurred." The actual text of the exception is usually printed to stdout, which is usually lost forever. This is problematic for testing and also will eventually be a problem in production.
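A minimal sketch of one fix, using the stdlib logging module (the handler and format details here are placeholders):

```python
import logging

logging.basicConfig(filename='mltsp.log', level=logging.INFO)

def run_logged(task, *args, **kwargs):
    """Run a task, recording the full traceback in the log file on failure.

    logging.exception() captures the exception text and traceback, so the
    error is preserved even when stdout is lost."""
    try:
        return task(*args, **kwargs)
    except Exception:
        logging.exception("Task %r failed", task)
        raise
```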
Simplifies configuration.
Right now, the plots are generated on the front end; as we move towards a new plotting framework, it would be nice to have pure Python functions that generate feature summary plots from the backend. They'd make for nice additions to example notebooks.
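As a starting point, a backend feature-summary plot could be as simple as this sketch (assumes features arrive as a pandas DataFrame; the plot type and layout are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_histograms(feature_df, bins=30):
    """One histogram per feature column; returns the Figure for notebooks."""
    cols = list(feature_df.columns)
    fig, axes = plt.subplots(len(cols), 1, figsize=(6, 2 * len(cols)))
    for ax, col in zip(np.atleast_1d(axes), cols):
        ax.hist(feature_df[col].dropna().values, bins=bins)
        ax.set_title(col)
    fig.tight_layout()
    return fig
```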
I did a quick benchmark of our Lomb-Scargle implementation versus the ones in gatspy.
```python
import numpy as np
import gatspy.periodic

# fit_lomb_scargle is our internal C-backed implementation
# (import path omitted here)

N = 2001
times = np.sort(np.random.random(size=N))
values = 8 * np.sin(2 * np.pi * times * 5.3 + 0.1) + np.random.normal(scale=0.1, size=N)  # sinusoidal-ish
errors = np.random.exponential(scale=0.1, size=N)

# Use parameters comparable to our LS implementation
opt_args = {'period_range': (1. / 33., times.max()), 'quiet': True}

# 8-harmonic model with regularization (our default)
%timeit gatspy.periodic.LombScargle(Nterms=8, regularization=1., regularize_by_trace=False, fit_period=True, optimizer_kwds=opt_args).fit(times, values, errors)
# 1-harmonic model without regularization (fast version)
%timeit gatspy.periodic.LombScargleFast(fit_period=True, optimizer_kwds=opt_args).fit(times, values, errors)

# Use roughly the same number of grid points as gatspy chooses
numf = 1500
df = (33 - 1. / times.max()) / numf

# 8-harmonic model with regularization (our default)
%timeit fit_lomb_scargle(times, values, errors, 1. / times.max(), df, numf)
# 1-harmonic model without regularization (fast version)
%timeit fit_lomb_scargle(times, values, errors, 1. / times.max(), df, numf, lambda0=0., nharm=1)
```

```
1 loops, best of 3: 962 ms per loop    # gatspy, full model
100 loops, best of 3: 22 ms per loop   # gatspy, simple model
1 loops, best of 3: 888 ms per loop    # ours, full model
1 loops, best of 3: 239 ms per loop    # ours, simple model
```
The LombScargleFast class from gatspy, which fits a much simpler model to the data, is much faster than our C implementation. I think it would be good to add some features based on LombScargleFast, since they would be faster and easier to understand for non-expert users. We could leave in the existing implementation, as it computes some additional statistical quantities that the gatspy functions do not, but those seem like features that only truly expert users would be interested in.
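Concretely, a LombScargleFast-based feature might look something like this sketch (the feature names and period_range bounds are illustrative, not settled choices):

```python
from gatspy.periodic import LombScargleFast

def simple_ls_features(times, values, errors):
    """Fast single-harmonic periodic features via gatspy. Sketch only."""
    model = LombScargleFast(
        fit_period=True,
        optimizer_kwds={'period_range': (1. / 33., times.max()),
                        'quiet': True})
    model.fit(times, values, errors)
    return {'ls_fast_best_period': model.best_period,
            'ls_fast_peak_power': model.score(model.best_period)}
```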
Thoughts?
I was troubleshooting my Docker installation, but it looks like everything is set up correctly and the issue is a bit more complicated: on OS X, Docker actually runs inside a Linux VM, so in particular there is no local /var/run/docker.sock file as referenced in util.py.
Happy to look into this if you like, I could stand to get a bit more familiar with Docker...
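For what it's worth, a sketch of an env-aware connection helper, assuming the pre-2.0 docker-py API (docker.Client / kwargs_from_env); on OS X, docker-machine exports DOCKER_HOST and friends pointing at the VM:

```python
import os
import docker
import docker.utils

def get_docker_client():
    """Connect via DOCKER_HOST when set (OS X VM), else the local socket."""
    if os.environ.get('DOCKER_HOST'):
        return docker.Client(**docker.utils.kwargs_from_env(
            assert_hostname=False))
    return docker.Client(base_url='unix://var/run/docker.sock')
```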
Right now everything is decomposed in a way that makes sense from the perspective of doing computation driven by requests from the front end. But trying to call our code directly to analyze a dataset is kind of a nightmare: the functions for featurization, model building, and prediction all operate on file paths where either data or some output from another task is stored. So any task that wants to make use of just part of our pipeline (without conforming entirely to our way of handling data) is basically impossible.
It would be much better if our back end was composed of more elementary functions that were usable in their own right (I don't know exactly what that would look like yet). Right now I'd say our API is useless for anything except an exact reproduction of a workflow from the front end.
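For concreteness, the kind of in-memory API I mean, sketched as hypothetical signatures (names and arguments are purely illustrative):

```python
def featurize(times, values, errors=None, features_to_use=None):
    """Compute features for one in-memory time series; return a dict,
    not a path to an output file."""

def build_model(feature_dicts, labels, model_type='random_forest'):
    """Fit and return a scikit-learn model from in-memory features."""

def predict(model, feature_dicts):
    """Return predictions for already-featurized, in-memory data."""
```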
Users may want to pip install mltsp without installing and running RabbitMQ & Celery, and MLTSP should still be able to do everything serially.
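A sketch of what the dispatch could look like; the USE_CELERY flag is a placeholder for however configuration ends up working, and the Celery branch assumes tasks are declared as Celery tasks:

```python
USE_CELERY = False  # hypothetical config flag

def run_task(task, *args, **kwargs):
    """Run a task via Celery when enabled, otherwise serially in-process."""
    if USE_CELERY:
        return task.delay(*args, **kwargs).get()  # Celery AsyncResult
    return task(*args, **kwargs)
```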
Of the form:
```
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 51132)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 657, in __init__
    self.finish()
  File "/usr/lib/python2.7/SocketServer.py", line 716, in finish
    self.wfile.close()
  File "/usr/lib/python2.7/socket.py", line 283, in close
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
----------------------------------------
```
Tooltips? Clickable links that create jQuery dialogs? More generally, we should add more explanatory elements.
It would be nice to have a module (mltsp.datasets?) that allows for easy fetching of a few sample datasets; any tutorials we write up could then use these built-in functions rather than requiring the user to manually download the data. asas_training_set.tar.gz in mltsp/data/sample_data isn't being used by any tests; we could migrate it out of the repo and instead make it downloadable through mltsp.datasets. EDIT: we could also just keep the relevant data in the repo; asas_training_set.tar.gz is ~3MB and my EEG dataset is ~6MB, dunno if that's too large or not.

Ultimately, this type of scaling should happen automatically.
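A sketch of what an mltsp.datasets fetcher could look like; the URL and cache directory are placeholders:

```python
import os
import tarfile
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

SAMPLE_DATA_URL = 'https://example.com/asas_training_set.tar.gz'  # placeholder

def fetch_asas_training_set(data_dir='~/.mltsp_data'):
    """Download and unpack the ASAS sample data on first use; return its path."""
    data_dir = os.path.expanduser(data_dir)
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    archive = os.path.join(data_dir, 'asas_training_set.tar.gz')
    if not os.path.exists(archive):
        urlretrieve(SAMPLE_DATA_URL, archive)
        with tarfile.open(archive) as tf:
            tf.extractall(data_dir)
    return data_dir
```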
Used in docker extraction; see custom_feature_tools.
We should start setting up a neuroscience ML workflow. A good starting point might be the data/examples from nitime.
@fperez: we could use some help in formulating an interesting ML classification/regression problem out of these nitime examples.
Currently we take in a data archive, extract and featurize it, and then immediately delete everything. It would be nice to:
In mltsp.Flask.flask_app, errors that occur in calls run in a subprocess do not propagate up to the main app.
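One way to get the exception back into the main app, sketched with multiprocessing (assuming the subprocess work can be expressed as a picklable function call):

```python
import multiprocessing

def run_in_subprocess(func, *args):
    """Run func in a child process; an exception raised there is pickled and
    re-raised here when .get() is called, so the Flask app sees the real
    error instead of a silently failed subprocess."""
    pool = multiprocessing.Pool(processes=1)
    try:
        return pool.apply_async(func, args).get()
    finally:
        pool.close()
        pool.join()
```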
Add checks such that a dataset cannot be deleted if there are results that depend on that dataset, or at least warn the user about the consequences before proceeding.
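A minimal sketch of the guard; the in-memory registry stands in for whatever the real database layer records about which results used which dataset:

```python
RESULTS_BY_DATASET = {}  # hypothetical: dataset_id -> ids of dependent results

def delete_dataset(dataset_id, force=False):
    """Refuse to delete a dataset that existing results depend on."""
    dependents = RESULTS_BY_DATASET.get(dataset_id, [])
    if dependents and not force:
        raise ValueError("Dataset %r is used by %d result(s); delete those "
                         "first or pass force=True."
                         % (dataset_id, len(dependents)))
    # ... actually remove the dataset record here ...
```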
It's currently down, so it might be worth considering backups or more reliable third party hosting such as ReadTheDocs.
Reminder to myself to go back and fix this