cesium's People

Contributors

acrellin, bfhealy, bnaul, filmor, gitter-badger, hatsy, jiesuncal, jrkagumba, maldil, milesial, profjsb, stefanv, yanncabanes

cesium's Issues

Handle NaN or infinite feature values

Currently the user sees a generic "An error occurred while processing your request. Please try again at a later time." when trying to build a model using features that have invalid values (i.e., values that scikit-learn doesn't accept). Should either fail more verbosely or, preferably, retrain without those features and warn the user that they were omitted.
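
Something like the following (a hypothetical helper; the name and signature are illustrative, not existing code) could filter the offending columns before training and warn the user which features were omitted:

```python
import warnings
import numpy as np

def drop_invalid_features(X, feature_names):
    """Drop feature columns containing NaN or infinite values.

    Returns the filtered array, the names kept, and the names dropped,
    so the caller can warn the user instead of failing opaquely.
    (Hypothetical helper, not current API.)
    """
    valid = np.isfinite(X).all(axis=0)
    dropped = [n for n, ok in zip(feature_names, valid) if not ok]
    if dropped:
        warnings.warn("Omitted features with invalid values: %s"
                      % ", ".join(dropped))
    kept = [n for n, ok in zip(feature_names, valid) if ok]
    return X[:, valid], kept, dropped
```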

Cryptic error when missing docker images

Running the custom feature tests without first pulling or building the required docker images gives an error like: APIError: 500 Server Error: Internal Server Error ("json: cannot unmarshal string into Go value of type struct {}"). Should catch this somewhere higher up and present an error that relates to docker somehow.
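
One option, sketched here (the image name is hypothetical), is to check up front that the required images are present and raise an error that actually mentions docker:

```python
def check_required_images(available, required=("mltsp/base",)):
    """Raise a docker-specific error if required images are missing.

    `available` is a list of image tags (e.g. collected from `docker images`);
    `required` defaults to a hypothetical image name for illustration.
    """
    missing = [img for img in required
               if not any(tag.startswith(img) for tag in available)]
    if missing:
        raise RuntimeError(
            "Missing docker image(s): %s. Pull or build them before "
            "running the custom feature tests." % ", ".join(missing))
```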

Python3 Compatibility

Seems like we're pretty close at this point, should we go ahead and try to fix the remaining few problems? @stefanv, did you want to take care of this or should @acrellin or I?

Consider these features for inclusion

Model selection changes in `sklearn`

Something to be aware of: starting in 0.18.0, sklearn.grid_search is deprecated:

In [6]: from sklearn import grid_search
sklearn/cross_validation.py:43: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
sklearn/grid_search.py:43: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Not worth holding up #128, but it's worth keeping in mind; we'll probably want to switch to the new interface as soon as it's available, before we accumulate too much code that uses the old modules.
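
For reference, the new-style usage looks like this (the estimator and parameter grid are purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # replaces sklearn.grid_search

X, y = load_iris(return_X_y=True)
params = {"n_estimators": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
search.fit(X, y)
print(search.best_params_)
```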

Improve exception logging in Flask app

Currently the error messages written to the log file are very uninformative, along the lines of "An error has occurred." The actual text of the exception is typically printed only to stdout, where it is lost forever. This is problematic for testing and will eventually be a problem in production as well.
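
A minimal stdlib sketch of the fix (the logger name and wrapper function are hypothetical): `logging.exception` records the exception text and full traceback through the configured handlers, so it lands in the log file rather than vanishing on stdout.

```python
import logging

logger = logging.getLogger("flask_app")  # hypothetical logger name

def handle_request(task):
    """Run a task, logging the full traceback instead of a generic message."""
    try:
        return task()
    except Exception:
        # logging.exception writes the exception text and traceback
        # to the configured log handlers at ERROR level.
        logger.exception("Error while processing request")
        return "An error has occurred."
```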

Add feature plots to library

Right now, the plots are generated on the front end; as we move towards a new plotting framework, it would be nice to have pure Python functions that generate feature summary plots from the backend. They'd make for nice additions to example notebooks.

Use LombScargleFast from gatspy?

I did a quick benchmark of our Lomb-Scargle implementation versus the ones in gatspy.

Benchmark

import numpy as np
import gatspy.periodic

N = 2001
times = np.sort(np.random.random(size=N))
values = 8 * np.sin(2*np.pi*times*5.3 + 0.1) + np.random.normal(scale=0.1, size=N)  # sinusoidal-ish
errors = np.random.exponential(scale=0.1, size=N)

# Use parameters comparable to our LS implementation
opt_args = {'period_range': (1./33., times.max()), 'quiet': True}
# 8-harmonic model with regularization (our default)
%timeit gatspy.periodic.LombScargle(Nterms=8, regularization=1., regularize_by_trace=False, fit_period=True, optimizer_kwds=opt_args).fit(times, values, errors)
# 1-harmonic model without regularization (Fast version)
%timeit gatspy.periodic.LombScargleFast(fit_period=True, optimizer_kwds=opt_args).fit(times, values, errors)

# Use roughly the same number of grid points as gatspy chooses
# (fit_lomb_scargle is our C implementation)
numf = 1500
df = (33 - 1./times.max()) / numf
# 8-harmonic model with regularization (our default)
%timeit fit_lomb_scargle(times, values, errors, 1/times.max(), df, numf)
# 1-harmonic model without regularization (Fast version)
%timeit fit_lomb_scargle(times, values, errors, 1/times.max(), df, numf, lambda0=0., nharm=1)

Output

1 loops, best of 3:     962 ms per loop        # gatspy, full model
100 loops, best of 3:    22 ms per loop        # gatspy, simple model
1 loops, best of 3:     888 ms per loop        # ours, full model
1 loops, best of 3:     239 ms per loop        # ours, simple model

Conclusion

The LombScargleFast function from gatspy, which fits a much simpler model to the data, is much faster than our C implementation. I think it would be good to add some features based on LombScargleFast, since they would be faster to compute and easier for non-expert users to understand. We could keep the existing implementation, since it computes some additional statistical quantities that the gatspy functions do not, but those seem like features that only expert users would be interested in.
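
For anyone unfamiliar with what the Fast version computes, here's a pure-NumPy sketch of the classic single-harmonic Lomb-Scargle periodogram. Gatspy's actual implementation uses a much faster O(N log N) algorithm; this direct version is only to illustrate the model being fit.

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classic single-harmonic Lomb-Scargle periodogram (direct evaluation).

    One harmonic, no regularization -- the model LombScargleFast fits.
    For illustration only; not how gatspy computes it internally.
    """
    y = y - y.mean()
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2 * np.pi * f
        # Phase offset tau makes the cosine and sine terms orthogonal.
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * (np.sum(y * c) ** 2 / np.sum(c ** 2) +
                          np.sum(y * s) ** 2 / np.sum(s ** 2))
    return power
```

On a signal like the benchmark's (frequency 5.3), the argmax of this periodogram over a frequency grid recovers the true frequency.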

Thoughts?

Docker in OS X

I was troubleshooting my Docker installation, but it looks like everything is set up correctly and the issue is a bit more complicated: on OS X, Docker actually runs inside a Linux VM, so in particular there is no local /var/run/docker.sock file like the one referenced in util.py.

Happy to look into this if you like, I could stand to get a bit more familiar with Docker...
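
One possible direction, as a sketch (a hypothetical helper, not current util.py code): fall back to the DOCKER_HOST environment variable, which docker-machine/boot2docker export on OS X, instead of hard-coding the Unix socket path.

```python
import os

def docker_socket_url():
    """Return a docker daemon URL that works on both Linux and OS X.

    On OS X the daemon runs inside a VM, so there is no local
    /var/run/docker.sock; the DOCKER_HOST variable (a tcp:// URL)
    points at the VM's socket instead. Hypothetical helper.
    """
    host = os.environ.get("DOCKER_HOST")
    if host:
        return host
    return "unix://var/run/docker.sock"
```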

Improve API for featurization/model building/prediction

Right now everything is decomposed in a way that makes sense for computation driven by requests from the front end. But calling our code directly to analyze a dataset is kind of a nightmare: the featurization, model-building, and prediction functions all operate on file paths where either data or some output from another task is stored, so any task that wants to use just part of our pipeline (without conforming entirely to our way of handling data) is basically impossible.

It would be much better if our back end were composed of more elementary functions that were usable in their own right (I don't know exactly what that would look like yet). As it stands, our API is useless for anything except an exact reproduction of a front-end workflow.
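
To make the proposal concrete, a sketch of what an elementary, array-based function might look like (the signature and feature names are made up, not an existing API):

```python
import numpy as np

def featurize(times, values, features=("mean", "std", "amplitude")):
    """Compute features directly from in-memory arrays, not file paths.

    Returns a dict mapping feature name to value, so callers can use
    just this piece of the pipeline without our file-handling layer.
    """
    computers = {
        "mean": lambda t, v: np.mean(v),
        "std": lambda t, v: np.std(v),
        "amplitude": lambda t, v: 0.5 * (np.max(v) - np.min(v)),
    }
    return {name: computers[name](times, values) for name in features}
```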

Fix exceptions raised during Casper tests

Of the form:

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 51132)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 657, in __init__
    self.finish()
  File "/usr/lib/python2.7/SocketServer.py", line 716, in finish
    self.wfile.close()
  File "/usr/lib/python2.7/socket.py", line 283, in close
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
----------------------------------------

See also: http://stackoverflow.com/a/22560461/214686

Sample datasets module

It would be nice to have a module (mltsp.datasets?) that allows for easy fetching of a few sample datasets; any tutorials we write up could then use these built-in functions rather than requiring the user to manually download the data.

  1. Is it OK to download data from public URLs outside our control, or should we host the files ourselves?
  2. asas_training_set.tar.gz in mltsp/data/sample_data isn't used by any tests; we could migrate it out of the repo and make it downloadable through mltsp.datasets instead. EDIT: we could also just keep the relevant data in the repo; asas_training_set.tar.gz is ~3MB and my EEG dataset is ~6MB, not sure whether that's too large.
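
Whichever hosting option we pick, the fetching side could be a small cached-download helper along these lines (module, function, and cache-directory names are hypothetical):

```python
import os

def fetch_dataset(name, url, data_dir=None, downloader=None):
    """Download `name` from `url` into a local cache directory, once.

    Subsequent calls return the cached path without re-downloading.
    `downloader` is injectable for testing; it defaults to urlretrieve.
    """
    data_dir = data_dir or os.path.join(os.path.expanduser("~"),
                                        ".mltsp", "data")
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, name)
    if not os.path.exists(path):
        if downloader is None:
            from urllib.request import urlretrieve
            downloader = urlretrieve
        downloader(url, path)
    return path
```

Tutorials could then call `fetch_dataset("asas_training_set.tar.gz", <hosted URL>)` instead of asking users to download data by hand.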

Improve file handling / add tools for data management

Currently we take in a data archive, extract and featurize it, and then immediately delete everything. It would be nice to:

  • Save datasets so they can be featurized again without reuploading
  • Provide some sort of interface to see / manipulate the datasets that have been uploaded
  • Separate logic of file handling from the backend computation libraries
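
A rough sketch of what the dataset-management piece could look like (the interface is entirely hypothetical), keeping file handling separate from the computation code:

```python
import os
import shutil

class DatasetStore(object):
    """Keep uploaded archives around instead of deleting them.

    Save, list, and locate datasets by name, independent of the
    featurization/model-building code. Names are illustrative.
    """
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, name, src_path):
        """Copy an uploaded archive into the store under `name`."""
        dst = os.path.join(self.root, name)
        shutil.copy(src_path, dst)
        return dst

    def list(self):
        """Return the names of all stored datasets."""
        return sorted(os.listdir(self.root))

    def path(self, name):
        """Return the on-disk path of a stored dataset."""
        return os.path.join(self.root, name)
```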
