
scidata's People

Contributors

ahamez, cocoa-xu, goodhamgupta, jeantux, josevalim, msluszniak, nallwhy, seanmor5, t-rutten, wojtekmach


scidata's Issues

Use CA Store from Hex or OS?

Hello

Thanks for providing this package!

I am new to Elixir and somewhat confused that such a high-level package downloads its own CA bundle.

Is it possible to use the one Hex is using? Or the one provided by the operating system?

cacertfile: CAStore.file_path(),
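To make the suggestion concrete, here is a hedged sketch of the ssl options that could be handed to :httpc so peer verification uses Hex's bundled CA file instead of a downloaded one. It assumes the :castore package is a dependency; the option values are illustrative, not the package's current behavior.

```elixir
# Hypothetical sketch (assumes {:castore, "~> 1.0"} in deps): ssl options
# pointing :httpc at the CA bundle that ships with the :castore package.
ssl_opts = [
  verify: :verify_peer,
  cacertfile: CAStore.file_path(),
  depth: 3,
  customize_hostname_check: [
    match_fun: :public_key.pkix_verify_hostname_match_fun(:https)
  ]
]
```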

Improve HTTP client

Downloading a dataset with OTP >= 24 yields this warning:

12:27:18.691 [warning] Description: 'Authenticity is not established by certificate path validation'
     Reason: 'Option {verify, verify_peer} and cacertfile/cacerts is missing'

We should consider swapping :httpc for a different client to get reasonable SSL defaults (e.g. CA cert files). Two candidates for a new client are :req and :hackney. The former is a reasonable choice given the pipeline that manipulates a request/response tuple in Utils.get!.
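As a rough sketch of what the swap might look like (the module name is hypothetical, and this assumes :req as a dependency; Req verifies peers against a CA bundle by default):

```elixir
# Illustrative only: fetching a dataset URL with Req instead of :httpc.
# Req performs TLS peer verification out of the box via :castore.
defmodule Scidata.UtilsSketch do
  def get!(url) do
    %Req.Response{status: 200, body: body} = Req.get!(url)
    body
  end
end
```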

Add NLP related datasets

With the addition of RNNs in Axon, I'd like to start adding some examples of using them in practice. @t-rutten would you be open to adding some NLP related datasets to the repository?

Make IMDB Reviews dataset consistent

Currently the IMDB Reviews functions return a map and still support transforms. Is it possible to normalize them to return a result shaped like the other datasets', for example a tuple of {{input_binary, input_type, input_shape}, {label_binary, label_type, label_shape}}?
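A hedged sketch of what the normalization could look like. The map keys here are hypothetical stand-ins, not the real IMDB map layout, and the types/shapes are placeholders:

```elixir
# Hypothetical: convert a map-shaped dataset result into the
# {{input, type, shape}, {label, type, shape}} tuple used elsewhere.
normalize = fn %{review: input_binary, sentiment: label_binary} ->
  {{input_binary, {:u, 8}, {byte_size(input_binary)}},
   {label_binary, {:u, 8}, {byte_size(label_binary)}}}
end

{{input, _, _}, {label, _, _}} =
  normalize.(%{review: "great movie", sentiment: <<1>>})
```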

Input pipeline API

Related to discussion in #11 and would resolve elixir-nx/axon#25

I have been digging into input pipelines, specifically tf.data and torch.utils.data.DataLoader. I'm more familiar with tf.data, and it has a more intentionally functional design, so I'm mostly biased towards an API very similar to theirs. Here's a recent paper on tf.data as well.

The goal of an efficient input pipeline is to keep accelerators as busy as possible. Bottlenecks can appear in many places (large IO operations, data transfer to the GPU, slow input transformations, etc.), so it's important that any implementation be as performance-sensitive as possible. tf.data has some interesting benchmarks where they simulate an "infinitely fast" neural network to measure the absolute throughput of their API; I believe they achieve something like 13k images processed per second. It would be interesting to replicate some of these benchmarks, but I won't get ahead of myself.

Input pipelines can be characterized by three phases: extract, transform, and load. I'll briefly summarize each stage and its challenges.

Extract

This stage reads data from storage - think loading images from directories or streaming text from files. It's heavily IO-bound and slow, and because most practical datasets are massive, loading the entire dataset into memory is impractical.

Transform

This stage applies transformations and preprocessing to the input data - anything from image augmentation to applying masks, padding, etc. Most operations are compute-intensive, and because the accelerator should stay busy with the actual training, transformation work is most efficiently offloaded to the CPU.

Load

This stage actually loads data into accelerators. Transferring data from CPU to an accelerator can prove costly, but there are some tricks such as staging / prefetching input buffers to improve performance.

tf.data's main abstraction is the tf.data.Dataset, which represents a stateless input pipeline. The tf.data.Dataset is analogous to an Elixir Stream. It can be transformed with functions such as filter, flat_map, batch, map, reduce, etc., and these transformations are fused into a graph that can then be statically optimized. The input pipeline also offers the ability to "prefetch" data (stage for efficient transfer to accelerators), cache data so it's read from memory or faster storage later on, as well as "dynamic" optimizations that tune the "parallelism" and memory usage of the pipeline.
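As a tiny illustration of the analogy, the same shape of pipeline already falls out of Elixir's Stream; nothing runs until the stream is consumed, mirroring tf.data's deferred execution (the numbers are stand-ins for real data):

```elixir
# A lazy pipeline over Elixir's Stream: each step is deferred until
# the stream is actually consumed with Enum.
pipeline =
  1..10
  |> Stream.map(fn x -> x * 2 end)      # a "map" transformation
  |> Stream.filter(fn x -> x > 4 end)   # a "filter" transformation
  |> Stream.chunk_every(3)              # a crude "batch"

batches = Enum.to_list(pipeline)
# batches == [[6, 8, 10], [12, 14, 16], [18, 20]]
```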

Based on the considerations above, I propose we create an input pipeline abstraction very similar to tf.data based on Streams. Here are my initial thoughts:

First, we can define a struct that stores the actual input / label stream as well as metadata:

defstruct [:input, :label, :input_shape, :input_type, :label_shape, :label_type, :supervised]

So we can capture shape / type information if necessary. :supervised is true in cases where labels are present and false when they are not.

We'll then have a number of "extract" methods that return new pipelines from a variety of formats. It should also be trivial to create new "extract" methods, but we should cover the most common cases and ensure we have them as optimized as possible:

from_stream(stream, opts \\ []) :: pipeline
from_files(files, opts \\ []) :: pipeline
...
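A minimal sketch of what from_stream/2 might do. The struct fields come from the proposal above, but the module name, option handling, and defaults are all hypothetical:

```elixir
defmodule Pipeline do
  # Hypothetical pipeline struct from the proposal above.
  defstruct [:input, :label, :input_shape, :input_type,
             :label_shape, :label_type, supervised: false]

  # Wrap an existing input stream (and an optional label stream)
  # into a pipeline struct, capturing shape/type metadata from opts.
  def from_stream(input, opts \\ []) do
    label = Keyword.get(opts, :label)

    %Pipeline{
      input: input,
      label: label,
      input_shape: Keyword.get(opts, :input_shape),
      input_type: Keyword.get(opts, :input_type),
      supervised: label != nil
    }
  end
end

p = Pipeline.from_stream(Stream.cycle([0, 1]), label: Stream.cycle([:a]))
# p.supervised == true
```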

We'll also have a number of transformations. I think we might be able to do some static optimizations and fusions of our own, and we can ensure each transformation is jitted by default.

batch(pipeline, batch_size) :: pipeline
map(pipeline, map_fn) :: pipeline
filter(pipeline, filter_fn) :: pipeline
repeat(pipeline, repeat_size) :: pipeline
shuffle(pipeline, shuffle_size) :: pipeline
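Under the stream-backed design above, several of these reduce to thin wrappers over Stream. A hedged sketch (operating on a bare input stream rather than the full struct, for brevity; the buffered shuffle only approximates tf.data's shuffle):

```elixir
defmodule PipelineOps do
  # Hypothetical sketches: each transformation stays lazy by
  # delegating to the corresponding Stream function.
  def batch(stream, batch_size), do: Stream.chunk_every(stream, batch_size)
  def map(stream, map_fn), do: Stream.map(stream, map_fn)
  def filter(stream, filter_fn), do: Stream.filter(stream, filter_fn)

  # Sketch: cycle the stream indefinitely (ignoring repeat_size).
  def repeat(stream, _repeat_size), do: Stream.cycle(stream)

  # Sketch: shuffle within a buffer of `shuffle_size` elements,
  # roughly like tf.data's windowed shuffle.
  def shuffle(stream, shuffle_size) do
    stream
    |> Stream.chunk_every(shuffle_size)
    |> Stream.flat_map(&Enum.shuffle/1)
  end
end

out = 1..6 |> PipelineOps.map(&(&1 * 10)) |> PipelineOps.batch(2) |> Enum.to_list()
# out == [[10, 20], [30, 40], [50, 60]]
```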

We'll also want some "performance" based functions, although I haven't really thought about how these can be most efficiently implemented:

prefetch(pipeline, size) :: pipeline
cache(pipeline, size) :: pipeline
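One possible (untuned) interpretation of prefetch/2, using Task.async_stream to keep up to `size` elements in flight ahead of the consumer. This is a hedged sketch of the idea, not a claim about the eventual design; it only pays off when producing each element is itself expensive:

```elixir
defmodule Prefetch do
  # Hypothetical prefetch: materialize upcoming stream elements
  # concurrently, buffering up to `size` results ahead of whoever
  # consumes the pipeline, while preserving element order.
  def prefetch(stream, size) do
    stream
    |> Task.async_stream(fn x -> x end,
      max_concurrency: size,
      ordered: true
    )
    |> Stream.map(fn {:ok, x} -> x end)
  end
end

result = 1..4 |> Prefetch.prefetch(2) |> Enum.to_list()
# result == [1, 2, 3, 4]
```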

And ways to lazily iterate through the dataset, although this can most likely be done directly on the input and label streams. This kind of application IMO suits Elixir very well. I'm not really experienced with GenStage, Broadway, Flow, etc., so maybe they are useful here and somebody else can comment, or maybe they are irrelevant and I'll not bring them up again :)

I believe the responsibility of building out this API probably best fits in this library, unless we want to limit the purpose of Scidata to just focusing on datasets and move the pipeline logic elsewhere. Pending any feedback, I can start putting some basic things together in a PR and then begin working on integrating it with Axon.

Iris and Wine dataset download fails with Bad Certificate

I don't know if there's much you can do about it, but trying to download the Iris or Wine datasets fails with a bad certificate error. All the other datasets download correctly.
I think there might actually be an issue with the certificate at archive.ics.uci.edu (analysis here).

I created a clean empty elixir project with scidata as a single dependency and got the following errors:

iex(3)> Scidata.Iris.download           

15:06:37.693 [notice] TLS :client: In state :wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate

** (RuntimeError) {:failed_connect, [{:to_address, {~c"archive.ics.uci.edu", 443}}, {:inet, [:inet], {:tls_alert, {:bad_certificate, ~c"TLS client: In state wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate\n"}}}]}
    (scidata 0.1.10) lib/scidata/utils.ex:54: Scidata.Utils.run!/1
    (scidata 0.1.10) lib/scidata/utils.ex:12: Scidata.Utils.get!/2
    (scidata 0.1.10) lib/scidata/iris.ex:50: Scidata.Iris.download/1
    iex:3: (file)
iex(3)> Scidata.Wine.download

15:08:45.423 [notice] TLS :client: In state :wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate

** (RuntimeError) {:failed_connect, [{:to_address, {~c"archive.ics.uci.edu", 443}}, {:inet, [:inet], {:tls_alert, {:bad_certificate, ~c"TLS client: In state wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate\n"}}}]}
    (scidata 0.1.10) lib/scidata/utils.ex:54: Scidata.Utils.run!/1
    (scidata 0.1.10) lib/scidata/utils.ex:12: Scidata.Utils.get!/2
    (scidata 0.1.10) lib/scidata/wine.ex:59: Scidata.Wine.download/1
    iex:3: (file)

`StbImage` is undefined

Hey 👋

I was trying to fiddle with Nx and Axon today, but it seems like the new 0.1.6 release (released just today) is broken due to a missing struct called StbImage.

== Compilation error in file lib/scidata/caltech101.ex ==
** (CompileError) lib/scidata/caltech101.ex:153: StbImage.__struct__/0 is undefined, cannot expand struct StbImage. Make sure the struct name is correct. If the struct name exists and is correct but it still cannot be found, you likely have cyclic module usage in your code
    (elixir 1.13.4) expanding macro: Kernel.|>/2

Thank you for the great work 👍

Update stb_image to 0.4

The API has changed since then. I think we can return the underlying StbImage structs from v0.4!
