elixir-nx / scidata
Download and normalize datasets related to science.
License: Apache License 2.0
Hello
Thanks for providing this package!
I am new to Elixir and somewhat confused that such a high-level package downloads its own CA bundle.
Is it possible to use the one Hex is using? Or the one provided by the operating system?
(See line 40 in commit ea7a488.)
Downloading a dataset with OTP >= 24 yields this warning:
12:27:18.691 [warning] Description: 'Authenticity is not established by certificate path validation'
Reason: 'Option {verify, verify_peer} and cacertfile/cacerts is missing'
We should consider swapping `:httpc` for a different client to get reasonable SSL defaults (e.g. CA cert files). Two candidates for a new client are `:req` and `:hackney`. The former is a reasonable choice given the pipeline that manipulates a request/response tuple in `Utils.get!`.
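As a hedged sketch (not scidata's actual code), here are two ways to get peer verification: let Req use its TLS defaults, or keep `:httpc` but pass explicit `ssl` options. On OTP >= 25, `:public_key.cacerts_get/0` loads the OS trust store, which also answers the question above about reusing the operating system's certificates:

```elixir
# Option 1: Req ships with sane TLS defaults (verify_peer + a CA bundle):
#
#     Req.get!("https://example.com/data.csv")
#
# Option 2: keep :httpc but pass explicit ssl options.
{:ok, _} = Application.ensure_all_started(:ssl)
{:ok, _} = Application.ensure_all_started(:inets)

# :public_key.cacerts_get/0 needs OTP >= 25; fall back to [] for illustration.
cacerts =
  try do
    :public_key.cacerts_get()
  rescue
    _ -> []
  end

ssl_opts = [
  verify: :verify_peer,
  cacerts: cacerts,
  depth: 3,
  customize_hostname_check: [
    match_fun: :public_key.pkix_verify_hostname_match_fun(:https)
  ]
]

# A request would then look like:
#
#     :httpc.request(:get, {~c"https://example.com", []}, [ssl: ssl_opts], [])
```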
UCI updated their website today, and the links to the Iris and Wine datasets are no longer valid.
With the addition of RNNs in Axon, I'd like to start adding some examples of using them in practice. @t-rutten would you be open to adding some NLP related datasets to the repository?
https://huggingface.co/ has been quite popular lately. Maybe something to look at for scidata / axon.
Currently they return a map and still support transforms. Is it possible to normalize them to the same result shape as the other datasets? For example, return a tuple of `{{input_binary, input_type, input_shape}, {label_binary, label_type, label_shape}}`?
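To make the proposal concrete, a hypothetical normalizer from a map-shaped result to the proposed tuple shape might look like this (the map keys and helper name are assumptions for illustration, not scidata's API):

```elixir
defmodule NormalizeSketch do
  # Convert a map-shaped dataset result into the proposed
  # {{input_binary, input_type, input_shape}, {label_binary, label_type, label_shape}}
  # tuple. The :image / :label keys are illustrative assumptions.
  def to_tuple(%{image: {ibin, itype, ishape}, label: {lbin, ltype, lshape}}) do
    {{ibin, itype, ishape}, {lbin, ltype, lshape}}
  end
end

# A tiny fake dataset: four u8 pixels in a 2x2 grid, two u8 labels.
fake = %{
  image: {<<0, 1, 2, 3>>, {:u, 8}, {2, 2}},
  label: {<<0, 1>>, {:u, 8}, {2}}
}

{{input_binary, input_type, input_shape}, {label_binary, _ltype, _lshape}} =
  NormalizeSketch.to_tuple(fake)
```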
Related to discussion in #11 and would resolve elixir-nx/axon#25
I have been digging into input pipelines, specifically `tf.data` and `torch.utils.data.DataLoader`. I'm more familiar with `tf.data`, and it has a more intentionally functional pattern, so I'm mostly biased towards an API very similar to theirs. There's also a recent paper on `tf.data`.
The goal of an efficient input pipeline is to keep accelerators as busy as possible. Bottlenecks can happen in a lot of places (large IO operations, data transfer to GPU, slow input transformations, etc.), so it's important that any implementation be as performance-sensitive as possible. `tf.data` has some interesting benchmarks where they simulate an "infinitely fast" neural network to measure the absolute throughput of their API, and I believe they achieve something like 13k images processed per second. It would be interesting to replicate some of these benchmarks, but I won't get ahead of myself.
Input pipelines can be characterized in 3 phases: Extract, Transform, Load. I'll briefly summarize the stages and their challenges.
**Extract.** This stage reads data from storage: think loading images from directories or streaming text from files. It's heavily IO-bound and slow, and because most practical datasets are massive, loading the entire dataset into memory is impractical.
**Transform.** This stage applies transformations and preprocessing to the input data. This could be anything from image augmentation to applying masks, padding, etc. Most operations are compute-intensive, and because the accelerator should stay busy doing the actual training, transformation work is most efficiently offloaded to the CPU.
**Load.** This stage actually loads data into accelerators. Transferring data from CPU to an accelerator can prove costly, but there are some tricks such as staging / prefetching input buffers to improve performance.
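The three phases can be sketched as a single lazy pipeline. This is a toy example with in-memory data standing in for real storage; all names are illustrative:

```elixir
# Extract: lazily produce {id, feature} records, standing in for reads
# from disk or network storage.
extracted = Stream.map(1..8, fn i -> {i, i * 10} end)

# Transform: CPU-side preprocessing, here scaling the feature into [0, 1].
transformed = Stream.map(extracted, fn {id, feature} -> {id, feature / 80} end)

# Load: realize the stream; a real pipeline would stage batches onto
# the accelerator here instead of building a list.
loaded = Enum.to_list(transformed)
```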
The main abstraction in `tf.data` is the `tf.data.Dataset`, which represents a stateless input pipeline. The `tf.data.Dataset` is analogous to an Elixir `Stream`. It can be transformed with functions such as `filter`, `flat_map`, `batch`, `map`, `reduce`, etc., and these transformations are fused into a graph that can then be statically optimized. The input pipeline also offers the ability to "prefetch" data (stage it for efficient transfer to accelerators), cache data so it's read from memory or faster storage later on, as well as "dynamic" optimizations that tune the parallelism and memory usage of the pipeline.
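For intuition, here's a rough `Stream` analogue of a small `tf.data` pipeline (filter, then map, then batch), with the caveat that Streams are evaluated lazily element-by-element rather than fused and statically optimized:

```elixir
# Analogue of dataset.filter(even).map(square).batch(2) over 1..10:
batches =
  1..10
  |> Stream.filter(&(rem(&1, 2) == 0))  # keep even numbers: 2, 4, 6, 8, 10
  |> Stream.map(&(&1 * &1))             # square each element
  |> Stream.chunk_every(2)              # group into batches of size 2
  |> Enum.to_list()

# batches == [[4, 16], [36, 64], [100]]
```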
Based on the considerations above, I propose we create an input pipeline abstraction very similar to `tf.data`, based on Streams. Here are my initial thoughts:
First, we can define a struct that stores the actual input / label stream as well as metadata:
defstruct [:input, :label, :input_shape, :input_type, :label_shape, :label_type, :supervised]
So we can capture shape / type information if necessary. `:supervised` is `true` in cases where labels are present and `false` when they are not.
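As a sketch, the struct and an unsupervised instance might look like this (the module name `Pipeline` is just for illustration):

```elixir
defmodule Pipeline do
  defstruct [:input, :label, :input_shape, :input_type,
             :label_shape, :label_type, :supervised]
end

# An unsupervised pipeline carries inputs but no labels:
unsupervised = %Pipeline{
  input: Stream.map(1..4, &(&1 / 4)),
  input_shape: {1},
  input_type: {:f, 32},
  supervised: false
}
```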
We'll then have a number of "extract" methods that return new pipelines from a variety of formats. It should also be trivial to create new "extract" methods, but we should cover the most common cases and ensure we have them as optimized as possible:
from_stream(stream, opts \\ []) :: pipeline
from_files(files, opts \\ []) :: pipeline
...
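A minimal `from_stream/2` could just wrap the stream and record whether labels were supplied. This is a hypothetical sketch, not scidata's API:

```elixir
defmodule ExtractSketch do
  defstruct [:input, :label, :supervised]

  # Wrap an existing stream; passing an optional :label stream
  # makes the resulting pipeline supervised.
  def from_stream(stream, opts \\ []) do
    %__MODULE__{
      input: stream,
      label: Keyword.get(opts, :label),
      supervised: Keyword.has_key?(opts, :label)
    }
  end
end

supervised = ExtractSketch.from_stream(Stream.cycle([1, 2]), label: Stream.cycle([0]))
unsupervised = ExtractSketch.from_stream(Stream.cycle([1, 2]))
```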
We'll also have a number of transformations. I think we might be able to do some static optimizations and fusions of our own, and we can ensure each transformation is jitted by default.
batch(pipeline, batch_size) :: pipeline
map(pipeline, map_fn) :: pipeline
filter(pipeline, filter_fn) :: pipeline
repeat(pipeline, repeat_size) :: pipeline
shuffle(pipeline, shuffle_size) :: pipeline
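Two of these transformations fall out of `Stream` almost directly. The sketches below are illustrative implementations over a raw stream (a real version would operate on the pipeline struct):

```elixir
defmodule TransformSketch do
  # batch/2: group elements into fixed-size chunks.
  def batch(pipeline, batch_size), do: Stream.chunk_every(pipeline, batch_size)

  # shuffle/2: windowed shuffle, as tf.data does it: fill a buffer of
  # `shuffle_size` elements, shuffle it, emit, and move on. A perfect
  # shuffle would require the whole dataset in memory.
  def shuffle(pipeline, shuffle_size) do
    pipeline
    |> Stream.chunk_every(shuffle_size)
    |> Stream.flat_map(&Enum.shuffle/1)
  end
end

batched = TransformSketch.batch(1..6, 2) |> Enum.to_list()
shuffled = TransformSketch.shuffle(1..6, 3) |> Enum.to_list()
# batched == [[1, 2], [3, 4], [5, 6]]; shuffled is a permutation of 1..6
```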
We'll also want some "performance" based functions, although I haven't really thought about how these can be most efficiently implemented:
prefetch(pipeline, size) :: pipeline
cache(pipeline, size) :: pipeline
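One crude way to approximate `prefetch/2` is `Task.async_stream`, which lets up to `size` upstream elements be produced concurrently ahead of the consumer. This is a sketch only; a real implementation would more likely put a dedicated buffering process between producer and trainer:

```elixir
defmodule PrefetchSketch do
  # prefetch/2: allow up to `size` elements to be computed ahead of the
  # consumer. `ordered: true` preserves the original element order.
  def prefetch(pipeline, size) do
    pipeline
    |> Task.async_stream(fn x -> x end, max_concurrency: size, ordered: true)
    |> Stream.map(fn {:ok, x} -> x end)
  end
end

out = PrefetchSketch.prefetch(1..5, 2) |> Enum.to_list()
# out == [1, 2, 3, 4, 5] (order is preserved)
```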
And ways to lazily iterate through the dataset, although this can most likely be done directly on the input and label streams. This kind of application IMO suits Elixir very well. I'm not really experienced with `GenStage`, `Broadway`, `Flow`, etc., so maybe they are useful here and somebody else can comment, or maybe they are irrelevant and I'll not bring them up again :)
I believe the responsibility of building out this API probably best fits in this library, unless we want to limit the purpose of Scidata to just focusing on datasets and move the pipeline logic elsewhere. Pending any feedback, I can start putting some basic things together in a PR, and then begin working on integrating it with Axon.
I don't know if there's much you can do about it but trying to download the Iris or Wine datasets fails with a bad certificate error. All the other datasets download correctly.
I think there might actually be an issue with the certificate at archive.ics.uci.edu (analysis here).
I created a clean empty elixir project with scidata as a single dependency and got the following errors:
iex(3)> Scidata.Iris.download
15:06:37.693 [notice] TLS :client: In state :wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate
** (RuntimeError) {:failed_connect, [{:to_address, {~c"archive.ics.uci.edu", 443}}, {:inet, [:inet], {:tls_alert, {:bad_certificate, ~c"TLS client: In state wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate\n"}}}]}
(scidata 0.1.10) lib/scidata/utils.ex:54: Scidata.Utils.run!/1
(scidata 0.1.10) lib/scidata/utils.ex:12: Scidata.Utils.get!/2
(scidata 0.1.10) lib/scidata/iris.ex:50: Scidata.Iris.download/1
iex:3: (file)
iex(3)> Scidata.Wine.download
15:08:45.423 [notice] TLS :client: In state :wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate
** (RuntimeError) {:failed_connect, [{:to_address, {~c"archive.ics.uci.edu", 443}}, {:inet, [:inet], {:tls_alert, {:bad_certificate, ~c"TLS client: In state wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate\n"}}}]}
(scidata 0.1.10) lib/scidata/utils.ex:54: Scidata.Utils.run!/1
(scidata 0.1.10) lib/scidata/utils.ex:12: Scidata.Utils.get!/2
(scidata 0.1.10) lib/scidata/wine.ex:59: Scidata.Wine.download/1
iex:3: (file)
Hey! I was trying to fiddle with Nx and Axon today, but it seems the new 0.1.6 release (released just today) is broken due to a missing struct called `StbImage`.
== Compilation error in file lib/scidata/caltech101.ex ==
** (CompileError) lib/scidata/caltech101.ex:153: StbImage.__struct__/0 is undefined, cannot expand struct StbImage. Make sure the struct name is correct. If the struct name exists and is correct but it still cannot be found, you likely have cyclic module usage in your code
(elixir 1.13.4) expanding macro: Kernel.|>/2
Thank you for the great work!
We'd like to add datasets like those available through PyTorch, Tensorflow, Hugging Face, and scikit-learn. Here's a non-comprehensive list to get started:
The API has changed since then. I think we can return the underlying StbImage structs from v0.4!