elixir-nx / scidata
Download and normalize datasets related to science.
License: Apache License 2.0
Hello
Thanks for providing this package!
I am new to Elixir and somewhat confused that such a high-level package downloads its own CA bundle.
Is it possible to use the one Hex is using? Or the one provided by the operating system?
(See line 40 in commit ea7a488.)
Downloading a dataset with OTP >= 24 yields this warning:
12:27:18.691 [warning] Description: 'Authenticity is not established by certificate path validation'
Reason: 'Option {verify, verify_peer} and cacertfile/cacerts is missing'
We should consider swapping `:httpc` for a different client to get reasonable SSL defaults (e.g. CA cert files). Two candidates for a new client are `:req` and `:hackney`. The former is a reasonable choice given the pipeline that manipulates a request/response tuple in `Utils.get!`.
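As a hedged sketch (not scidata's actual code), here are two ways to get peer verification: let Req use its TLS defaults, or keep `:httpc` but pass explicit `ssl` options. On OTP >= 25, `:public_key.cacerts_get/0` loads the OS trust store, which also answers the question above about reusing the operating system's certificates:

```elixir
# Option 1: Req ships with sane TLS defaults (verify_peer + a CA bundle):
#
#     Req.get!("https://example.com/data.csv")
#
# Option 2: keep :httpc but pass explicit ssl options.
{:ok, _} = Application.ensure_all_started(:ssl)
{:ok, _} = Application.ensure_all_started(:inets)

# :public_key.cacerts_get/0 needs OTP >= 25; fall back to [] for illustration.
cacerts =
  try do
    :public_key.cacerts_get()
  rescue
    _ -> []
  end

ssl_opts = [
  verify: :verify_peer,
  cacerts: cacerts,
  depth: 3,
  customize_hostname_check: [
    match_fun: :public_key.pkix_verify_hostname_match_fun(:https)
  ]
]

# A request would then look like:
#
#     :httpc.request(:get, {~c"https://example.com", []}, [ssl: ssl_opts], [])
```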
UCI updated their website today, and the links to the Iris and Wine datasets are no longer valid.
With the addition of RNNs in Axon, I'd like to start adding some examples of using them in practice. @t-rutten would you be open to adding some NLP related datasets to the repository?
https://huggingface.co/ has been quite popular lately. Maybe something to look at for scidata / axon.
Currently they return a map and still support transforms. Is it possible to normalize them to the same result shape as the other datasets? For example, return a tuple of `{{input_binary, input_type, input_shape}, {label_binary, label_type, label_shape}}`?
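To make the proposal concrete, a hypothetical normalizer from a map-shaped result to the proposed tuple shape might look like this (the map keys and helper name are assumptions for illustration, not scidata's API):

```elixir
defmodule NormalizeSketch do
  # Convert a map-shaped dataset result into the proposed
  # {{input_binary, input_type, input_shape}, {label_binary, label_type, label_shape}}
  # tuple. The :image / :label keys are illustrative assumptions.
  def to_tuple(%{image: {ibin, itype, ishape}, label: {lbin, ltype, lshape}}) do
    {{ibin, itype, ishape}, {lbin, ltype, lshape}}
  end
end

# A tiny fake dataset: four u8 pixels in a 2x2 grid, two u8 labels.
fake = %{
  image: {<<0, 1, 2, 3>>, {:u, 8}, {2, 2}},
  label: {<<0, 1>>, {:u, 8}, {2}}
}

{{input_binary, input_type, input_shape}, {label_binary, _ltype, _lshape}} =
  NormalizeSketch.to_tuple(fake)
```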
Related to discussion in #11 and would resolve elixir-nx/axon#25
I have been digging into input pipelines, specifically `tf.data` and `torch.utils.data.DataLoader`. I'm more familiar with `tf.data`, and it has a more intentionally functional pattern, so I'm mostly biased towards an API very similar to theirs. There's also a recent paper on `tf.data`.
The goal of an efficient input pipeline is to keep accelerators as busy as possible. Bottlenecks can happen in a lot of places (large IO operations, data transfer to GPU, slow input transformations, etc.), so it's important that any implementation be as performance-sensitive as possible. `tf.data` has some interesting benchmarks where they simulate an "infinitely fast" neural network to measure the absolute throughput of their API, and I believe they achieve something like 13k images processed per second. It would be interesting to replicate some of these benchmarks, but I won't get ahead of myself.
Input pipelines can be characterized in 3 phases: Extract, Transform, Load. I'll briefly summarize the stages and their challenges.
**Extract.** This stage reads data from storage: think loading images from directories or streaming text from files. It's heavily IO-bound and slow, and because most practical datasets are massive, loading the entire dataset into memory is impractical.
**Transform.** This stage applies transformations and preprocessing to the input data. This could be anything from image augmentation to applying masks, padding, etc. Most operations are compute-intensive, and because the accelerator should stay busy doing the actual training, transformation work is most efficiently offloaded to the CPU.
**Load.** This stage actually loads data into accelerators. Transferring data from CPU to an accelerator can prove costly, but there are some tricks such as staging / prefetching input buffers to improve performance.
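The three phases can be sketched as a single lazy pipeline. This is a toy example with in-memory data standing in for real storage; all names are illustrative:

```elixir
# Extract: lazily produce {id, feature} records, standing in for reads
# from disk or network storage.
extracted = Stream.map(1..8, fn i -> {i, i * 10} end)

# Transform: CPU-side preprocessing, here scaling the feature into [0, 1].
transformed = Stream.map(extracted, fn {id, feature} -> {id, feature / 80} end)

# Load: realize the stream; a real pipeline would stage batches onto
# the accelerator here instead of building a list.
loaded = Enum.to_list(transformed)
```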
The main abstraction in `tf.data` is the `tf.data.Dataset`, which represents a stateless input pipeline. The `tf.data.Dataset` is analogous to an Elixir `Stream`. It can be transformed with functions such as `filter`, `flat_map`, `batch`, `map`, `reduce`, etc., and these transformations are fused into a graph that can then be statically optimized. The input pipeline also offers the ability to "prefetch" data (stage it for efficient transfer to accelerators), cache data so it's read from memory or faster storage later on, as well as "dynamic" optimizations that tune the parallelism and memory usage of the pipeline.
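For intuition, here's a rough `Stream` analogue of a small `tf.data` pipeline (filter, then map, then batch), with the caveat that Streams are evaluated lazily element-by-element rather than fused and statically optimized:

```elixir
# Analogue of dataset.filter(even).map(square).batch(2) over 1..10:
batches =
  1..10
  |> Stream.filter(&(rem(&1, 2) == 0))  # keep even numbers: 2, 4, 6, 8, 10
  |> Stream.map(&(&1 * &1))             # square each element
  |> Stream.chunk_every(2)              # group into batches of size 2
  |> Enum.to_list()

# batches == [[4, 16], [36, 64], [100]]
```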
Based on the considerations above, I propose we create an input pipeline abstraction very similar to `tf.data`, based on Streams. Here are my initial thoughts:
First, we can define a struct that stores the actual input / label stream as well as metadata:
defstruct [:input, :label, :input_shape, :input_type, :label_shape, :label_type, :supervised]
So we can capture shape / type information if necessary. `:supervised` is `true` in cases where labels are present and `false` when they are not.
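As a sketch, the struct and an unsupervised instance might look like this (the module name `Pipeline` is just for illustration):

```elixir
defmodule Pipeline do
  defstruct [:input, :label, :input_shape, :input_type,
             :label_shape, :label_type, :supervised]
end

# An unsupervised pipeline carries inputs but no labels:
unsupervised = %Pipeline{
  input: Stream.map(1..4, &(&1 / 4)),
  input_shape: {1},
  input_type: {:f, 32},
  supervised: false
}
```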
We'll then have a number of "extract" methods that return new pipelines from a variety of formats. It should also be trivial to create new "extract" methods, but we should cover the most common cases and ensure we have them as optimized as possible:
from_stream(stream, opts \\ []) :: pipeline
from_files(files, opts \\ []) :: pipeline
...
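A minimal `from_stream/2` could just wrap the stream and record whether labels were supplied. This is a hypothetical sketch, not scidata's API:

```elixir
defmodule ExtractSketch do
  defstruct [:input, :label, :supervised]

  # Wrap an existing stream; passing an optional :label stream
  # makes the resulting pipeline supervised.
  def from_stream(stream, opts \\ []) do
    %__MODULE__{
      input: stream,
      label: Keyword.get(opts, :label),
      supervised: Keyword.has_key?(opts, :label)
    }
  end
end

supervised = ExtractSketch.from_stream(Stream.cycle([1, 2]), label: Stream.cycle([0]))
unsupervised = ExtractSketch.from_stream(Stream.cycle([1, 2]))
```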
We'll also have a number of transformations. I think we might be able to do some static optimizations and fusions of our own, and we can ensure each transformation is jitted by default.
batch(pipeline, batch_size) :: pipeline
map(pipeline, map_fn) :: pipeline
filter(pipeline, filter_fn) :: pipeline
repeat(pipeline, repeat_size) :: pipeline
shuffle(pipeline, shuffle_size) :: pipeline
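Two of these transformations fall out of `Stream` almost directly. The sketches below are illustrative implementations over a raw stream (a real version would operate on the pipeline struct):

```elixir
defmodule TransformSketch do
  # batch/2: group elements into fixed-size chunks.
  def batch(pipeline, batch_size), do: Stream.chunk_every(pipeline, batch_size)

  # shuffle/2: windowed shuffle, as tf.data does it: fill a buffer of
  # `shuffle_size` elements, shuffle it, emit, and move on. A perfect
  # shuffle would require the whole dataset in memory.
  def shuffle(pipeline, shuffle_size) do
    pipeline
    |> Stream.chunk_every(shuffle_size)
    |> Stream.flat_map(&Enum.shuffle/1)
  end
end

batched = TransformSketch.batch(1..6, 2) |> Enum.to_list()
shuffled = TransformSketch.shuffle(1..6, 3) |> Enum.to_list()
# batched == [[1, 2], [3, 4], [5, 6]]; shuffled is a permutation of 1..6
```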
We'll also want some "performance" based functions, although I haven't really thought about how these can be most efficiently implemented:
prefetch(pipeline, size) :: pipeline
cache(pipeline, size) :: pipeline
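One crude way to approximate `prefetch/2` is `Task.async_stream`, which lets up to `size` upstream elements be produced concurrently ahead of the consumer. This is a sketch only; a real implementation would more likely put a dedicated buffering process between producer and trainer:

```elixir
defmodule PrefetchSketch do
  # prefetch/2: allow up to `size` elements to be computed ahead of the
  # consumer. `ordered: true` preserves the original element order.
  def prefetch(pipeline, size) do
    pipeline
    |> Task.async_stream(fn x -> x end, max_concurrency: size, ordered: true)
    |> Stream.map(fn {:ok, x} -> x end)
  end
end

out = PrefetchSketch.prefetch(1..5, 2) |> Enum.to_list()
# out == [1, 2, 3, 4, 5] (order is preserved)
```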
And ways to lazily iterate through the dataset, although this can most likely be done directly on the input and label streams. This kind of application IMO suits Elixir very well. I'm not really experienced with `GenStage`, `Broadway`, `Flow`, etc., so maybe they are useful here and somebody else can comment, or maybe they are irrelevant and I'll not bring them up again :)
I believe the responsibility of building out this API probably best fits in this library, unless we want to limit the purpose of Scidata to just focusing on datasets and move the pipeline logic elsewhere. Pending any feedback, I can start putting some basic things together in a PR, and then begin working on integrating it with Axon.
I don't know if there's much you can do about it but trying to download the Iris or Wine datasets fails with a bad certificate error. All the other datasets download correctly.
I think there might actually be an issue with the certificate at archive.ics.uci.edu (analysis here).
I created a clean empty elixir project with scidata as a single dependency and got the following errors:
iex(3)> Scidata.Iris.download
15:06:37.693 [notice] TLS :client: In state :wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate
** (RuntimeError) {:failed_connect, [{:to_address, {~c"archive.ics.uci.edu", 443}}, {:inet, [:inet], {:tls_alert, {:bad_certificate, ~c"TLS client: In state wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate\n"}}}]}
(scidata 0.1.10) lib/scidata/utils.ex:54: Scidata.Utils.run!/1
(scidata 0.1.10) lib/scidata/utils.ex:12: Scidata.Utils.get!/2
(scidata 0.1.10) lib/scidata/iris.ex:50: Scidata.Iris.download/1
iex:3: (file)
iex(3)> Scidata.Wine.download
15:08:45.423 [notice] TLS :client: In state :wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate
** (RuntimeError) {:failed_connect, [{:to_address, {~c"archive.ics.uci.edu", 443}}, {:inet, [:inet], {:tls_alert, {:bad_certificate, ~c"TLS client: In state wait_cert_cr at ssl_handshake.erl:2126 generated CLIENT ALERT: Fatal - Bad Certificate\n"}}}]}
(scidata 0.1.10) lib/scidata/utils.ex:54: Scidata.Utils.run!/1
(scidata 0.1.10) lib/scidata/utils.ex:12: Scidata.Utils.get!/2
(scidata 0.1.10) lib/scidata/wine.ex:59: Scidata.Wine.download/1
iex:3: (file)
Hey! I was trying to fiddle with Nx and Axon today, but it seems the new 0.1.6 release (released just today) is broken due to a missing struct called `StbImage`.
== Compilation error in file lib/scidata/caltech101.ex ==
** (CompileError) lib/scidata/caltech101.ex:153: StbImage.__struct__/0 is undefined, cannot expand struct StbImage. Make sure the struct name is correct. If the struct name exists and is correct but it still cannot be found, you likely have cyclic module usage in your code
(elixir 1.13.4) expanding macro: Kernel.|>/2
Thank you for the great work!
We'd like to add datasets like those available through PyTorch, Tensorflow, Hugging Face, and scikit-learn. Here's a non-comprehensive list to get started:
The API has changed since then. I think we can return the underlying StbImage structs from v0.4!