juliaml / mldatasets.jl
Utility package for accessing common Machine Learning datasets in Julia
Home Page: https://juliaml.github.io/MLDatasets.jl/stable
License: MIT License
One of the most famous datasets for a beginner to start with is the Titanic dataset, which is used for exploratory data analysis and for predicting outcomes with logistic regression, decision trees, random forests, etc. I think this dataset should be added so that beginners can easily get started with machine learning in Julia using a beginner-friendly dataset. If approved, I am willing to work on this issue, as it would be a great addition alongside the other well-known datasets.
julia> MNIST.traindata()
This program has requested access to the data dependency MNIST.
which is not currently installed. It can be installed automatically, and you will not see this message again.
Do you want to download the dataset from ["https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz"] to "/Users/anthony/.julia/datadeps/MNIST"?
[y/n]
This kind of interactive behaviour can be a headache when loading data in an automated setting, as one does not know ahead of time whether a prompt will appear. See, for example, this issue:
FluxML/MLJFlux.jl#141 (comment)
Could we perhaps have an optional kwarg, as in MNIST.training_data(force=true)? Or is there already a way to do this?
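For reference, DataDeps.jl documents an environment variable that suppresses exactly this prompt; setting it before the first download call makes loading non-interactive. A minimal sketch:

```julia
# DataDeps.jl skips the [y/n] prompt when this variable is set,
# which is what automated settings like CI need.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

# After this, e.g. MNIST.traindata() downloads without asking.
```

This has to run before the download is triggered, e.g. at the top of a CI script.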
Some of the features of the OGBDataset are downloaded as torch tensors stored in the ".pt" format. They are currently ignored, but we could load them using Pickle.jl (e.g. see this comment).
Hello,
I've been trying to use the package for the first time:
julia> using MLDatasets
julia> MNIST.traindata(1)
The function tries to download the data from http://yann.lecun.com/exdb/mnist/, but it seems the website is down.
In a fresh Julia 1.7 session:
julia> @time using MLDatasets
13.485246 seconds (20.31 M allocations: 1.158 GiB, 7.56% gc time, 61.69% compilation time)
Is there a way to conditionally import packages?
julia> for pkg in [:ImageCore, :CSV, :HDF5, :JLD2, :JSON3]; print(pkg); @time @eval using $pkg; end
ImageCore 2.141235 seconds (3.02 M allocations: 200.377 MiB, 4.50% gc time, 32.08% compilation time)
CSV 3.817959 seconds (6.14 M allocations: 348.493 MiB, 9.81% gc time, 90.11% compilation time)
HDF5 0.723358 seconds (1.34 M allocations: 73.225 MiB, 1.69% gc time, 93.80% compilation time)
JLD2 1.139716 seconds (1.36 M allocations: 78.966 MiB, 3.95% gc time, 60.77% compilation time)
JSON3 0.033367 seconds (49.09 k allocations: 3.014 MiB)
julia> for pkg in [:DataFrames, :MLUtils, :Pickle, :NPZ, :MAT]; print(pkg); @time @eval using $pkg; end
DataFrames 1.789793 seconds (2.03 M allocations: 137.197 MiB, 4.63% gc time)
MLUtils 1.743072 seconds (2.07 M allocations: 117.900 MiB, 4.83% gc time, 47.32% compilation time)
Pickle 0.130685 seconds (159.17 k allocations: 9.751 MiB, 17.77% compilation time)
NPZ 0.504406 seconds (1.19 M allocations: 61.838 MiB, 4.05% gc time, 98.87% compilation time)
MAT 0.009792 seconds (22.84 k allocations: 1.044 MiB)
Related discourse thread
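One way to attack the load time is to defer heavy dependencies until first use. A minimal stdlib-only sketch of the idea (the `lazyload` helper is hypothetical, and `Dates` stands in for a heavy package such as CSV or HDF5):

```julia
# Lazy-loading sketch: import a module the first time it is requested
# instead of at package load time. Assumes the package is already
# installed in the active environment.
function lazyload(pkg::Symbol)::Module
    isdefined(Main, pkg) || @eval Main import $pkg
    return getfield(Main, pkg)
end

D = lazyload(:Dates)   # the import cost is paid here, on first use
```

Requires.jl offers a more principled version of this pattern via `@require` hooks in a package's `__init__`.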
Ref: #57 (comment)
We currently use DataDeps as an interface to download datasets from their original websites. While this makes the license and source clear, it can be bad for reproducibility, because users worldwide may have difficulty connecting to the original sites, and the sites themselves may go offline for various reasons, e.g., #57.
To avoid issues like #57 in the future, and to speed up dataset downloading, we could take advantage of Julia's Artifacts system and let the Pkg/Storage servers host and distribute the datasets. MLDatasets doesn't hold large datasets, so this adds little stress to the Julia ecosystem.
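As a sketch of what this could look like, an Artifacts.toml entry for MNIST might read as follows (the hash values and URL are placeholders, not real):

```toml
[mnist]
git-tree-sha1 = "0000000000000000000000000000000000000000"
lazy = true

    [[mnist.download]]
    url = "https://example.org/mnist.tar.gz"
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000"
```

The package would then access the files with the `artifact"mnist"` string macro (from Artifacts/LazyArtifacts), and Pkg servers would mirror the tarball automatically.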
using MLDatasets
tx,ty = MNIST.traindata()
GZip.ZError(-5, "buffer error")
Stacktrace:
[1] close(s::GZip.GZipStream)
@ GZip ~/.julia/packages/GZip/JNmGn/src/GZip.jl:163
[2] gzopen(::MLDatasets.MNIST.Reader.var"#5#6", ::String, ::String)
@ GZip ~/.julia/packages/GZip/JNmGn/src/GZip.jl:270
[3] readimages
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/Reader/readimages.jl:80 [inlined]
[4] #traintensor#2
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:50 [inlined]
[5] traindata(::Type{FixedPointNumbers.N0f8}; dir::Nothing)
@ MLDatasets.MNIST ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:221
[6] #traindata#11
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:225 [inlined]
[7] traindata()
@ MLDatasets.MNIST ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:225
[8] top-level scope
@ In[110]:1
[9] eval
@ ./boot.jl:373 [inlined]
[10] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1196
julia 1.7
macos 12.2.1
apple m1
mem 8g
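The buffer error usually indicates that the gzip files on disk are truncated or corrupt, e.g. from an interrupted download (an assumption about the failure mode, not a confirmed diagnosis). A common workaround is to delete the downloaded files so DataDeps fetches them again:

```julia
# Default DataDeps download location; adjust if DATADEPS_LOAD_PATH
# points elsewhere on your system.
dir = joinpath(homedir(), ".julia", "datadeps", "MNIST")
isdir(dir) && rm(dir; recursive=true)
# The next MNIST.traindata() call will re-download the files.
```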
There is an environment variable DATADEPS_ALWAY_ACCEPT used in 18 places in this repository, but the correct name is DATADEPS_ALWAYS_ACCEPT.
Should I fix this as part of my PR #79? As it stands, the variable does nothing, which may be confusing.
Hello,
Any plans for updating the package for Julia 0.6?
Any plans for making the package installable with Pkg.add?
Pkg.add("MLDatasets")
unknown package MLDatasets
macro expansion at ./pkg/entry.jl:53 [inlined]
(::Base.Pkg.Entry.##1#3{String,Base.Pkg.Types.VersionSet})() at ./task.jl:335
Hi @hshindo any interest in moving this package to JuliaML and/or collaborating with us?
I am new to Julia and working on a medical imaging research project. I would like to add the Medical Segmentation Decathlon datasets (http://medicaldecathlon.com, https://arxiv.org/pdf/1902.09063.pdf) to this repo; I think it would be a great way for me to learn the codebase, and it would likely benefit the entire Julia community. I will definitely need help in this endeavor, though, so please let me know if this is of interest to the contributors of this project.
This is a list of datasets that are available in Flux but not in MLDatasets. It would be useful to add them here soon, so that we can make MLDatasets the default dataset provider for Flux.
The movielens recommendation matrices could be nice additions to MLDatasets
Hi, we would like to have the mutagenesis dataset here.
Should I add it as a PR?
The feature and target-feature names are not properly aligned; everything is placed on a single row, making it cluttered.
Love the package. Would EMNIST be small enough to add?
https://www.nist.gov/itl/products-and-services/emnist-dataset
We should think about how we use CI for testing the dataset interfaces, since we don't want to spam the servers with download requests for potentially very big datasets.
Maybe a good mode would be to not trigger Travis automatically, but instead to trigger it manually every now and then. This way we could also extend the tests to multiple versions and platforms.
@JuliaRegistrator register
It seems that the package is broken on OS X with Julia 1.0+, but it works normally on Linux.
train_x, train_y = MNIST.traindata() # throws an error when trying to download the dataset
Do you want to download the dataset from ["http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"] to "/Users/fineday/.julia/datadeps/MNIST"?
[y/n]
y
ERROR: UndefVarError: GET not defined
Stacktrace:
[1] (::getfield(Base, Symbol("##683#685")))(::Task) at ./asyncmap.jl:178
[2] foreach(::getfield(Base, Symbol("##683#685")), ::Array{Any,1}) at ./abstractarray.jl:1835
[3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{String,1}) at ./asyncmap.jl:178
[4] wrap_n_exec_twice at ./asyncmap.jl:154 [inlined]
[5] #async_usemap#668(::Int64, ::Nothing, ::Function, ::getfield(DataDeps, Symbol("##14#15")){typeof(DataDeps.fetch_http),String}, ::Array{String,1}) at ./asyncmap.jl:103
[6] #async_usemap at ./none:0 [inlined]
[7] #asyncmap#667 at ./asyncmap.jl:81 [inlined]
[8] asyncmap at ./asyncmap.jl:81 [inlined]
[9] run_fetch at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:104 [inlined]
[10] #download#13(::Array{String,1}, ::Nothing, ::Bool, ::Function, ::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:78
[11] download at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:70 [inlined]
[12] handle_missing at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:10 [inlined]
[13] _resolve(::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:83
[14] resolve(::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:29
[15] resolve(::String, ::String, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:54
[16] resolve at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:73 [inlined]
[17] #2 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:17 [inlined]
[18] withenv(::getfield(MLDatasets, Symbol("##2#3")){String,Nothing}, ::Pair{String,String}) at ./env.jl:148
[19] with_accept at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:10 [inlined]
[20] #datadir#1 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:14 [inlined]
[21] datadir at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:14 [inlined]
[22] #datafile#4(::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::String, ::String, ::Nothing) at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:32
[23] datafile at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:32 [inlined]
[24] #traintensor#2 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:54 [inlined]
[25] #traintensor at ./none:0 [inlined]
[26] #traindata#10 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:231 [inlined]
[27] #traindata at ./none:0 [inlined]
[28] #traindata#11 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:235 [inlined]
[29] traindata() at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:235
[30] top-level scope at none:0
After adding the Titanic dataset, I am not able to fully load it for use.
It seems to be stuck printing the line:
INFO: Do you want to download the dataset from String["https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt", "https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.test.txt"] to "/home/jrun/MLDatasets/.julia/datadeps/PTBLM"?
INFO: [y/n]
This is running on JuliaCIBot, so a [y/n] answer cannot be given. It needs to be automatic.
I'd like to contribute 2 synthetic datasets to this package:
The 2D ring dataset
ring_dataset = RingDataset(10, 1.5, 0.1)
x = rand(ring_dataset, 1_000)
Images generated from independent features
feature_dataset = FeatureDataset(get_features_griffiths2011())
x = rand(feature_dataset, 100)
How could I proceed?
PS: I already implemented them in https://github.com/xukai92/MLToolkit.jl/blob/master/src/Datasets/Datasets.jl
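For concreteness, here is a self-contained sketch of what the 2D ring sampler could look like; the constructor arguments are assumed to be (number of modes, radius, noise scale), and the real implementation lives in the MLToolkit.jl link above:

```julia
# Hypothetical RingDataset: `n_modes` Gaussian modes spread evenly on a circle.
struct RingDataset
    n_modes::Int      # number of clusters on the ring
    radius::Float64   # ring radius
    noise::Float64    # per-coordinate Gaussian noise scale
end

# Draw `n` samples as a 2×n matrix, matching a features-by-observations layout.
function Base.rand(d::RingDataset, n::Int)
    θ = 2π .* rand(1:d.n_modes, n) ./ d.n_modes   # pick a mode angle per sample
    x = d.radius .* cos.(θ) .+ d.noise .* randn(n)
    y = d.radius .* sin.(θ) .+ d.noise .* randn(n)
    return permutedims(hcat(x, y))                # 2×n matrix
end

samples = rand(RingDataset(10, 1.5, 0.1), 1_000)
```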
I'd like to open a discussion on how we should move forward with implementing a getobs- and nobs-compliant API, while possibly also simplifying the interface and the maintenance burden.
I think we should move away from the module-based approach and adopt a type-based one. It could also be convenient to have a lean type hierarchy.
Below is an initial proposal.
####### src/datasets.jl
abstract type AbstractDataset end
abstract type FileDataset <: AbstractDataset end
abstract type InMemoryDataset <: AbstractDataset end

###### src/vision/mnist.jl
"""
docstring here, also exposing the internal fields of the struct for transparency
"""
struct MNIST <: InMemoryDataset
    x            # alternative names: `features` or `inputs`
    targets      # `labels` or `y`
    num_classes  # optional
    function MNIST(path=nothing; split = :train)  # split could be made a mandatory keyword arg
        @assert split in [:train, :test]
        ..........
    end
end

LearnBase.getobs(data::MNIST) = (data.x, data.targets)
LearnBase.getobs(data::MNIST, idx) = (data.x[:, idx], data.targets[idx])
LearnBase.nobs(data::MNIST) = length(data.targets)
.... other stuff ....
using MLDatasets: MNIST
using Flux

train_data = MNIST(split = :train)
test_data = MNIST(split = :test)
xtrain, ytrain = getobs(train_data)
xtrain, ytrain = train_data          # we can add this for convenience
xs, ys = getobs(train_data, 1:10)
xs, ys = train_data[1:10]            # we can add this for convenience
train_loader = DataLoader(train_data; batch_size=128)
Do we need transformations as part of the datasets?
This is a possible interface that assumes the transform operates on whatever is returned by getobs:
getobs(data::MNIST, idx) = data.transform(data.x[:, idx], data.y[idx])
Data(split = :train, transform = (x, y) -> (random_crop(x), y))
We can create a deprecation path for the code
using MLDatasets: MNIST
xtrain, ytrain = MNIST.traindata(...)
by implementing
function getproperty(data::MNIST, s::Symbol)
    if s == :traindata
        @warn "deprecated method"
        return ....
    end
    ....
end
The pattern
using MLDatasets.MNIST: traindata
xtrain, ytrain = traindata(...)
is more problematic, because it assumes a module MNIST exists, but this (deprecated) module would collide with the struct MNIST. A workaround is to call the new struct MNISTDataset, although I'm not super happy with this long name.
I am a beginner to Julia.
Getting this with MNIST.traindata():
It would be helpful (for other packages that use this one) to support the newer version:
MAT 0.10
I'll submit a PR and see if the tests pass with this newer version.
Let's keep a running list of datasets we should add. Here are some links to consider:
I tried to use MNIST.convert2image(MNIST.traintensor(1)), but it seems I don't have the ImageCore package. I think it's more reasonable to add it as a dependency.
What are the advantages and disadvantages over DataDeps.jl?
How would the MNIST implementation look if we were to move to DataSets.jl?
cc @c42f
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers. Please see this post on Discourse for instructions and more details.
If you'd like me to do this for you, comment TagBot fix on this issue. I'll open a PR within a few hours, please be patient!
Would it make sense to add the Palmer penguins dataset? It was recently proposed as an alternative to the well-known Iris dataset due to growing sentiment about Ronald Fisher's eugenicist past. Since Iris is included in MLDatasets, I assumed it might fit in here quite well. Or should it rather be added to RDatasets (but then the same argument would apply to the Iris dataset, it seems)?
I am running Julia on Mac OS 11.1, and running into an error when trying to precompile. I get the error message,
ERROR: LoadError: LoadError: error compiling top-level scope: could not load library "libz"
dlopen(libz.dylib, 1): image not found
Stacktrace:
[1] include at ./boot.jl:317 [inlined]
[2] include_relative(::Module, ::String) at ./loading.jl:1044
[3] include at ./sysimg.jl:29 [inlined]
[4] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/GZip.jl:2
[5] top-level scope at none:0
[6] include_relative(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[7] include(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[8] top-level scope at none:2
[9] eval at ./boot.jl:319 [inlined]
[10] eval(::Expr) at ./client.jl:393
[11] top-level scope at ./none:3
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/zlib_h.jl:13
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/GZip.jl:73
ERROR: LoadError: LoadError: LoadError: Failed to precompile GZip [92fee26a-97fe-5a0c-ad85-20a5f3185b63] to /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/compiled/v1.0/GZip/s2LKY.ji.
Stacktrace:
[1] error(::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[2] compilecache(::Base.PkgId, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[3] _require(::Base.PkgId) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[4] require(::Base.PkgId) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:? (repeats 2 times)
[5] include at ./boot.jl:317 [inlined]
[6] include_relative(::Module, ::String) at ./loading.jl:1044
[7] include at ./sysimg.jl:29 [inlined]
[8] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/MNIST.jl:26
[9] top-level scope at none:0
[10] include at ./boot.jl:317 [inlined]
[11] include_relative(::Module, ::String) at ./loading.jl:1044
[12] include at ./sysimg.jl:29 [inlined]
[13] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MLDatasets.jl:1
[14] top-level scope at none:0
[15] include_relative(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[16] include(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[17] top-level scope at none:2
[18] eval at ./boot.jl:319 [inlined]
[19] eval(::Expr) at ./client.jl:393
[20] top-level scope at ./none:3
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/Reader/Reader.jl:2
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/MNIST.jl:70
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MLDatasets.jl:45
I already tried adding zlib-ng with Homebrew, but it didn't work. Would appreciate any suggestions :)
Thanks!
Calling Titanic.features() returns a matrix not in the order prescribed by Titanic.feature_names().
using MLDatasets: Titanic
features = Titanic.features()
returns the following matrix:
11×891 Matrix{Any}:
1 12 23 34 45 56 67 78 89 100 111 122 133 … "S" "S" "S" "C" "S" "Q" "S" "C" "C" "S" "S"
2 13 24 35 46 57 68 79 90 101 112 123 134 "S" "S" "C" "S" "S" "S" "S" "S" "C" "S" "S"
3 14 25 36 47 58 69 80 91 102 113 124 135 "S" "S" "S" "S" "S" "C" "S" "C" "S" "S" "S"
4 15 26 37 48 59 70 81 92 103 114 125 136 "C" "S" "S" "S" "C" "Q" "C" "S" "S" "S" "S"
5 16 27 38 49 60 71 82 93 104 115 126 137 "S" "S" "S" "S" "S" "" "S" "S" "S" "S" "S"
6 17 28 39 50 61 72 83 94 105 116 127 138 … "S" "S" "S" "S" "S" "C" "S" "C" "S" "C" "Q"
7 18 29 40 51 62 73 84 95 106 117 128 139 "Q" "Q" "C" "S" "S" "S" "C" "S" "S" "C" "S"
8 19 30 41 52 63 74 85 96 107 118 129 140 "S" "S" "S" "S" "S" "C" "C" "S" "S" "S" "S"
9 20 31 42 53 64 75 86 97 108 119 130 141 "Q" "C" "S" "S" "S" "S" "S" "S" "C" "S" "S"
10 21 32 43 54 65 76 87 98 109 120 131 142 "S" "Q" "S" "S" "S" "S" "S" "S" "S" "S" "C"
11 22 33 44 55 66 77 88 99 110 121 132 143 … "C" "S" "S" "S" "S" "C" "S" "S" "S" "C" "Q"
Upon observation, it seems that the CSV file is read in a sequential manner and values are placed in the wrong matrix elements.
A call to Titanic.features() should return a matrix of the form
1 0 3 "Braund, Mr. Owen Harris" ... 7.25 "" S
2 1 1 "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" ... 71.2833 C85 C
How do we deal with datasets that are too large to be read into an Array? Something of 5-50 GB, for example. Are there any tools for this, or earlier discussion?
I thought about: an AbstractArray interface, so existing tools will work. Cons: some tools and algorithms may expect data to be in memory, while for disk-backed arrays performance will drop drastically.
We could have a "processed" folder in each dataset folder where we write the dataset object the first time we create it. On subsequent constructions, e.g. d = MNIST(),
we just load the JLD2 file.
Example:
function MNIST(...)
    dataset_dir = ...
    processed_file = joinpath(dataset_dir, "processed", "dataset.jld2")
    if isfile(processed_file)
        return FileIO.load(processed_file, "dataset")
    end
    mnist = ...
    # cache the freshly built dataset for future constructions
    mkpath(dirname(processed_file))
    FileIO.save(processed_file, Dict("dataset" => mnist))
    return mnist
end
The command found in the package documentation
MNIST.convert2features(MNIST.traintensor())
now produces the error
┌ Warning: convert2features is deprecated, use reshape instead.
│ caller = top-level scope at In[7]:1
└ @ Core In[7]:1
DimensionMismatch("parent has 47040000 elements, which is incompatible with size ()")
Using Julia 1.4
Thanks
We can add a page to the docs linking to other repos that make it easy to download datasets in julia. Examples:
I got the following error message from tst_mnist.jl when I ran ] test MLDatasets:
Got exception outside of a @test could not load symbol "gzopen64": dlsym(0xfff153c1b3a0, gzopen64): symbol not found
Test Summary: | Pass Error Total
tst_mnist.jl | 18 90 108
  Constants | 7 7
  convert2images | 11 11
  File Header | 4 4
  Images | 72 72
    Test that traintensor are the train images | 30 30
    Test that testtensor are the test images | 30 30
    traintensor with T=Float32 | 1 1
    traintensor with T=Float64 | 1 1
    traintensor with T=N0f8 | 1 1
    traintensor with T=Int64 | 1 1
    traintensor with T=UInt8 | 1 1
    testtensor with T=Float32 | 1 1
    testtensor with T=Float64 | 1 1
    testtensor with T=N0f8 | 1 1
    testtensor with T=Int64 | 1 1
    testtensor with T=UInt8 | 1 1
  Labels | 12 12
    trainlabels | 1 1
    testlabels | 1 1
  Data | 2 2
    check traindata against traintensor and trainlabels | 1 1
    check testdata against testtensor and testlabels | 1 1
ERROR: LoadError: Some tests did not pass: 18 passed, 0 failed, 90 errored, 0 broken.
I tried reinstalling the package but it didn't work.
My OS is macOS Monterey 12.0.1, and I am using Julia v1.6.3 with MLDatasets v0.5.13.
All links in the "datasets" section of the home page (that should point to the specific dataset documentation) are broken (404 not found error), e.g. https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/
Make sure that most v0.5 code (e.g. MNIST.traindata(Float32)) keeps working in v0.6, but with a deprecation warning.
Related to #73.
Having to say "yes" each time seems just an annoyance.
I think this should be considered a breaking change.
@johnnychen94 the last commit failed to deploy the docs:
https://github.com/JuliaML/MLDatasets.jl/runs/3172885457
Something is wrong with the DOCUMENTER_KEY; do you know how to fix it?
The OGB datasets are an important graph benchmark collection.
The Open Graph Benchmark (OGB) aims to provide graph datasets that cover important graph machine learning tasks, diverse dataset scales, and rich domains.
I hope it can be added here.
I do not know if this issue is related to MLDatasets or Images, but trying to using MLDatasets, Images results in a segmentation fault for me.
Package versions (Julia 1.6.0):
Images v0.24.1
MLDatasets v0.5.6
ImageNet is quite large and locked behind terms of access that require an account. However, it would be nice to be able to either load a pre-downloaded local copy, or pass the required credentials (e.g. via an ENV variable) to download ImageNet through MLDatasets, and be able to use MLDatasets' interface of
train_x, train_y = ImageNet.traindata()
test_x, test_y = ImageNet.testdata()
as well as ImageNet.convert2image(x).
Ideally data would be in WHCN format for Flux and Metalhead models.
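As a sketch of the last point, converting a single image from the H×W×C layout that image loaders commonly produce (an assumption here) into the WHCN layout that Flux and Metalhead models expect can be done with permutedims and reshape:

```julia
img = rand(Float32, 224, 224, 3)   # stand-in for a decoded H×W×C image
x = permutedims(img, (2, 1, 3))    # H×W×C -> W×H×C
x = reshape(x, size(x)..., 1)      # append the batch dim: W×H×C×N
```

For a batch, the N dimension would instead hold the number of images.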
We could drop the dependency on GZip, which is not actively maintained, by just calling DataDeps.unpack (which relies on the p7zip binary) to uncompress files. It would be easy to change the MNISTReader & co. to just do that. This could solve issues like #118.
Additionally, we may want to go in the direction of saving processed versions of the datasets (e.g. a JLD2 save of the dataset object itself) for faster I/O.
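A minimal sketch of the idea (assuming DataDeps' unpack post-fetch helper; the paths and the `datadir` variable are hypothetical): decompress the .gz once after download, then read the plain IDX file with ordinary I/O instead of going through GZip streams:

```julia
using DataDeps

# Hypothetical paths; unpack decompresses the archive in place
# using the p7zip binary that ships with the Julia ecosystem.
gzfile = joinpath(datadir, "train-images-idx3-ubyte.gz")
DataDeps.unpack(gzfile)
bytes = open(read, joinpath(datadir, "train-images-idx3-ubyte"))
```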
They deserve a bit of love in https://juliaml.github.io/MLDatasets.jl/latest/
Hi,
this package seems useful, in good shape, and also easy to maintain. Is there something preventing an official release?
The table in the Data Size section of the documentation doesn't appear correctly formatted (even though it renders correctly when previewing the markdown on GitHub).
help
(v1.1) pkg> add MLDatasets
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
Installed GZip ─────── v0.5.0
Installed DataDeps ─── v0.6.2
Installed MLDatasets ─ v0.3.0
Updating `~/.julia/environments/v1.1/Project.toml`
[eb30cadb] + MLDatasets v0.3.0
Updating `~/.julia/environments/v1.1/Manifest.toml`
[124859b0] + DataDeps v0.6.2
[92fee26a] + GZip v0.5.0
[eb30cadb] + MLDatasets v0.3.0
julia> using MLDatasets
[ Info: Precompiling MLDatasets [eb30cadb-4394-5ae3-aed4-317e484a6458]
ERROR: LoadError: LoadError: error compiling top-level scope: could not load library "libz"
libz.so: cannot open shared object file: No such file or directory
Stacktrace:
[1] include at ./boot.jl:326 [inlined]
[2] include_relative(::Module, ::String) at ./loading.jl:1038
[3] include at ./sysimg.jl:29 [inlined]
[4] include(::String) at /home/dicbro/.julia/packages/GZip/LD2ly/src/GZip.jl:2
[5] top-level scope at none:0
[6] include at ./boot.jl:326 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1038
[8] include(::Module, ::String) at ./sysimg.jl:29
[9] top-level scope at none:2
[10] eval at ./boot.jl:328 [inlined]
[11] eval(::Expr) at ./client.jl:404
[12] top-level scope at ./none:3
in expression starting at /home/dicbro/.julia/packages/GZip/LD2ly/src/zlib_h.jl:11
in expression starting at /home/dicbro/.julia/packages/GZip/LD2ly/src/GZip.jl:73
ERROR: LoadError: LoadError: LoadError: Failed to precompile GZip [92fee26a-97fe-5a0c-ad85-20a5f3185b63] to /home/dicbro/.julia/compiled/v1.1/GZip/s2LKY.ji.
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
[3] _require(::Base.PkgId) at ./loading.jl:960
[4] require(::Base.PkgId) at ./loading.jl:858
[5] require(::Module, ::Symbol) at ./loading.jl:853
[6] include at ./boot.jl:326 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1038
[8] include at ./sysimg.jl:29 [inlined]
[9] include(::String) at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/MNIST.jl:26
[10] top-level scope at none:0
[11] include at ./boot.jl:326 [inlined]
[12] include_relative(::Module, ::String) at ./loading.jl:1038
[13] include at ./sysimg.jl:29 [inlined]
[14] include(::String) at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MLDatasets.jl:1
[15] top-level scope at none:0
[16] include at ./boot.jl:326 [inlined]
[17] include_relative(::Module, ::String) at ./loading.jl:1038
[18] include(::Module, ::String) at ./sysimg.jl:29
[19] top-level scope at none:2
[20] eval at ./boot.jl:328 [inlined]
[21] eval(::Expr) at ./client.jl:404
[22] top-level scope at ./none:3
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/Reader/Reader.jl:2
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/MNIST.jl:70
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MLDatasets.jl:45
ERROR: Failed to precompile MLDatasets [eb30cadb-4394-5ae3-aed4-317e484a6458] to /home/dicbro/.julia/compiled/v1.1/MLDatasets/9CUQK.ji.
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
[3] _require(::Base.PkgId) at ./loading.jl:960
[4] require(::Base.PkgId) at ./loading.jl:858
[5] require(::Module, ::Symbol) at ./loading.jl:853