juliaml / mldatasets.jl
Utility package for accessing common Machine Learning datasets in Julia
Home Page: https://juliaml.github.io/MLDatasets.jl/stable
License: MIT License
One of the most famous datasets for a beginner to start with is the Titanic dataset, which is used for exploratory data analysis and for predicting outcomes with logistic regression, decision trees, random forests, etc. I think this dataset should be added so that beginners can easily get started with machine learning in Julia using a beginner-friendly dataset. If approved, I am willing to work on this issue, as it would be a great addition alongside the other well-known datasets.
julia> MNIST.traindata()
This program has requested access to the data dependency MNIST.
which is not currently installed. It can be installed automatically, and you will not see this message again.
Do you want to download the dataset from ["https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz", "https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz"] to "/Users/anthony/.julia/datadeps/MNIST"?
[y/n]
This kind of interactive behaviour can be a headache when loading data in an automated setting, as one does not know ahead of time whether a prompt will appear. See, for example, this issue:
FluxML/MLJFlux.jl#141 (comment)
Could we perhaps have an optional kwarg, as in MNIST.training_data(force=true)? Or is there already a way to do this?
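For reference, DataDeps.jl documents an environment variable that suppresses exactly this prompt; setting it before the first download call makes loading non-interactive. A minimal sketch:

```julia
# DataDeps.jl skips the [y/n] prompt when this variable is set,
# which is what automated settings like CI need.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

# After this, e.g. MNIST.traindata() downloads without asking.
```

This has to run before the download is triggered, e.g. at the top of a CI script.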
Some of the features of the OGBDataset are downloaded as torch tensors stored in the ".pt" format. They are currently ignored, but we could load them using Pickle.jl (e.g. see this comment).
Hello,
I've been trying to use the package for the first time:
julia> using MLDatasets
julia> MNIST.traindata(1)
The function tries to download the data from http://yann.lecun.com/exdb/mnist/, but it seems the website is down.
In a fresh Julia 1.7 session:
julia> @time using MLDatasets
13.485246 seconds (20.31 M allocations: 1.158 GiB, 7.56% gc time, 61.69% compilation time)
Is there a way to conditionally import packages?
julia> for pkg in [:ImageCore, :CSV, :HDF5, :JLD2, :JSON3]; print(pkg); @time @eval using $pkg; end
ImageCore 2.141235 seconds (3.02 M allocations: 200.377 MiB, 4.50% gc time, 32.08% compilation time)
CSV 3.817959 seconds (6.14 M allocations: 348.493 MiB, 9.81% gc time, 90.11% compilation time)
HDF5 0.723358 seconds (1.34 M allocations: 73.225 MiB, 1.69% gc time, 93.80% compilation time)
JLD2 1.139716 seconds (1.36 M allocations: 78.966 MiB, 3.95% gc time, 60.77% compilation time)
JSON3 0.033367 seconds (49.09 k allocations: 3.014 MiB)
julia> for pkg in [:DataFrames, :MLUtils, :Pickle, :NPZ, :MAT]; print(pkg); @time @eval using $pkg; end
DataFrames 1.789793 seconds (2.03 M allocations: 137.197 MiB, 4.63% gc time)
MLUtils 1.743072 seconds (2.07 M allocations: 117.900 MiB, 4.83% gc time, 47.32% compilation time)
Pickle 0.130685 seconds (159.17 k allocations: 9.751 MiB, 17.77% compilation time)
NPZ 0.504406 seconds (1.19 M allocations: 61.838 MiB, 4.05% gc time, 98.87% compilation time)
MAT 0.009792 seconds (22.84 k allocations: 1.044 MiB)
Related discourse thread
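One way to attack the load time is to defer heavy dependencies until first use. A minimal stdlib-only sketch of the idea (the `lazyload` helper is hypothetical, and `Dates` stands in for a heavy package such as CSV or HDF5):

```julia
# Lazy-loading sketch: import a module the first time it is requested
# instead of at package load time. Assumes the package is already
# installed in the active environment.
function lazyload(pkg::Symbol)::Module
    isdefined(Main, pkg) || @eval Main import $pkg
    return getfield(Main, pkg)
end

D = lazyload(:Dates)   # the import cost is paid here, on first use
```

Requires.jl offers a more principled version of this pattern via `@require` hooks in a package's `__init__`.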
Ref: #57 (comment)
We currently use DataDeps as an interface to download datasets from their original websites. While this makes the license and source clear, it can be bad for reproducibility, because users worldwide may have difficulty connecting to the original sites, and the sites themselves may go offline for various reasons, e.g., #57.
To avoid issues like #57 in the future, and to speed up dataset downloading, we could take advantage of Julia's Artifacts system and let the Pkg/Storage servers host and distribute the datasets. MLDatasets doesn't hold large datasets, so this adds little stress to the Julia ecosystem.
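As a sketch of what this could look like, an Artifacts.toml entry for MNIST might read as follows (the hash values and URL are placeholders, not real):

```toml
[mnist]
git-tree-sha1 = "0000000000000000000000000000000000000000"
lazy = true

    [[mnist.download]]
    url = "https://example.org/mnist.tar.gz"
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000"
```

The package would then access the files with the `artifact"mnist"` string macro (from Artifacts/LazyArtifacts), and Pkg servers would mirror the tarball automatically.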
using MLDatasets
tx,ty = MNIST.traindata()
GZip.ZError(-5, "buffer error")
Stacktrace:
[1] close(s::GZip.GZipStream)
@ GZip ~/.julia/packages/GZip/JNmGn/src/GZip.jl:163
[2] gzopen(::MLDatasets.MNIST.Reader.var"#5#6", ::String, ::String)
@ GZip ~/.julia/packages/GZip/JNmGn/src/GZip.jl:270
[3] readimages
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/Reader/readimages.jl:80 [inlined]
[4] #traintensor#2
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:50 [inlined]
[5] traindata(::Type{FixedPointNumbers.N0f8}; dir::Nothing)
@ MLDatasets.MNIST ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:221
[6] #traindata#11
@ ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:225 [inlined]
[7] traindata()
@ MLDatasets.MNIST ~/.julia/packages/MLDatasets/N3Lgo/src/MNIST/interface.jl:225
[8] top-level scope
@ In[110]:1
[9] eval
@ ./boot.jl:373 [inlined]
[10] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1196
julia 1.7
macos 12.2.1
apple m1
mem 8g
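The buffer error usually indicates that the gzip files on disk are truncated or corrupt, e.g. from an interrupted download (an assumption about the failure mode, not a confirmed diagnosis). A common workaround is to delete the downloaded files so DataDeps fetches them again:

```julia
# Default DataDeps download location; adjust if DATADEPS_LOAD_PATH
# points elsewhere on your system.
dir = joinpath(homedir(), ".julia", "datadeps", "MNIST")
isdir(dir) && rm(dir; recursive=true)
# The next MNIST.traindata() call will re-download the files.
```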
There is an environment variable DATADEPS_ALWAY_ACCEPT used in 18 places in this repository, but the correct name is DATADEPS_ALWAYS_ACCEPT.
Should I fix this as part of my PR #79? As it stands, the variable does nothing, which may be confusing.
Hello,
Any plans for updating the package for Julia 0.6?
Any plans for making the package installable with Pkg.add?
Pkg.add("MLDatasets")
unknown package MLDatasets
macro expansion at ./pkg/entry.jl:53 [inlined]
(::Base.Pkg.Entry.##1#3{String,Base.Pkg.Types.VersionSet})() at ./task.jl:335
Hi @hshindo any interest in moving this package to JuliaML and/or collaborating with us?
I am new to Julia and working on a medical imaging research project. I would like to add the Medical Segmentation Decathlon datasets (http://medicaldecathlon.com, https://arxiv.org/pdf/1902.09063.pdf) to this repo; I think it would be a great way for me to learn the codebase, and it would likely benefit the entire Julia community. I will definitely need help in this endeavor, though, so please let me know if this is of interest to the contributors of this project.
This is a list of datasets that are available in Flux but not in MLDatasets. It would be useful to add them here soon, so that we can make MLDatasets the default dataset provider for Flux.
The movielens recommendation matrices could be nice additions to MLDatasets
Hi, we would like to have the mutagenesis dataset here.
Should I add it as a PR?
The feature and target-feature names are not properly aligned; everything is placed on a single row, making it cluttered.
Love the package. Would EMNIST be small enough to add?
https://www.nist.gov/itl/products-and-services/emnist-dataset
We should think about how we use CI for testing the dataset interfaces, since we don't want to spam the servers with download requests for potentially very big datasets.
Maybe a good mode would be to not trigger Travis automatically, but instead to trigger it manually every now and then. This way we could also extend the tests to multiple versions and platforms.
@JuliaRegistrator register
It seems that the package is broken on OS X with Julia 1.0+, but it works normally on Linux.
train_x, train_y = MNIST.traindata() # throws an error when trying to download the dataset
Do you want to download the dataset from ["http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"] to "/Users/fineday/.julia/datadeps/MNIST"?
[y/n]
y
ERROR: UndefVarError: GET not defined
Stacktrace:
[1] (::getfield(Base, Symbol("##683#685")))(::Task) at ./asyncmap.jl:178
[2] foreach(::getfield(Base, Symbol("##683#685")), ::Array{Any,1}) at ./abstractarray.jl:1835
[3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{String,1}) at ./asyncmap.jl:178
[4] wrap_n_exec_twice at ./asyncmap.jl:154 [inlined]
[5] #async_usemap#668(::Int64, ::Nothing, ::Function, ::getfield(DataDeps, Symbol("##14#15")){typeof(DataDeps.fetch_http),String}, ::Array{String,1}) at ./asyncmap.jl:103
[6] #async_usemap at ./none:0 [inlined]
[7] #asyncmap#667 at ./asyncmap.jl:81 [inlined]
[8] asyncmap at ./asyncmap.jl:81 [inlined]
[9] run_fetch at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:104 [inlined]
[10] #download#13(::Array{String,1}, ::Nothing, ::Bool, ::Function, ::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:78
[11] download at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:70 [inlined]
[12] handle_missing at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution_automatic.jl:10 [inlined]
[13] _resolve(::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:83
[14] resolve(::DataDeps.DataDep{String,Array{String,1},typeof(DataDeps.fetch_http),typeof(identity)}, ::String, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:29
[15] resolve(::String, ::String, ::String) at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:54
[16] resolve at /Users/fineday/.julia/packages/DataDeps/CDQwy/src/resolution.jl:73 [inlined]
[17] #2 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:17 [inlined]
[18] withenv(::getfield(MLDatasets, Symbol("##2#3")){String,Nothing}, ::Pair{String,String}) at ./env.jl:148
[19] with_accept at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:10 [inlined]
[20] #datadir#1 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:14 [inlined]
[21] datadir at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:14 [inlined]
[22] #datafile#4(::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::String, ::String, ::Nothing) at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:32
[23] datafile at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/download.jl:32 [inlined]
[24] #traintensor#2 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:54 [inlined]
[25] #traintensor at ./none:0 [inlined]
[26] #traindata#10 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:231 [inlined]
[27] #traindata at ./none:0 [inlined]
[28] #traindata#11 at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:235 [inlined]
[29] traindata() at /Users/fineday/.julia/packages/MLDatasets/yNB45/src/MNIST/interface.jl:235
[30] top-level scope at none:0
After adding the Titanic dataset, I am not able to fully load it for use.
It seems to be stuck printing the line:
INFO: Do you want to download the dataset from String["https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt", "https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.test.txt"] to "/home/jrun/MLDatasets/.julia/datadeps/PTBLM"?
INFO: [y/n]
This is running on JuliaCIBot, so a [y/n] answer cannot be given. It needs to be automatic.
I'd like to contribute 2 synthetic datasets to this package:
The 2D ring dataset
ring_dataset = RingDataset(10, 1.5, 0.1)
x = rand(ring_dataset, 1_000)
Images generated from independent features
feature_dataset = FeatureDataset(get_features_griffiths2011())
x = rand(feature_dataset, 100)
How could I proceed?
PS: I already implemented them in https://github.com/xukai92/MLToolkit.jl/blob/master/src/Datasets/Datasets.jl
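For concreteness, here is a self-contained sketch of what the 2D ring sampler could look like; the constructor arguments are assumed to be (number of modes, radius, noise scale), and the real implementation lives in the MLToolkit.jl link above:

```julia
# Hypothetical RingDataset: `n_modes` Gaussian modes spread evenly on a circle.
struct RingDataset
    n_modes::Int      # number of clusters on the ring
    radius::Float64   # ring radius
    noise::Float64    # per-coordinate Gaussian noise scale
end

# Draw `n` samples as a 2×n matrix, matching a features-by-observations layout.
function Base.rand(d::RingDataset, n::Int)
    θ = 2π .* rand(1:d.n_modes, n) ./ d.n_modes   # pick a mode angle per sample
    x = d.radius .* cos.(θ) .+ d.noise .* randn(n)
    y = d.radius .* sin.(θ) .+ d.noise .* randn(n)
    return permutedims(hcat(x, y))                # 2×n matrix
end

samples = rand(RingDataset(10, 1.5, 0.1), 1_000)
```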
I'd like to open a discussion on how we should move forward with implementing a getobs- and nobs-compliant API, while possibly also simplifying the interface and the maintenance burden.
I think we should move away from the module-based approach and adopt a type-based one. It could also be convenient to have a lean type hierarchy.
Below is an initial proposal.
####### src/datasets.jl
abstract type AbstractDataset end
abstract type FileDataset <: AbstractDataset end
abstract type InMemoryDataset <: AbstractDataset end

###### src/vision/mnist.jl
"""
docstring here, also exposing the internal fields of the struct for transparency
"""
struct MNIST <: InMemoryDataset
    x            # alternative names: `features` or `inputs`
    targets      # `labels` or `y`
    num_classes  # optional
    function MNIST(path=nothing; split = :train)  # split could be made a mandatory keyword arg
        @assert split in [:train, :test]
        ..........
    end
end

LearnBase.getobs(data::MNIST) = (data.x, data.targets)
LearnBase.getobs(data::MNIST, idx) = (data.x[:, idx], data.targets[idx])
LearnBase.nobs(data::MNIST) = length(data.targets)
.... other stuff ....
using MLDatasets: MNIST
using Flux

train_data = MNIST(split = :train)
test_data = MNIST(split = :test)
xtrain, ytrain = getobs(train_data)
xtrain, ytrain = train_data          # we can add this for convenience
xs, ys = getobs(train_data, 1:10)
xs, ys = train_data[1:10]            # we can add this for convenience
train_loader = DataLoader(train_data; batch_size=128)
Do we need transformations as part of the datasets?
This is a possible interface that assumes the transform operates on whatever is returned by getobs:
getobs(data::MNIST, idx) = data.transform(data.x[:, idx], data.y[idx])
Data(split = :train, transform = (x, y) -> (random_crop(x), y))
We can create a deprecation path for the code
using MLDatasets: MNIST
xtrain, ytrain = MNIST.traindata(...)
by implementing
function getproperty(data::MNIST, s::Symbol)
    if s == :traindata
        @warn "deprecated method"
        return ....
    end
    ....
end
The pattern
using MLDatasets.MNIST: traindata
xtrain, ytrain = traindata(...)
is more problematic, because it assumes a module MNIST exists, but this (deprecated) module would collide with the struct MNIST. A workaround is to call the new struct MNISTDataset, although I'm not super happy with this long name.
I am a beginner to Julia.
Getting this with MNIST.traindata():
It would be helpful (for other packages that use this one) to support the newer version:
MAT 0.10
I'll submit a PR and see if the tests pass with this newer version.
Let's keep a running list of datasets we should add. Here are some links to consider:
I tried to use MNIST.convert2image(MNIST.traintensor(1)), but it seems I don't have the ImageCore package. I think it's more reasonable to add it as a dependency.
What are the advantages and disadvantages over DataDeps.jl?
How would the MNIST implementation look if we were to move to DataSets.jl?
cc @c42f
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers. Please see this post on Discourse for instructions and more details.
If you'd like me to do this for you, comment TagBot fix on this issue. I'll open a PR within a few hours, please be patient!
Would it make sense to add the Palmer penguins dataset? It was recently proposed as an alternative to the well-known Iris dataset due to growing sentiment about Ronald Fisher's eugenicist past. Since Iris is included in MLDatasets, I assumed it might fit in here quite well. Or should it rather be added to RDatasets (but then the same argument would apply to the Iris dataset, it seems)?
I am running Julia on Mac OS 11.1, and running into an error when trying to precompile. I get the error message,
ERROR: LoadError: LoadError: error compiling top-level scope: could not load library "libz"
dlopen(libz.dylib, 1): image not found
Stacktrace:
[1] include at ./boot.jl:317 [inlined]
[2] include_relative(::Module, ::String) at ./loading.jl:1044
[3] include at ./sysimg.jl:29 [inlined]
[4] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/GZip.jl:2
[5] top-level scope at none:0
[6] include_relative(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[7] include(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[8] top-level scope at none:2
[9] eval at ./boot.jl:319 [inlined]
[10] eval(::Expr) at ./client.jl:393
[11] top-level scope at ./none:3
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/zlib_h.jl:13
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/GZip/JNmGn/src/GZip.jl:73
ERROR: LoadError: LoadError: LoadError: Failed to precompile GZip [92fee26a-97fe-5a0c-ad85-20a5f3185b63] to /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/compiled/v1.0/GZip/s2LKY.ji.
Stacktrace:
[1] error(::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[2] compilecache(::Base.PkgId, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[3] _require(::Base.PkgId) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[4] require(::Base.PkgId) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:? (repeats 2 times)
[5] include at ./boot.jl:317 [inlined]
[6] include_relative(::Module, ::String) at ./loading.jl:1044
[7] include at ./sysimg.jl:29 [inlined]
[8] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/MNIST.jl:26
[9] top-level scope at none:0
[10] include at ./boot.jl:317 [inlined]
[11] include_relative(::Module, ::String) at ./loading.jl:1044
[12] include at ./sysimg.jl:29 [inlined]
[13] include(::String) at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MLDatasets.jl:1
[14] top-level scope at none:0
[15] include_relative(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[16] include(::Module, ::String) at /Applications/JuliaPro-1.0.5-2.app/Contents/Resources/julia/Contents/Resources/julia/lib/julia/sys.dylib:?
[17] top-level scope at none:2
[18] eval at ./boot.jl:319 [inlined]
[19] eval(::Expr) at ./client.jl:393
[20] top-level scope at ./none:3
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/Reader/Reader.jl:2
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MNIST/MNIST.jl:70
in expression starting at /Users/hannah/.juliapro/JuliaPro_v1.0.5-2/packages/MLDatasets/GU5Hj/src/MLDatasets.jl:45
I already tried adding zlib-ng with Homebrew, but it didn't work. Would appreciate any suggestions :)
Thanks!
Calling Titanic.features() returns a matrix not in the order prescribed by Titanic.feature_names().
using MLDatasets: Titanic
features = Titanic.features()
returns the following matrix:
11×891 Matrix{Any}:
1 12 23 34 45 56 67 78 89 100 111 122 133 … "S" "S" "S" "C" "S" "Q" "S" "C" "C" "S" "S"
2 13 24 35 46 57 68 79 90 101 112 123 134 "S" "S" "C" "S" "S" "S" "S" "S" "C" "S" "S"
3 14 25 36 47 58 69 80 91 102 113 124 135 "S" "S" "S" "S" "S" "C" "S" "C" "S" "S" "S"
4 15 26 37 48 59 70 81 92 103 114 125 136 "C" "S" "S" "S" "C" "Q" "C" "S" "S" "S" "S"
5 16 27 38 49 60 71 82 93 104 115 126 137 "S" "S" "S" "S" "S" "" "S" "S" "S" "S" "S"
6 17 28 39 50 61 72 83 94 105 116 127 138 … "S" "S" "S" "S" "S" "C" "S" "C" "S" "C" "Q"
7 18 29 40 51 62 73 84 95 106 117 128 139 "Q" "Q" "C" "S" "S" "S" "C" "S" "S" "C" "S"
8 19 30 41 52 63 74 85 96 107 118 129 140 "S" "S" "S" "S" "S" "C" "C" "S" "S" "S" "S"
9 20 31 42 53 64 75 86 97 108 119 130 141 "Q" "C" "S" "S" "S" "S" "S" "S" "C" "S" "S"
10 21 32 43 54 65 76 87 98 109 120 131 142 "S" "Q" "S" "S" "S" "S" "S" "S" "S" "S" "C"
11 22 33 44 55 66 77 88 99 110 121 132 143 … "C" "S" "S" "S" "S" "C" "S" "S" "S" "C" "Q"
Upon observation, it seems that the CSV file is read in a sequential manner and values are placed in the wrong matrix elements.
A call to Titanic.features() should return a matrix of the form
1 0 3 "Braund, Mr. Owen Harris" ... 7.25 "" S
2 1 1 "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" ... 71.2833 C85 C
How do we deal with datasets that are too large to be read into an Array? Something of 5-50 GB, for example. Are there any tools for this, or earlier discussion?
I thought about: an AbstractArray interface, so existing tools will work. Cons: some tools and algorithms may expect data to be in memory, while for disk-backed arrays performance will drop drastically.
We could have a "processed" folder in each dataset folder where we write the dataset object the first time we create it. On subsequent constructions, e.g. d = MNIST(),
we just load the JLD2 file.
Example:
function MNIST(...)
    dataset_dir = ...
    processed_file = joinpath(dataset_dir, "processed", "dataset.jld2")
    if isfile(processed_file)
        return FileIO.load(processed_file, "dataset")
    end
    mnist = ...
    # cache the freshly built dataset for future constructions
    mkpath(dirname(processed_file))
    FileIO.save(processed_file, Dict("dataset" => mnist))
    return mnist
end
The command found in the package documentation
MNIST.convert2features(MNIST.traintensor())
now produces the error
┌ Warning: convert2features is deprecated, use reshape instead.
│ caller = top-level scope at In[7]:1
└ @ Core In[7]:1
DimensionMismatch("parent has 47040000 elements, which is incompatible with size ()")
Using Julia 1.4
Thanks
We can add a page to the docs linking to other repos that make it easy to download datasets in julia. Examples:
I got the following error message from tst_mnist.jl when I ran ] test MLDatasets:
Got exception outside of a @test could not load symbol "gzopen64": dlsym(0xfff153c1b3a0, gzopen64): symbol not found
Test Summary: | Pass Error Total
tst_mnist.jl | 18 90 108
  Constants | 7 7
  convert2images | 11 11
  File Header | 4 4
  Images | 72 72
    Test that traintensor are the train images | 30 30
    Test that testtensor are the test images | 30 30
    traintensor with T=Float32 | 1 1
    traintensor with T=Float64 | 1 1
    traintensor with T=N0f8 | 1 1
    traintensor with T=Int64 | 1 1
    traintensor with T=UInt8 | 1 1
    testtensor with T=Float32 | 1 1
    testtensor with T=Float64 | 1 1
    testtensor with T=N0f8 | 1 1
    testtensor with T=Int64 | 1 1
    testtensor with T=UInt8 | 1 1
  Labels | 12 12
    trainlabels | 1 1
    testlabels | 1 1
  Data | 2 2
    check traindata against traintensor and trainlabels | 1 1
    check testdata against testtensor and testlabels | 1 1
ERROR: LoadError: Some tests did not pass: 18 passed, 0 failed, 90 errored, 0 broken.
I tried reinstalling the package but it didn't work.
My OS is macOS Monterey 12.0.1, and I am using Julia v1.6.3 with MLDatasets v0.5.13.
All links in the "datasets" section of the home page (that should point to the specific dataset documentation) are broken (404 not found error), e.g. https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/
Make sure that most v0.5 code (e.g. MNIST.traindata(Float32)) keeps working in v0.6, but with a deprecation warning.
Related to #73.
Having to say "yes" each time seems just an annoyance.
I think this should be considered a breaking change.
@johnnychen94 the last commit failed to deploy the docs:
https://github.com/JuliaML/MLDatasets.jl/runs/3172885457
Something is wrong with the DOCUMENTER_KEY; do you know how to fix it?
The OGB datasets are an important graph benchmark collection.
The Open Graph Benchmark (OGB) aims to provide graph datasets that cover important graph machine learning tasks, diverse dataset scales, and rich domains.
I hope it can be added here.
I do not know if this issue is related to MLDatasets or Images, but trying to using MLDatasets, Images results in a segmentation fault for me.
Package versions (Julia 1.6.0):
Images v0.24.1
MLDatasets v0.5.6
ImageNet is quite large and locked behind terms of access that require an account. However, it would be nice to be able to either load a pre-downloaded local copy, or pass the required credentials (e.g. via an ENV variable) to download ImageNet through MLDatasets, and be able to use MLDatasets' interface of
train_x, train_y = ImageNet.traindata()
test_x, test_y = ImageNet.testdata()
as well as ImageNet.convert2image(x).
Ideally data would be in WHCN format for Flux and Metalhead models.
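As a sketch of the last point, converting a single image from the H×W×C layout that image loaders commonly produce (an assumption here) into the WHCN layout that Flux and Metalhead models expect can be done with permutedims and reshape:

```julia
img = rand(Float32, 224, 224, 3)   # stand-in for a decoded H×W×C image
x = permutedims(img, (2, 1, 3))    # H×W×C -> W×H×C
x = reshape(x, size(x)..., 1)      # append the batch dim: W×H×C×N
```

For a batch, the N dimension would instead hold the number of images.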
We could drop the dependency on GZip, which is not actively maintained, by just calling DataDeps.unpack (which relies on the p7zip binary) to uncompress files. It would be easy to change the MNISTReader & co. to just do that. This could solve issues like #118.
Additionally, we may want to go in the direction of saving processed versions of the datasets (e.g. a JLD2 save of the dataset object itself) for faster I/O.
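A minimal sketch of the idea (assuming DataDeps' unpack post-fetch helper; the paths and the `datadir` variable are hypothetical): decompress the .gz once after download, then read the plain IDX file with ordinary I/O instead of going through GZip streams:

```julia
using DataDeps

# Hypothetical paths; unpack decompresses the archive in place
# using the p7zip binary that ships with the Julia ecosystem.
gzfile = joinpath(datadir, "train-images-idx3-ubyte.gz")
DataDeps.unpack(gzfile)
bytes = open(read, joinpath(datadir, "train-images-idx3-ubyte"))
```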
They deserve a bit of love in https://juliaml.github.io/MLDatasets.jl/latest/
Hi,
this package seems useful, in good shape, and also easy to maintain. Is there something preventing an official release?
The table in the Data Size section of the documentation doesn't appear correctly formatted (even though it renders correctly when previewing the markdown on GitHub).
help
(v1.1) pkg> add MLDatasets
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
Installed GZip ─────── v0.5.0
Installed DataDeps ─── v0.6.2
Installed MLDatasets ─ v0.3.0
Updating `~/.julia/environments/v1.1/Project.toml`
[eb30cadb] + MLDatasets v0.3.0
Updating `~/.julia/environments/v1.1/Manifest.toml`
[124859b0] + DataDeps v0.6.2
[92fee26a] + GZip v0.5.0
[eb30cadb] + MLDatasets v0.3.0
julia> using MLDatasets
[ Info: Precompiling MLDatasets [eb30cadb-4394-5ae3-aed4-317e484a6458]
ERROR: LoadError: LoadError: error compiling top-level scope: could not load library "libz"
libz.so: cannot open shared object file: No such file or directory
Stacktrace:
[1] include at ./boot.jl:326 [inlined]
[2] include_relative(::Module, ::String) at ./loading.jl:1038
[3] include at ./sysimg.jl:29 [inlined]
[4] include(::String) at /home/dicbro/.julia/packages/GZip/LD2ly/src/GZip.jl:2
[5] top-level scope at none:0
[6] include at ./boot.jl:326 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1038
[8] include(::Module, ::String) at ./sysimg.jl:29
[9] top-level scope at none:2
[10] eval at ./boot.jl:328 [inlined]
[11] eval(::Expr) at ./client.jl:404
[12] top-level scope at ./none:3
in expression starting at /home/dicbro/.julia/packages/GZip/LD2ly/src/zlib_h.jl:11
in expression starting at /home/dicbro/.julia/packages/GZip/LD2ly/src/GZip.jl:73
ERROR: LoadError: LoadError: LoadError: Failed to precompile GZip [92fee26a-97fe-5a0c-ad85-20a5f3185b63] to /home/dicbro/.julia/compiled/v1.1/GZip/s2LKY.ji.
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
[3] _require(::Base.PkgId) at ./loading.jl:960
[4] require(::Base.PkgId) at ./loading.jl:858
[5] require(::Module, ::Symbol) at ./loading.jl:853
[6] include at ./boot.jl:326 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1038
[8] include at ./sysimg.jl:29 [inlined]
[9] include(::String) at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/MNIST.jl:26
[10] top-level scope at none:0
[11] include at ./boot.jl:326 [inlined]
[12] include_relative(::Module, ::String) at ./loading.jl:1038
[13] include at ./sysimg.jl:29 [inlined]
[14] include(::String) at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MLDatasets.jl:1
[15] top-level scope at none:0
[16] include at ./boot.jl:326 [inlined]
[17] include_relative(::Module, ::String) at ./loading.jl:1038
[18] include(::Module, ::String) at ./sysimg.jl:29
[19] top-level scope at none:2
[20] eval at ./boot.jl:328 [inlined]
[21] eval(::Expr) at ./client.jl:404
[22] top-level scope at ./none:3
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/Reader/Reader.jl:2
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MNIST/MNIST.jl:70
in expression starting at /home/dicbro/.julia/packages/MLDatasets/yNB45/src/MLDatasets.jl:45
ERROR: Failed to precompile MLDatasets [eb30cadb-4394-5ae3-aed4-317e484a6458] to /home/dicbro/.julia/compiled/v1.1/MLDatasets/9CUQK.ji.
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
[3] _require(::Base.PkgId) at ./loading.jl:960
[4] require(::Base.PkgId) at ./loading.jl:858
[5] require(::Module, ::Symbol) at ./loading.jl:853