mempool.jl's People

Contributors

andreasnoack, drchainsaw, jameswrigley, jeffbezanson, joshday, jpsamaroo, krynju, maximilianjhuber, michaelhatherly, shashi, tanmaykm

mempool.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

loadndsparse (JuliaDB) fails

See also https://discourse.julialang.org/t/juliadb-loading-data/26281/13
Pkg.status yields:

[336ed68f] CSV v0.5.8
  [a93c6f00] DataFrames v0.18.4
  [a93385a2] JuliaDB v0.12.0+ #master (https://github.com/JuliaComputing/JuliaDB.jl.git)
  [f9f48841] MemPool v0.2.0+ #master (https://github.com/JuliaComputing/MemPool.jl.git)
  [bd369af6] Tables v0.2.8
  [e0df1984] TextParse v0.9.1

The code I ran is this:

using Pkg 
#Pkg.@pkg_str("add MemPool#master")
using JuliaDB
using CSV
#using Tables
using DelimitedFiles
using TextParse
using DataFrames

Pkg.status()  # script equivalent of ]st in the Pkg REPL

fileToBeRead="C:\\temp\\test0.csv"
bindir="c:\\temp\\bindata"

mt=rand(5_000,5);
df=DataFrame(mt)
df[:,3]=Int.(trunc.(Int,100*mt[:,3]));
df[:,4]=Int.(trunc.(Int,10000*mt[:,4]));

isfile(fileToBeRead)&&rm(fileToBeRead)
CSV.write(fileToBeRead,df)

#read file with CSV
df_read=CSV.read(fileToBeRead,types=[Float64,Float64,Int64,Int64,Float64]);
sum(df_read[1])
@assert Int==eltype(df_read[3]) #ok

#read file with JuliaDB
@time csvfiles = glob(fileToBeRead);

!isdir(bindir) && mkdir(bindir)
@time loadndsparse(csvfiles, output=bindir,
    header_exists=true,
    chunks=80,
    colparsers=Dict(1=>Float64, 2=>Float64, 3=>Int64,4=>Int64,5=>Float64),
    datacols=[1,2,3,4,5])
    

ERROR: UndefRefError: access to undefined reference
getproperty(::Any, ::Symbol) at .\sysimg.jl:18
get_wrkrips() at C:\Users\bernhard.konig\.julia\packages\MemPool\PUncN\src\datastore.jl:65
run_work_thunk(::typeof(MemPool.get_wrkrips), ::Bool) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\process_messages.jl:56
#remotecall_fetch#148(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.LocalProcess) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:364
remotecall_fetch(::Function, ::Distributed.LocalProcess) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:364
#remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:406
remotecall_fetch at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:406 [inlined]
get_workers_at(::Sockets.IPv4) at C:\Users\bernhard.konig\.julia\packages\MemPool\PUncN\src\datastore.jl:95
affinity(::MemPool.FileRef) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\chunks.jl:84
affinity(::Dagger.Chunk{Any,MemPool.FileRef}) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\chunks.jl:50
affinity(::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\thunk.jl:52
pop_with_affinity!(::Dagger.Context, ::Array{Dagger.Thunk,1}, ::Dagger.OSProc, ::Bool) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\scheduler.jl:97
compute_dag(::Dagger.Context, ::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\scheduler.jl:36
compute(::Dagger.Context, ::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\compute.jl:25
#fromchunks#47(::Nothing, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Array{Dagger.Thunk,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\table.jl:148
fromchunks(::Array{Dagger.Thunk,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\table.jl:129
offset_index!(::JuliaDB.DNDSparse{Tuple{Int64},NamedTuple{(:x1, :x2, :x3, :x4, :x5),Tuple{Float64,Float64,Int64,Int64,Float64}}}, ::Int64) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:28
#_loadtable#188(::Int64, ::String, ::Bool, ::Array{Any,1}, ::Bool, ::Bool, ::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:header_exists, :colparsers, :datacols),Tuple{Bool,Dict{Int64,DataType},Array{Int64,1}}}}, ::Function, ::Type, ::Array{String,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:153
#_loadtable at .\none:0 [inlined]
#loadndsparse#187 at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:82 [inlined]
(::getfield(JuliaDB, Symbol("#kw##loadndsparse")))(::NamedTuple{(:output, :header_exists, :chunks, :colparsers, :datacols),Tuple{String,Bool,Int64,Dict{Int64,DataType},Array{Int64,1}}}, ::typeof(loadndsparse), ::Array{String,1}) at .\none:0
top-level scope at util.jl:156
eval(::Module, ::Any) at .\boot.jl:319
eval_user_input(::Any, ::REPL.REPLBackend) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\REPL\src\REPL.jl:85
macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\REPL\src\REPL.jl:117 [inlined]
(::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})() at .\task.jl:259
Stacktrace:
 [1] #remotecall_fetch#148(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.LocalProcess) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:365
 [2] remotecall_fetch(::Function, ::Distributed.LocalProcess) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:364
 [3] #remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:406
 [4] remotecall_fetch at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:406 [inlined]
 [5] get_workers_at(::Sockets.IPv4) at C:\Users\bernhard.konig\.julia\packages\MemPool\PUncN\src\datastore.jl:95
 [6] affinity(::MemPool.FileRef) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\chunks.jl:84
 [7] affinity(::Dagger.Chunk{Any,MemPool.FileRef}) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\chunks.jl:50
 [8] affinity(::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\thunk.jl:52
 [9] pop_with_affinity!(::Dagger.Context, ::Array{Dagger.Thunk,1}, ::Dagger.OSProc, ::Bool) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\scheduler.jl:97
 [10] compute_dag(::Dagger.Context, ::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\scheduler.jl:36
 [11] compute(::Dagger.Context, ::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\compute.jl:25
 [12] #fromchunks#47(::Nothing, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Array{Dagger.Thunk,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\table.jl:148
 [13] fromchunks(::Array{Dagger.Thunk,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\table.jl:129
 [14] offset_index!(::JuliaDB.DNDSparse{Tuple{Int64},NamedTuple{(:x1, :x2, :x3, :x4, :x5),Tuple{Float64,Float64,Int64,Int64,Float64}}}, ::Int64) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:28
 [15] #_loadtable#188(::Int64, ::String, ::Bool, ::Array{Any,1}, ::Bool, ::Bool, ::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:header_exists, :colparsers, :datacols),Tuple{Bool,Dict{Int64,DataType},Array{Int64,1}}}}, ::Function, ::Type, ::Array{String,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:153
 [16] #_loadtable at .\none:0 [inlined]
 [17] #loadndsparse#187 at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:82 [inlined]
 [18] (::getfield(JuliaDB, Symbol("#kw##loadndsparse")))(::NamedTuple{(:output, :header_exists, :chunks, :colparsers, :datacols),Tuple{String,Bool,Int64,Dict{Int64,DataType},Array{Int64,1}}}, ::typeof(loadndsparse), ::Array{String,1}) at .\none:0
 [19] top-level scope at util.jl:156

Indefinite number of fields error with some Union types

Simplest reproducer (initially surfaced when passing some NCASubjects through Dagger) I can come up with:

julia> VERSION
v"1.5.3-pre.0"

(Sandbox) pkg> st MemPool
Project Sandbox v0.1.0
Status `~/code/julia/Sandbox/Project.toml`
  [f9f48841] MemPool v0.3.2

julia> using MemPool

julia> poolset(Union{Nothing,Vector{Float64}}[nothing])
ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
 [1] fieldcount(::Any) at ./reflection.jl:703
 [2] fixedlength(::Type{T} where T, ::IdDict{Any,Any}) at /home/mike/.julia/packages/MemPool/F33TL/src/io.jl:158
 [3] fixedlength(::Type{T} where T) at /home/mike/.julia/packages/MemPool/F33TL/src/io.jl:148
 [4] approx_size(::Type{T} where T, ::Int64, ::Array{Union{Nothing, Array{Float64,1}},1}) at /home/mike/.julia/packages/MemPool/F33TL/src/MemPool.jl:79
 [5] approx_size(::Array{Union{Nothing, Array{Float64,1}},1}) at /home/mike/.julia/packages/MemPool/F33TL/src/MemPool.jl:75
 [6] poolset(::Any, ::Int64) at /home/mike/.julia/packages/MemPool/F33TL/src/datastore.jl:116 (repeats 2 times)
 [7] top-level scope at REPL[6]:1

Doesn't show up when it's a union of bitstypes like Union{Nothing,Float64}, so I'd guess it's got something to do with https://github.com/JuliaComputing/MemPool.jl/blob/a279024c8c6a3dba246b2303808f5fdd6bbe247a/src/io.jl#L150 perhaps?
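
A minimal sketch of the distinction described above, using only Base. The helper name safe_fieldcount is hypothetical and is not part of MemPool; it only illustrates guarding a fieldcount call so that a Union never reaches it, and makes no claim about how the package fixed the issue.

```julia
# `safe_fieldcount` is a hypothetical helper, not part of MemPool.
# It calls `fieldcount` only on concrete struct types and returns `nothing`
# for Unions, UnionAlls and abstract types, where the count is not definite.
function safe_fieldcount(T::Type)
    if T isa DataType && isconcretetype(T)
        return fieldcount(T)
    end
    return nothing
end

U_bits = Union{Nothing,Float64}             # union of bits types
U_nonbits = Union{Nothing,Vector{Float64}}  # union containing an array type

println(Base.isbitsunion(U_bits))           # members can be stored inline
println(Base.isbitsunion(U_nonbits))        # members cannot be stored inline
println(safe_fieldcount(U_nonbits))         # no exception; the helper returns `nothing`
println(safe_fieldcount(Complex{Float64}))  # a concrete struct with fields `re` and `im`
```

Base.isbitsunion is an unexported Base function, so relying on it is itself an assumption that may need revisiting across Julia versions.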

Add mmread and mmwrite for Union{Missing,T} eltypes

#22 should make MemPool work for Vector{Union{Missing,T}}, at least in some cases. However, it will use the slow fallback serialization, so eventually we should add faster serialization of Vector{Union{Missing,T}}, since such vectors will probably be common.
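
One possible shape for such a fast path, as a rough sketch only: write the length, a byte mask marking the missing slots, and a dense payload vector. The names write_with_mask and read_with_mask are hypothetical and are not MemPool's mmwrite/mmread; the sketch also assumes the non-missing element type is a bits type.

```julia
# Hypothetical helpers, not MemPool API. Layout: Int64 length, UInt8 mask, raw payload.
function write_with_mask(io::IO, v::AbstractVector)
    T = nonmissingtype(eltype(v))
    isbitstype(T) || throw(ArgumentError("payload type $T is not a bits type"))
    n = length(v)
    mask = Vector{UInt8}(undef, n)
    vals = Vector{T}(undef, n)   # slots for missing entries keep arbitrary bytes
    for (i, x) in enumerate(v)
        if x === missing
            mask[i] = 0x01
        else
            mask[i] = 0x00
            vals[i] = x
        end
    end
    write(io, Int64(n))
    write(io, mask)
    write(io, vals)
    return nothing
end

function read_with_mask(io::IO, ::Type{T}) where {T}
    n = read(io, Int64)
    mask = read!(io, Vector{UInt8}(undef, n))
    vals = read!(io, Vector{T}(undef, n))
    # Rebuild the Union-typed vector, restoring `missing` where the mask is set.
    return Union{Missing,T}[mask[i] == 0x01 ? missing : vals[i] for i in 1:n]
end

# Round trip through an in-memory buffer.
v = Union{Missing,Float64}[1.0, missing, 3.5]
io = IOBuffer()
write_with_mask(io, v)
seekstart(io)
w = read_with_mask(io, Float64)
println(isequal(v, w))
```

A real implementation would probably memory-map the payload rather than copy it, and would zero the unused slots so that no stale memory is written to disk.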

movetodisk only moves the data once

Maybe I just misunderstand the intended usage, since the code makes it quite clear that this is deliberate:

julia> r = MemPool.poolset(12);

julia> isnothing(MemPool.datastore[r.id].data)
false

julia> MemPool.movetodisk(r);

julia> isnothing(MemPool.datastore[r.id].data) # Data is moved the first time
true

julia> MemPool.poolget(r)
12

julia> MemPool.movetodisk(r);

julia> isnothing(MemPool.datastore[r.id].data) # Data is not moved the second time (file still exists)
false

julia> MemPool.pooldelete(MemPool.movetodisk(r)) # This deletes the file, but keeps r in the pool so that movetodisk does its thing again. Intended usage?

julia> MemPool.movetodisk(r);

julia> isnothing(MemPool.datastore[r.id].data)
true

julia> MemPool.poolget(r)
12

JuliaDB setcol yields MemPool errors

I have a table:

Distributed Table with 169678550 rows in 34 chunks:
Columns:
#  colname  type
────────────────────────────────────────────────
1  nr     Int32
2  str    String

The str column takes only 23 distinct values, and I have a Dict that translates them into Int8:

setcol(t, :str, :str=>c -> dictionary[c]);

yields:

On worker 2:
KeyError: key "" not found

but unique(collect(select(t, :str))) just shows the 23 different values and "" is not one of them. If I manually add "" to the Dict with some arbitrary value I get:

On worker 5:
ArgumentError: Reference array points beyond the end of the pool

JuliaDB columns with Union{Missing, T}

I've run into a few places where Union types are not handled (one is fixed in #20).

Another is here (type Union has no field types): https://github.com/JuliaComputing/MemPool.jl/blob/feacabb3392b7b3dae059dc9d9f37f7a3a9f6c1a/src/io.jl#L188-L207

A lot of the logic with isbitstype needs to be combined with Base.isbitsunion (thanks to @JeffBezanson for helping me with that in #20).

EDIT: I came across the above example while trying to save a JuliaDB.NextTable with Union columns to disk.
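
To illustrate the isbitstype / Base.isbitsunion combination mentioned above, here is a small sketch using only Base. The predicate name stored_inline is hypothetical, and Base.isbitsunion and Base.uniontypes are unexported internals, so treat this as an assumption about current Julia rather than a stable API.

```julia
# Hypothetical predicate, not MemPool API: true when elements of type T can be
# stored inline in an array, either as a plain bits type or as a bits union.
stored_inline(T::Type) = isbitstype(T) || Base.isbitsunion(T)

println(stored_inline(Float64))                # plain bits type
println(stored_inline(Union{Missing,Int}))     # bits union
println(stored_inline(Union{Missing,String}))  # String is not a bits type

# A Union has no `types` field; its members can be listed with Base.uniontypes.
members = Base.uniontypes(Union{Missing,Int})
println(members)
```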

MemPool >= 0.3.7 causes errors on Julia < 1.4

GC.safepoint was added in Julia 1.4. a7f5703 adds a call to GC.safepoint(). Project.toml suggests this package should be compatible with Julia >= 1.0.

I'm not sure why the GC.safepoint() call was added. If it was to increase GC performance, I'd suggest just wrapping it with something like this:

@static if VERSION >= v"1.4"
  GC.safepoint()
end

Otherwise, perhaps Project.toml should be updated to indicate compatibility with only Julia >= 1.4.

(btw, I discovered this working on fixing tests in another package that should support older Julia, I'm not using Julia < 1.4 in production).

Add mechanism to generate DRef from file

At the moment, creating a new DRef requires the associated data to be in memory. This makes it annoying to persist datasets and be able to easily re-load them in a future session. We should provide a means to generate a new DRef from a file stored on a device, and provide some high-level API (maybe another kwarg to poolset) to expose this.
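
As a rough illustration of the idea, and not a proposal for the actual API, the sketch below uses only the Serialization standard library. FileBackedRef, persist and materialize are hypothetical names; a real implementation would return a DRef and integrate with poolset.

```julia
using Serialization

# Hypothetical stand-in for a reference whose data lives in a file, not in memory.
struct FileBackedRef
    path::String
end

# Write `x` to `path` and hand back a reference that only remembers the path.
function persist(x, path::AbstractString)
    serialize(path, x)
    return FileBackedRef(String(path))
end

# Load the data on demand; nothing has to be resident before this call.
materialize(r::FileBackedRef) = deserialize(r.path)

path = tempname()
persist([1, 2, 3], path)

# In a later session, only the path is needed to recreate the reference.
r = FileBackedRef(path)
println(materialize(r))
```

Note that Julia's serialization format is not guaranteed to be stable across Julia versions, which matters for datasets meant to outlive a session.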

UndefVarError: T not defined

When I load a table that I saved with an old version of JuliaDB using master versions of JuliaDB, MemPool, IndexedTables and Julia 1.0.3:

UndefVarError: T not defined

Stacktrace:
 [1] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{StructArrays.StructArray{T,N,C} where C<:NamedTuple where N}}) at C:\Users\Max\.julia\packages\MemPool\tlPqB\src\io.jl:27
 [2] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
 [3] mmread(::Type{IndexedTable}, ::Serialization.Serializer{IOStream}, ::Bool) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
 [4] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{IndexedTable}}) at C:\Users\Max\.julia\packages\MemPool\tlPqB\src\io.jl:27
 [5] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
 [6] deserialize(::Serialization.Serializer{IOStream}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
 [7] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:801
 [8] deserialize(::Serialization.Serializer{IOStream}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
 [9] deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:708 [inlined]
 [10] #open#294(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::typeof(Serialization.deserialize), ::String) at .\iostream.jl:369
 [11] open at .\iostream.jl:367 [inlined]
 [12] load(::String) at C:\Users\Max\.julia\packages\JuliaDB\t5MGj\src\io.jl:181
 [13] top-level scope at In[1]:2

Any ideas?

Management of files

Maybe the first question to ask is whether this package is intended for people like me who only wish to use the disk as an extension of RAM.

Anyway, it seems that the files created need to be deleted manually (e.g. by calling pooldelete) when they are no longer needed. Is that right?

If I want files to be deleted when the DRef is no longer used, is it correct to wrap the DRef in a Ref and register a finalizer for that Ref (while ofc making sure all other parts of the program only interact with the referenced data through some struct wrapping the Ref)?

It seems to work in casual testing but I'm afraid there might be hidden pitfalls with it when let loose in the wild.
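
For reference, the pattern described above looks roughly like this when reduced to Base only. OwnedFile is a hypothetical wrapper; in the real use case the finalizer would call pooldelete on the wrapped DRef instead of rm.

```julia
# Hypothetical wrapper: a mutable object that owns a file and removes it
# when the object is finalized (by the GC or by an explicit `finalize`).
mutable struct OwnedFile
    path::String
    function OwnedFile(path::AbstractString)
        handle = new(String(path))
        # `force=true` keeps the finalizer from throwing if the file is already gone.
        finalizer(h -> rm(h.path; force=true), handle)
        return handle
    end
end

path = tempname()
write(path, "payload")
h = OwnedFile(path)
println(isfile(path))   # the file exists while the handle is alive

finalize(h)             # run the finalizer now instead of waiting for the GC
println(isfile(path))   # the file has been removed
```

Two pitfalls worth keeping in mind: the GC gives no guarantee about when (or, at process exit, in which order) finalizers run, and finalizers should avoid work that can yield to other tasks.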

README still mentions LRU functionality, which is disabled

The "Usage" section in the README starts off talking about MemPool's LRU functionality, however such functionality was disabled some months ago. We should either remove this text or re-enable the LRU; which of these is the more desirable option?

Caching error on linux when removing cache files

Appears sometimes when process exits

IOError: unlink("/home/krynju/.mempool/sess-utvz1V-1/h2x1LD/jl_N2bctMjqbi"): no such file or directory (ENOENT)
Stacktrace:
 [1] uv_error
   @ ./libuv.jl:97 [inlined]
 [2] unlink(p::String)
   @ Base.Filesystem ./file.jl:972
 [3] rm(path::String; force::Bool, recursive::Bool)
   @ Base.Filesystem ./file.jl:283
 [4] rm(path::String; force::Bool, recursive::Bool) (repeats 2 times)
   @ Base.Filesystem ./file.jl:294
 [5] (::MemPool.var"#203#206"{Int64})()
   @ MemPool ~/.julia/packages/MemPool/Ggdm4/src/MemPool.jl:163
 [6] _atexit()
   @ Base ./initdefs.jl:372

MethodError

I try to load a table that I saved with a previous version of JuliaDB. It contains a column that is of type Int8. I get:

MethodError: no method matching PooledArrays.PooledArray(::PooledArrays.RefArray{Array{UInt8,1}}, ::Array{Int8,1})
[1] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{PooledArrays.PooledArray}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[2] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[3] collect_to!(::Array{Array{T,1} where T,1}, ::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}, ::Int64, ::Int64) at .\array.jl:508
[4] collect_to!(::Array{Array{Int32,1},1}, ::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}, ::Int64, ::Int64) at .\array.jl:518
[5] collect(::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}) at .\array.jl:476
[6] mmread(::Type{IndexedTables.Columns}, ::SerializationState{IOStream}, ::Bool) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\serialize.jl:57
[7] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{IndexedTables.Columns}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[8] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[9] mmread(::Type{IndexedTables.NextTable}, ::SerializationState{IOStream}, ::Bool) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\serialize.jl:86
[10] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{IndexedTables.NextTable}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[11] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[12] open(::Base.Serializer.#deserialize, ::String) at .\iostream.jl:152
[13] load(::String) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\io.jl:185

I am using Julia 0.6.2 and:

- MemPool                       0.0.11
- JuliaDB                       0.8.4

Very slow approx_size for DataFrames

When benchmarking a parallel application that uses Dagger, MemPool.approx_size appears to be the bottleneck because it falls back to Base.summarysize.

Here is a quick MWE:

julia>  using BenchmarkTools, DataFrames, MemPool

julia> df = DataFrame(a=1:1000_000, b=randn(1000_000), c=repeat([:aa], 1000_000));

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  61.03 MiB
  allocs estimate:  1999540
  --------------
  minimum time:     110.895 ms (4.59% GC)
  median time:      119.604 ms (2.47% GC)
  mean time:        122.978 ms (2.83% GC)
  maximum time:     146.009 ms (1.46% GC)
  --------------
  samples:          41
  evals/sample:     1

Here is a sketch of an alternative implementation which is much faster:

julia> function MemPool.approx_size(df::DataFrame)
       dsize = mapreduce(MemPool.approx_size, +, eachcol(df))
       namesize = mapreduce(MemPool.approx_size, +, names(df))
       return dsize + namesize
       end

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  704 bytes
  allocs estimate:  13
  --------------
  minimum time:     535.700 μs (0.00% GC)
  median time:      636.800 μs (0.00% GC)
  mean time:        664.967 μs (0.00% GC)
  maximum time:     1.525 ms (0.00% GC)
  --------------
  samples:          7499
  evals/sample:     1

The above implementation is not 100% correct, but I hope it shows that there is some potential for improvement.

I don't know whether there is an interface, e.g. Tables.jl, that could be used to avoid a dependency on DataFrames.

0.3.1 error on loading a saved JuliaDB table

0.2.0 works just fine...

julia> JuliaDB.load(loaction);
ERROR: MethodError: no method matching getindex(::Symbol, ::Int64)
Stacktrace:
 [1] mmread(::Type{Array{String,1}}, ::Serialization.Serializer{IOStream}, ::Bool) at /home/user/.julia/packages/MemPool/ZLU0k/src/io.jl:135
 [2] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{Array{String,1}}}) at /home/user/.julia/packages/MemPool/ZLU0k/src/io.jl:29
 [3] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:799
 [4] deserialize(::Serialization.Serializer{IOStream}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:735
 [5] #30 at ./none:0 [inlined]
 [6] iterate at ./generator.jl:47 [inlined]
 [7] collect_to!(::Array{Array{Base.UUID,1},1}, ::Base.Generator{UnitRange{Int64},JuliaDB.var"#30#32"{Serialization.Serializer{IOStream}}}, ::Int64, ::Int64) at ./array.jl:711
 [8] collect_to_with_first!(::Array{Array{Base.UUID,1},1}, ::Array{Base.UUID,1}, ::Base.Generator{UnitRange{Int64},JuliaDB.var"#30#32"{Serialization.Serializer{IOStream}}}, ::Int64) at ./array.jl:689
 [9] collect(::Base.Generator{UnitRange{Int64},JuliaDB.var"#30#32"{Serialization.Serializer{IOStream}}}) at ./array.jl:670
 [10] mmread(::Type{StructArrays.StructArray{T,1,C,I} where I where C<:Union{Tuple, NamedTuple} where T}, ::Serialization.Serializer{IOStream}, ::Bool) at /home/user/.julia/packages/JuliaDB/7cG1k/src/serialize.jl:53
 [11] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{StructArrays.StructArray{T,1,C,I} where I where C<:Union{Tuple, NamedTuple} where T}}) at /home/user/.julia/packages/MemPool/ZLU0k/src/io.jl:29
 [12] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:799
 [13] deserialize(::Serialization.Serializer{IOStream}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:735
 [14] mmread(::Type{IndexedTable}, ::Serialization.Serializer{IOStream}, ::Bool) at /home/user/.julia/packages/JuliaDB/7cG1k/src/serialize.jl:84
 [15] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{IndexedTable}}) at /home/user/.julia/packages/MemPool/ZLU0k/src/io.jl:29
 [16] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:799
 [17] deserialize(::Serialization.Serializer{IOStream}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:735
 [18] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:838
 [19] deserialize(::Serialization.Serializer{IOStream}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:735
 [20] deserialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:722 [inlined]
 [21] open(::typeof(Serialization.deserialize), ::String; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:298
 [22] open at ./io.jl:296 [inlined]
 [23] load(::String; procs::Array{Int64,1}) at /home/user/.julia/packages/JuliaDB/7cG1k/src/io.jl:182
 [24] load at /home/user/.julia/packages/JuliaDB/7cG1k/src/io.jl:174 [inlined]

Fix double-writing

Example situation:

  • A 1GB table with 10 columns is created
  • An operation creates a different object with the same 10 columns (e.g. rows(t)). MemPool thinks that this operation needs to free up 1 GB of space and starts evicting objects to disk

The problem is that MemPool is not accounting for the vectors it writes to disk.

Based on @tanmaykm's suggested fix:

  • designate every vector with an ID when it gets written to wire or disk using MemPool
  • keep a shared dictionary which maintains a ref-count of each vector using its ID.
  • when writing a vector to disk to evict it from working memory, store the file and offset in the shared dictionary, point to offset and previous file name instead of writing the vector to the spilled object.
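
The three steps above can be sketched as a single-process registry. All names here (SpillLocation, VectorEntry, retain!, release!, spill!) are hypothetical, and objectid stands in for whatever ID scheme would actually be used; the cluster-wide shared dictionary is deliberately not addressed.

```julia
# Where a vector was written when it was evicted.
struct SpillLocation
    file::String
    offset::Int
end

# Ref-count plus the on-disk location, if the vector has been spilled already.
mutable struct VectorEntry
    refcount::Int
    location::Union{Nothing,SpillLocation}
end

# Register one more owner of `v` and return its ID.
function retain!(reg::Dict{UInt,VectorEntry}, v::AbstractVector)
    id = objectid(v)
    entry = get!(() -> VectorEntry(0, nothing), reg, id)
    entry.refcount += 1
    return id
end

# Drop one owner; forget the vector when nobody references it any more.
function release!(reg::Dict{UInt,VectorEntry}, id::UInt)
    entry = reg[id]
    entry.refcount -= 1
    entry.refcount == 0 && delete!(reg, id)
    return nothing
end

# Returns true if the caller must write the bytes, false if an earlier
# spill can be reused by pointing at the recorded file and offset.
function spill!(reg::Dict{UInt,VectorEntry}, id::UInt, file::String, offset::Int)
    entry = reg[id]
    entry.location === nothing || return false
    entry.location = SpillLocation(file, offset)
    return true
end

reg = Dict{UInt,VectorEntry}()
col = rand(3)
id = retain!(reg, col)          # the table owns the column
retain!(reg, col)               # rows(t) shares the same column
println(spill!(reg, id, "table.bin", 0))   # first eviction writes the data
println(spill!(reg, id, "rows.bin", 128))  # second eviction reuses it
```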

This has a few problems:

  • shared dictionary in a cluster is still not a thing
  • When only a vector is required from within a table, you still have to keep the whole file containing the table around. This involves pretty thorough bookkeeping. One solution is to write each vector into a separate file, but this can overwhelm a file system since it would increase the number of files. Another solution is to do manual page management.

Add data pinning

The API that poolget provides, while simple, is unfortunately harmful to swap-to-disk and related use cases, as it does not indicate when the returned data is no longer in use. In a sense, this is by design - Dagger does not know whether values returned by poolget escape via user code, and how long their lifetime might be, so it cannot reasonably assert that the data is no longer being accessed at some point in time.

However, we should not be forced to live without such guarantees, as their absence makes any improvement to the storage system of questionable benefit. Thankfully, with the advent of improved escape analysis in Julia's compiler, it should be possible for Dagger to sometimes determine the lifetime of a piece of data, and so communicate this information to MemPool. This will take the form of at least one new public "unpin" function, which will pair with either poolget or a new "pin" function, to assert the end of the returned value's lifetime. Once memory is fully unpinned, MemPool will be free to take aggressive steps to delete the data and free up memory, instead of just hoping that the GC will do a good job. Of course, not all accesses may have a known lifetime, so we'll need to allow such calls to disable pinning and revert to GC-managed deallocation.

This mechanism can be beneficial for any kind of data which can be explicitly deleted, such as GPU arrays, or another user-defined type with an available memory free function.
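
A toy model of the proposed pairing, to make the intended semantics concrete. PinnedValue, pin! and unpin! are hypothetical names, none of this exists in MemPool as written here, and the "free" step is reduced to dropping the reference.

```julia
# Hypothetical container tracking how many callers currently hold the data.
mutable struct PinnedValue
    data::Any
    pins::Int
    freed::Bool
end
PinnedValue(data) = PinnedValue(data, 0, false)

# Hand out the data and record that one more caller is using it.
function pin!(p::PinnedValue)
    p.freed && error("value was already freed")
    p.pins += 1
    return p.data
end

# Assert the end of one caller's use; free eagerly when nobody is left.
function unpin!(p::PinnedValue)
    p.pins > 0 || error("unpin! called without a matching pin!")
    p.pins -= 1
    if p.pins == 0
        p.data = nothing   # a GPU array would call its explicit free here
        p.freed = true
    end
    return nothing
end

p = PinnedValue([1, 2, 3])
a = pin!(p)
b = pin!(p)
unpin!(p)
println(p.freed)   # one pin is still outstanding
unpin!(p)
println(p.freed)   # fully unpinned, so the data was released
```

Accesses with unknown lifetime would bypass this counter entirely and leave the value to the GC, as the text above describes.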

Some MemPool error when benchmarking dtables

Leaving this here as it's the first time I see this - it continued computing the result afterwards, but on one worker only
Super rare

@@@ STARTED:         innerjoin_r_unique : 2021-12-30T14:01:24.962
      From worker 2:    Error in enqueued work:
      From worker 2:    peer 4 didn't connect to 2 within 59.999990940093994 secondsError in enqueued work:
      From worker 2:    On worker 4:
      From worker 2:    KeyError: key (4, 12) not found
      From worker 2:    Stacktrace:
      From worker 2:      [1] getindex
      From worker 2:        @ ./dict.jl:498 [inlined]
      From worker 2:      [2] canfree
      From worker 2:        @ ~/.julia/packages/MemPool/wlrUg/src/datastore.jl:72
      From worker 2:      [3] #36
      From worker 2:        @ ~/.julia/packages/MemPool/wlrUg/src/datastore.jl:209
      From worker 2:      [4] macro expansion
      From worker 2:        @ ~/.julia/packages/MemPool/wlrUg/src/lock.jl:42 [inlined]
      From worker 2:      [5] with_datastore_lock
      From worker 2:        @ ~/.julia/packages/MemPool/wlrUg/src/datastore.jl:218
      From worker 2:      [6] pooltransfer_recv
      From worker 2:        @ ~/.julia/packages/MemPool/wlrUg/src/datastore.jl:201
      From worker 2:      [7] #invokelatest#2
      From worker 2:        @ ./essentials.jl:731
      From worker 2:      [8] invokelatest
      From worker 2:        @ ./essentials.jl:729
      From worker 2:      [9] #120
      From worker 2:        @ /usr/local/julia/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:294
      From worker 2:     [10] run_work_thunk
      From worker 2:        @ /usr/local/julia/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:63
      From worker 2:     [11] run_work_thunk
      From worker 2:        @ /usr/local/julia/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:72
      From worker 2:     [12] #106

LoadError after upgrade to 0.0.7

After upgrading to 0.0.7, loading JuliaDB (using JuliaDB) yields:

INFO: Recompiling stale cache file C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\lib\v0.6\MemPool.ji for module MemPool.
WARNING: Module Compat with uuid 367821964343373 is missing from the cache.
This may mean module Compat does not support precompilation but is imported by a module that does.
ERROR: LoadError: LoadError: Declaring __precompile__(false) is not allowed in files that are being precompiled.
Stacktrace:
 [1] _require(::Symbol) at .\loading.jl:455
 [2] require(::Symbol) at .\loading.jl:405
 [3] _include_from_serialized(::String) at .\loading.jl:157
 [4] _require_from_serialized(::Int64, ::Symbol, ::String, ::Bool) at .\loading.jl:200
 [5] _require_search_from_serialized(::Int64, ::Symbol, ::String, ::Bool) at .\loading.jl:236
 [6] _require(::Symbol) at .\loading.jl:441
 [7] require(::Symbol) at .\loading.jl:405
 [8] include_from_node1(::String) at .\loading.jl:576
 [9] include(::String) at .\sysimg.jl:14
 [10] include_from_node1(::String) at .\loading.jl:576
 [11] include(::String) at .\sysimg.jl:14
 [12] anonymous at .\<missing>:2
while loading C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\datastore.jl, in expression starting on line 235
while loading C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\MemPool.jl, in expression starting on line 50

Add transactions

At the moment, our storage layer (in StorageState) is non-transactional, and so it's not possible to ensure that a certain set of operations occur without other (incompatible or performance-degrading) operations occurring in-between. This can make it impossible for the SimpleRecencyAllocator (which provides our swap-to-disk functionality) to be able to provide any guarantees that memory/disk usage actually falls within user-provided limits.

The obvious solution here is to add "transactions", where the set of operations to be performed in sequence are batched up and submitted as a single logical operation. Any transaction in progress can exclude other operations or transactions from occurring on the same piece(s) of data until the current transaction completes. Additionally, for performance, we can implement some logic to "merge" compatible operations across multiple transactions (or within the same transaction) when the end result will be identical to running the operations exclusively.
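
A deliberately small sketch of the batching idea, assuming a single process and a plain lock. Transaction, add! and commit! are hypothetical names, and the Dict stands in for StorageState; merging compatible operations is left out.

```julia
# A batch of operations to run back-to-back on the storage state.
struct Transaction
    ops::Vector{Function}
end
Transaction() = Transaction(Function[])

# Queue one operation; each takes the state as its only argument.
add!(tx::Transaction, op::Function) = (push!(tx.ops, op); tx)

# Run the whole batch under the lock so no other operation can interleave.
function commit!(lk::ReentrantLock, state, tx::Transaction)
    lock(lk) do
        for op in tx.ops
            op(state)
        end
    end
    return state
end

state = Dict(:mem_used => 0, :disk_used => 0)
lk = ReentrantLock()

tx = Transaction()
add!(tx, s -> s[:mem_used] += 10)    # allocate a new object
add!(tx, s -> s[:mem_used] -= 4)     # evict an old one from memory...
add!(tx, s -> s[:disk_used] += 4)    # ...and account for it on disk
commit!(lk, state, tx)
println(state)
```

Because all three steps happen under one lock acquisition, an allocator checking limits never observes the intermediate state in which the evicted object is counted in neither place.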

@dbedi3311

Julia 1.0.3 crashes loading saved JuliaDB table with mempool error

I am not sure whether this issue belongs here. On Julia 1.0.3, JuliaDB 0.10, and MemPool 0.1.2 (or master) I saved a table:

Table with 429003 rows, 7 columns:
Columns:
#  colname  type
────────────────────────────────
1  name1    String
2  name2    String
3  id1      Int64
4  id2      Int64
5  date1    Union{Missing, Date}
6  date2    Union{Missing, Date}
7  date3    Union{Missing, Date}

and when I load it back in, Julia crashes with:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7fff76cf4300 -- memmove at C:\WINDOWS\System32\msvcrt.dll (unknown line)
in expression starting at no file:0
memmove at C:\WINDOWS\System32\msvcrt.dll (unknown line)
jl_pchar_to_string at /home/Administrator/buildbot/worker/package_win64/build/src\array.c:472
unsafe_string at .\strings\string.jl:53
unknown function (ip: 000000001276EBB0)
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
mmread at C:\Users\Max\.julia\packages\MemPool\Bw8DR\src\io.jl:135
deserialize at C:\Users\Max\.julia\packages\MemPool\Bw8DR\src\io.jl:27
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
handle_deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711 [inlined]
#30 at .\none:0 [inlined]
iterate at .\generator.jl:47 [inlined]
collect at .\array.jl:619
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
mmread at C:\Users\Max\.julia\packages\JuliaDB\R4e6y\src\serialize.jl:53
deserialize at C:\Users\Max\.julia\packages\MemPool\Bw8DR\src\io.jl:27
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
handle_deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
mmread at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
deserialize at C:\Users\Max\.julia\packages\MemPool\Bw8DR\src\io.jl:27
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
handle_deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
handle_deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:801
deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184 [inlined]
jl_apply at /home/Administrator/buildbot/worker/package_win64/build/src\julia.h:1537 [inlined]
jl_invoke at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:56
deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:708 [inlined]
#open#294 at .\iostream.jl:369
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184 [inlined]
jl_apply at /home/Administrator/buildbot/worker/package_win64/build/src\julia.h:1537 [inlined]
jl_invoke at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:56
open at .\iostream.jl:367 [inlined]
load at C:\Users\Max\.julia\packages\JuliaDB\R4e6y\src\io.jl:181
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
do_call at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:324
eval_value at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:430
eval_stmt_value at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:363 [inlined]
eval_body at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:678
jl_interpret_toplevel_thunk_callback at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:806
unknown function (ip: FFFFFFFFFFFFFFFE)
unknown function (ip: 0000000008BB720F)
unknown function (ip: FFFFFFFFFFFFFFFF)
jl_toplevel_eval_flex at /home/Administrator/buildbot/worker/package_win64/build/src\toplevel.c:805
jl_toplevel_eval_in at /home/Administrator/buildbot/worker/package_win64/build/src\builtins.c:622
eval at .\boot.jl:319
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
eval_user_input at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\REPL\src\REPL.jl:85
macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\REPL\src\REPL.jl:117 [inlined]
#28 at .\task.jl:259
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
jl_apply at /home/Administrator/buildbot/worker/package_win64/build/src\julia.h:1537 [inlined]
start_task at /home/Administrator/buildbot/worker/package_win64/build/src\task.c:268
Allocations: 38776166 (Pool: 38767273; Big: 8893); GC: 84

Removing non-default directory during `exit_hook`

rm(default_dir(); recursive=true)

In exit_hook, the default_dir() is removed. What if another directory is used, e.g., by passing diskpath to DiskCacheConfig, which in turn is passed to setup_global_device!? Does that custom diskpath remain, i.e., does it need to be deleted manually? And is this a bug or a feature?
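
Until that is answered, a user who needs a custom directory cleaned up could register their own hook. The sketch below is only a possible workaround and makes no claim about what MemPool does with a custom diskpath. remove_cache_dir is a hypothetical helper, and custom_dir stands in for whatever path was passed as diskpath.

```julia
# Hypothetical helper, not part of MemPool: delete a cache directory if present.
# Returns true when the directory no longer exists afterwards.
function remove_cache_dir(path::AbstractString)
    isdir(path) && rm(path; recursive = true, force = true)
    return !isdir(path)
end

custom_dir = mktempdir()                      # stand-in for a user-chosen diskpath
touch(joinpath(custom_dir, "chunk.bin"))      # pretend something was swapped to disk

# Register the cleanup so it runs when the Julia process exits.
atexit(() -> remove_cache_dir(custom_dir))
```

The helper is safe to call more than once, so it does no harm if the directory has already been removed by the time the hook runs.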

`deleteat!` fails in `sra_migrate!`

foreach(idx->deleteat!(from_refs, idx), reverse(to_delete))

The call to reverse above assumes to_delete is in ascending order, which isn't necessarily the case. I didn't test thoroughly, so I don't know which combinations of to_mem and sra.policy cause an error, but I got an error about trying to delete an index that didn't exist when using an LRU policy (I don't know whether to_mem was true or not). The problem did not occur when I switched to an MRU policy. Sorry I don't have an MWE, though I can reliably reproduce the issue.

My problem went away when I replaced reverse(to_delete) with sort(to_delete; rev = true).
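
The failure mode can be reproduced outside MemPool with plain vectors. The sketch below uses made-up data and a hypothetical helper, delete_each!, that mirrors the one-index-at-a-time deletion in the snippet above. It is an illustration of the ordering problem, not the actual sra_migrate! code.

```julia
# Delete one index at a time, mirroring the foreach/deleteat! pattern.
delete_each!(v, idxs) = (foreach(i -> deleteat!(v, i), idxs); v)

to_delete = [4, 2, 5]   # not in ascending order

# reverse only yields descending order if the input was ascending;
# here it gives [5, 2, 4], so a later index is out of bounds by the
# time it is reached.
failed = try
    delete_each!([10, 20, 30, 40, 50], reverse(to_delete))
    false
catch err
    err isa BoundsError
end

# Sorting in descending order always removes the highest index first,
# so earlier deletions never shift the positions still to be deleted.
fixed = delete_each!([10, 20, 30, 40, 50], sort(to_delete; rev = true))

# deleteat! also accepts a sorted, unique collection of indices in one call.
batched = deleteat!([10, 20, 30, 40, 50], sort(to_delete))
```

Both the descending one-at-a-time deletion and the single batched call leave the elements that were at indices 1 and 3.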
