juliadata / mempool.jl
High-performance parallel and distributed datastore for Julia
License: Other
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
see also https://discourse.julialang.org/t/juliadb-loading-data/26281/13
Pkg.status yields
[336ed68f] CSV v0.5.8
[a93c6f00] DataFrames v0.18.4
[a93385a2] JuliaDB v0.12.0+ #master (https://github.com/JuliaComputing/JuliaDB.jl.git)
[f9f48841] MemPool v0.2.0+ #master (https://github.com/JuliaComputing/MemPool.jl.git)
[bd369af6] Tables v0.2.8
[e0df1984] TextParse v0.9.1
the code I ran is this
using Pkg
#Pkg.@pkg_str("add MemPool#master")
using JuliaDB
using CSV
#using Tables
using DelimitedFiles
using TextParse
using DataFrames
# ]st   (Pkg REPL mode command; its output is the Pkg.status listing above)
fileToBeRead="C:\\temp\\test0.csv"
bindir="c:\\temp\\bindata"
mt=rand(5_000,5);
df=DataFrame(mt)
df[:,3]=Int.(trunc.(Int,100*mt[:,3]));
df[:,4]=Int.(trunc.(Int,10000*mt[:,4]));
isfile(fileToBeRead)&&rm(fileToBeRead)
CSV.write(fileToBeRead,df)
#read file with CSV
df_read=CSV.read(fileToBeRead,types=[Float64,Float64,Int64,Int64,Float64]);
sum(df_read[1])
@assert Int==eltype(df_read[3]) #ok
#read file with JuliaDB
@time csvfiles = glob(fileToBeRead);
!isdir(bindir) && mkdir(bindir)
@time loadndsparse(csvfiles, output=bindir,
header_exists=true,
chunks=80,
colparsers=Dict(1=>Float64, 2=>Float64, 3=>Int64,4=>Int64,5=>Float64),
datacols=[1,2,3,4,5])
ERROR: UndefRefError: access to undefined reference
getproperty(::Any, ::Symbol) at .\sysimg.jl:18
get_wrkrips() at C:\Users\bernhard.konig\.julia\packages\MemPool\PUncN\src\datastore.jl:65
run_work_thunk(::typeof(MemPool.get_wrkrips), ::Bool) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\process_messages.jl:56
#remotecall_fetch#148(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.LocalProcess) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:364
remotecall_fetch(::Function, ::Distributed.LocalProcess) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:364
#remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:406
remotecall_fetch at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:406 [inlined]
get_workers_at(::Sockets.IPv4) at C:\Users\bernhard.konig\.julia\packages\MemPool\PUncN\src\datastore.jl:95
affinity(::MemPool.FileRef) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\chunks.jl:84
affinity(::Dagger.Chunk{Any,MemPool.FileRef}) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\chunks.jl:50
affinity(::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\thunk.jl:52
pop_with_affinity!(::Dagger.Context, ::Array{Dagger.Thunk,1}, ::Dagger.OSProc, ::Bool) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\scheduler.jl:97
compute_dag(::Dagger.Context, ::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\scheduler.jl:36
compute(::Dagger.Context, ::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\compute.jl:25
#fromchunks#47(::Nothing, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Array{Dagger.Thunk,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\table.jl:148
fromchunks(::Array{Dagger.Thunk,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\table.jl:129
offset_index!(::JuliaDB.DNDSparse{Tuple{Int64},NamedTuple{(:x1, :x2, :x3, :x4, :x5),Tuple{Float64,Float64,Int64,Int64,Float64}}}, ::Int64) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:28
#_loadtable#188(::Int64, ::String, ::Bool, ::Array{Any,1}, ::Bool, ::Bool, ::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:header_exists, :colparsers, :datacols),Tuple{Bool,Dict{Int64,DataType},Array{Int64,1}}}}, ::Function, ::Type, ::Array{String,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:153
#_loadtable at .\none:0 [inlined]
#loadndsparse#187 at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:82 [inlined]
(::getfield(JuliaDB, Symbol("#kw##loadndsparse")))(::NamedTuple{(:output, :header_exists, :chunks, :colparsers, :datacols),Tuple{String,Bool,Int64,Dict{Int64,DataType},Array{Int64,1}}}, ::typeof(loadndsparse), ::Array{String,1}) at .\none:0
top-level scope at util.jl:156
eval(::Module, ::Any) at .\boot.jl:319
eval_user_input(::Any, ::REPL.REPLBackend) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\REPL\src\REPL.jl:85
macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\REPL\src\REPL.jl:117 [inlined]
(::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})() at .\task.jl:259
Stacktrace:
[1] #remotecall_fetch#148(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.LocalProcess) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:365
[2] remotecall_fetch(::Function, ::Distributed.LocalProcess) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:364
[3] #remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:406
[4] remotecall_fetch at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:406 [inlined]
[5] get_workers_at(::Sockets.IPv4) at C:\Users\bernhard.konig\.julia\packages\MemPool\PUncN\src\datastore.jl:95
[6] affinity(::MemPool.FileRef) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\chunks.jl:84
[7] affinity(::Dagger.Chunk{Any,MemPool.FileRef}) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\chunks.jl:50
[8] affinity(::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\thunk.jl:52
[9] pop_with_affinity!(::Dagger.Context, ::Array{Dagger.Thunk,1}, ::Dagger.OSProc, ::Bool) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\scheduler.jl:97
[10] compute_dag(::Dagger.Context, ::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\scheduler.jl:36
[11] compute(::Dagger.Context, ::Dagger.Thunk) at C:\Users\bernhard.konig\.julia\packages\Dagger\sdZXi\src\compute.jl:25
[12] #fromchunks#47(::Nothing, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Array{Dagger.Thunk,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\table.jl:148
[13] fromchunks(::Array{Dagger.Thunk,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\table.jl:129
[14] offset_index!(::JuliaDB.DNDSparse{Tuple{Int64},NamedTuple{(:x1, :x2, :x3, :x4, :x5),Tuple{Float64,Float64,Int64,Int64,Float64}}}, ::Int64) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:28
[15] #_loadtable#188(::Int64, ::String, ::Bool, ::Array{Any,1}, ::Bool, ::Bool, ::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:header_exists, :colparsers, :datacols),Tuple{Bool,Dict{Int64,DataType},Array{Int64,1}}}}, ::Function, ::Type, ::Array{String,1}) at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:153
[16] #_loadtable at .\none:0 [inlined]
[17] #loadndsparse#187 at C:\Users\bernhard.konig\.julia\packages\JuliaDB\PRJbx\src\io.jl:82 [inlined]
[18] (::getfield(JuliaDB, Symbol("#kw##loadndsparse")))(::NamedTuple{(:output, :header_exists, :chunks, :colparsers, :datacols),Tuple{String,Bool,Int64,Dict{Int64,DataType},Array{Int64,1}}}, ::typeof(loadndsparse), ::Array{String,1}) at .\none:0
[19] top-level scope at util.jl:156
Simplest reproducer (initially surfaced when passing some NCASubjects through Dagger) I can come up with:
julia> VERSION
v"1.5.3-pre.0"
(Sandbox) pkg> st MemPool
Project Sandbox v0.1.0
Status `~/code/julia/Sandbox/Project.toml`
[f9f48841] MemPool v0.3.2
julia> using MemPool
julia> poolset(Union{Nothing,Vector{Float64}}[nothing])
ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
[1] fieldcount(::Any) at ./reflection.jl:703
[2] fixedlength(::Type{T} where T, ::IdDict{Any,Any}) at /home/mike/.julia/packages/MemPool/F33TL/src/io.jl:158
[3] fixedlength(::Type{T} where T) at /home/mike/.julia/packages/MemPool/F33TL/src/io.jl:148
[4] approx_size(::Type{T} where T, ::Int64, ::Array{Union{Nothing, Array{Float64,1}},1}) at /home/mike/.julia/packages/MemPool/F33TL/src/MemPool.jl:79
[5] approx_size(::Array{Union{Nothing, Array{Float64,1}},1}) at /home/mike/.julia/packages/MemPool/F33TL/src/MemPool.jl:75
[6] poolset(::Any, ::Int64) at /home/mike/.julia/packages/MemPool/F33TL/src/datastore.jl:116 (repeats 2 times)
[7] top-level scope at REPL[6]:1
Doesn't show up when it's a union of bitstypes like Union{Nothing,Float64}, so I'd guess it's got something to do with https://github.com/JuliaComputing/MemPool.jl/blob/a279024c8c6a3dba246b2303808f5fdd6bbe247a/src/io.jl#L150 perhaps?
#22 should make MemPool work for Vector{Union{Missing,T}}, at least in some cases. However, it will be using the slow fallback serialization, so eventually we should add faster serialization of Vector{Union{Missing,T}}, since such vectors will probably be common.
Maybe I am just misunderstanding the intended usage, since from the code it is quite clear that this is deliberate:
julia> r = MemPool.poolset(12);
julia> isnothing(MemPool.datastore[r.id].data)
false
julia> MemPool.movetodisk(r);
julia> isnothing(MemPool.datastore[r.id].data) # Data is moved the first time
true
julia> MemPool.poolget(r)
12
julia> MemPool.movetodisk(r);
julia> isnothing(MemPool.datastore[r.id].data) # Data is not moved the second time (file still exists)
false
julia> MemPool.pooldelete(MemPool.movetodisk(r)) # This deletes the file, but keeps r in the pool so that movetodisk does its thing again. Intended usage?
julia> MemPool.movetodisk(r);
julia> isnothing(MemPool.datastore[r.id].data)
true
julia> MemPool.poolget(r)
12
I have a table:
Distributed Table with 169678550 rows in 34 chunks:
Columns:
# colname type
────────────────────────────────────────────────
1 nr Int32
2 str String
The str column only takes 23 different values and I have a Dict that translates them into an Int8:
setcol(t, :str, :str=>c -> dictionary[c]);
yields:
On worker 2:
KeyError: key "" not found
but unique(collect(select(t, :str))) just shows the 23 different values and "" is not one of them. If I manually add "" to the Dict with some arbitrary value I get:
On worker 5:
ArgumentError: Reference array points beyond the end of the pool
I've run into a few places where Union types are not handled (one is fixed in #20).
Another is here (type Union has no field types): https://github.com/JuliaComputing/MemPool.jl/blob/feacabb3392b7b3dae059dc9d9f37f7a3a9f6c1a/src/io.jl#L188-L207
A lot of the logic with isbitstype needs to get combined with Base.isbitsunion (thanks to @JeffBezanson for helping me with that in #20).
EDIT: I came across the above example while trying to save a JuliaDB.NextTable with Union columns to disk.
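To make the isbitstype / Base.isbitsunion distinction concrete, here is a small check using only Base. It makes no MemPool calls, and is_inline_storable is a hypothetical helper name I made up for illustration, not something in MemPool. Note that Base.isbitsunion is unexported and internal.

```julia
# Plain-Julia checks; no MemPool calls are made here.
T_bits  = Union{Nothing,Float64}          # union of bits types
T_mixed = Union{Nothing,Vector{Float64}}  # union that contains a non-bits type

# isbitstype is only true for concrete bits DataTypes, so it is false for any Union.
@show isbitstype(T_bits)
@show isbitstype(T_mixed)

# Base.isbitsunion (internal, unexported) tells the two unions apart.
@show Base.isbitsunion(T_bits)
@show Base.isbitsunion(T_mixed)

# Hypothetical helper (NOT part of MemPool): accept bits types and bits unions alike.
is_inline_storable(T) = isbitstype(T) || Base.isbitsunion(T)
```

A check like this would let the fixed-length code path cover Union{Nothing,Float64} while still sending Union{Nothing,Vector{Float64}} to the fallback.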
This ccall is broken on master, but is also unnecessary (since Julia v1.6)
Lines 28 to 29 in f62e3ec
GC.safepoint was added in Julia 1.4. a7f5703 adds a call to GC.safepoint(). Project.toml suggests this package should be compatible with Julia >= 1.0.
I'm not sure why the GC.safepoint() call was added. If it was to increase GC performance, I'd suggest just wrapping it with something like this:
@static if VERSION >= v"1.4"
    GC.safepoint()
end
Otherwise, perhaps Project.toml should be updated to indicate compatibility with only Julia >= 1.4.
(btw, I discovered this while fixing tests in another package that should support older Julia; I'm not using Julia < 1.4 in production).
At the moment, creating a new DRef requires the associated data to be in memory. This makes it annoying to persist datasets and be able to easily re-load them in a future session. We should provide a means to generate a new DRef from a file stored on a device, and provide some high-level API (maybe another kwarg to poolset) to expose this.
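As a rough illustration of the idea (not a proposal for the actual API), here is a sketch using only the Serialization stdlib. FileBackedRef, persist, attach and fetchdata are all hypothetical names, and the sketch assumes Julia >= 1.1 for the path-based serialize/deserialize methods. The point is only that the reference can be created from a path without the data being in memory.

```julia
using Serialization

# Hypothetical type, NOT part of MemPool: a reference to data that lives in a file.
struct FileBackedRef
    path::String
end

# Write `data` to `path` and return a reference that does not hold the data itself.
function persist(data, path::AbstractString)
    serialize(path, data)
    return FileBackedRef(String(path))
end

# Create a reference from a file written in an earlier session; nothing is loaded yet.
function attach(path::AbstractString)
    isfile(path) || throw(ArgumentError("no such file: $path"))
    return FileBackedRef(String(path))
end

# Load the data on demand.
fetchdata(ref::FileBackedRef) = deserialize(ref.path)
```

In a later session only attach(path) would be needed, and the data would be read the first time fetchdata is called.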
When I load a table that I saved with an old version of JuliaDB using master versions of JuliaDB, MemPool, IndexedTables and Julia 1.0.3:
UndefVarError: T not defined
Stacktrace:
[1] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{StructArrays.StructArray{T,N,C} where C<:NamedTuple where N}}) at C:\Users\Max\.julia\packages\MemPool\tlPqB\src\io.jl:27
[2] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
[3] mmread(::Type{IndexedTable}, ::Serialization.Serializer{IOStream}, ::Bool) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
[4] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{IndexedTable}}) at C:\Users\Max\.julia\packages\MemPool\tlPqB\src\io.jl:27
[5] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
[6] deserialize(::Serialization.Serializer{IOStream}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
[7] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:801
[8] deserialize(::Serialization.Serializer{IOStream}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
[9] deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:708 [inlined]
[10] #open#294(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::typeof(Serialization.deserialize), ::String) at .\iostream.jl:369
[11] open at .\iostream.jl:367 [inlined]
[12] load(::String) at C:\Users\Max\.julia\packages\JuliaDB\t5MGj\src\io.jl:181
[13] top-level scope at In[1]:2
Any ideas?
Maybe the first question to ask is whether this package is intended for people like me who only wish to use the disk as an extension of the RAM?
Anyway, it seems like the files created need to be manually deleted (e.g. by calling pooldelete) when no longer needed, or?
If I want files to be deleted when the DRef is no longer used, is it correct to wrap the DRef in a Ref and register a finalizer for that Ref (while of course making sure all other parts of the program only interact with the referenced data through some struct wrapping the Ref)?
It seems to work in casual testing, but I'm afraid there might be hidden pitfalls when it is let loose in the wild.
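For reference, the pattern I mean looks roughly like the sketch below. It uses a plain temporary file instead of a DRef so that it is self-contained; OwnedFile is a hypothetical name. In the MemPool case the finalizer body would presumably call pooldelete instead of rm, and whether that is safe to do inside a finalizer is exactly what I am unsure about.

```julia
# Hypothetical wrapper, NOT part of MemPool: owns a file and deletes it when collected.
mutable struct OwnedFile
    path::String
    function OwnedFile(path::AbstractString)
        obj = new(String(path))
        # force=true: do not throw if the file is already gone.
        finalizer(o -> rm(o.path; force=true), obj)
        return obj
    end
end

# Usage: the file lives as long as the wrapper is reachable.
p = tempname()
write(p, "some spilled data")
owned = OwnedFile(p)
```

The GC decides when the finalizer runs, so the file may stay on disk long after the last use; calling finalize(owned) forces the cleanup immediately.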
In a cluster spread across machines, with no shared storage, we need a way to fetch FileRefs that could be residing on a separate machine.
The "Usage" section in the README starts off talking about MemPool's LRU functionality, however such functionality was disabled some months ago. We should either remove this text or re-enable the LRU; which of these is the more desirable option?
Appears sometimes when process exits
IOError: unlink("/home/krynju/.mempool/sess-utvz1V-1/h2x1LD/jl_N2bctMjqbi"): no such file or directory (ENOENT)
Stacktrace:
[1] uv_error
@ ./libuv.jl:97 [inlined]
[2] unlink(p::String)
@ Base.Filesystem ./file.jl:972
[3] rm(path::String; force::Bool, recursive::Bool)
@ Base.Filesystem ./file.jl:283
[4] rm(path::String; force::Bool, recursive::Bool) (repeats 2 times)
@ Base.Filesystem ./file.jl:294
[5] (::MemPool.var"#203#206"{Int64})()
@ MemPool ~/.julia/packages/MemPool/Ggdm4/src/MemPool.jl:163
[6] _atexit()
@ Base ./initdefs.jl:372
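A possible way to make the exit hook tolerant of files that have already disappeared is sketched below. cleanup_dir is a hypothetical helper, not MemPool's actual code, and the trace does not show whether the original rm call passed force=true, so this is an assumption about what might help rather than a diagnosis.

```julia
# Hypothetical cleanup helper, NOT MemPool's actual code.
# Returns true if the directory is gone afterwards, false if a race was detected.
function cleanup_dir(dir::AbstractString)
    try
        # force=true makes rm ignore paths that no longer exist.
        rm(dir; force=true, recursive=true)
    catch err
        # Another process may still remove an entry between listing and unlinking.
        if err isa Base.IOError && err.code == Base.UV_ENOENT
            return false
        end
        rethrow()
    end
    return true
end
```

It could be registered with atexit(() -> cleanup_dir(dir)), where dir is the session directory.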
I try to load a table that I saved with a previous version of JuliaDB. It contains a column that is of type Int8. I get:
MethodError: no method matching PooledArrays.PooledArray(::PooledArrays.RefArray{Array{UInt8,1}}, ::Array{Int8,1})
[1] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{PooledArrays.PooledArray}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[2] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[3] collect_to!(::Array{Array{T,1} where T,1}, ::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}, ::Int64, ::Int64) at .\array.jl:508
[4] collect_to!(::Array{Array{Int32,1},1}, ::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}, ::Int64, ::Int64) at .\array.jl:518
[5] collect(::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}) at .\array.jl:476
[6] mmread(::Type{IndexedTables.Columns}, ::SerializationState{IOStream}, ::Bool) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\serialize.jl:57
[7] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{IndexedTables.Columns}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[8] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[9] mmread(::Type{IndexedTables.NextTable}, ::SerializationState{IOStream}, ::Bool) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\serialize.jl:86
[10] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{IndexedTables.NextTable}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[11] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[12] open(::Base.Serializer.#deserialize, ::String) at .\iostream.jl:152
[13] load(::String) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\io.jl:185
I am using Julia 0.6.2 and:
- MemPool 0.0.11
- JuliaDB 0.8.4
When benchmarking a parallel application which uses Dagger, it seems like MemPool.approx_size is the bottleneck due to it falling back to Base.summarysize.
Here is a quick MWE:
julia> using BenchmarkTools, DataFrames, MemPool
julia> df = DataFrame(a=1:1000_000, b=randn(1000_000), c=repeat([:aa], 1000_000));
julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial:
memory estimate: 61.03 MiB
allocs estimate: 1999540
--------------
minimum time: 110.895 ms (4.59% GC)
median time: 119.604 ms (2.47% GC)
mean time: 122.978 ms (2.83% GC)
maximum time: 146.009 ms (1.46% GC)
--------------
samples: 41
evals/sample: 1
Here is a sketch of an alternative implementation which is much faster:
julia> function MemPool.approx_size(df::DataFrame)
           dsize = mapreduce(MemPool.approx_size, +, eachcol(df))
           namesize = mapreduce(MemPool.approx_size, +, names(df))
           return dsize + namesize
       end
julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial:
memory estimate: 704 bytes
allocs estimate: 13
--------------
minimum time: 535.700 μs (0.00% GC)
median time: 636.800 μs (0.00% GC)
mean time: 664.967 μs (0.00% GC)
maximum time: 1.525 ms (0.00% GC)
--------------
samples: 7499
evals/sample: 1
The above implementation is not 100% correct, but I hope it shows that there is some potential for improvement.
Don't know if there is some interface which can be used to avoid the dependency, e.g. Tables.jl.
0.2.0 works just fine...
julia> JuliaDB.load(loaction);
ERROR: MethodError: no method matching getindex(::Symbol, ::Int64)
Stacktrace:
[1] mmread(::Type{Array{String,1}}, ::Serialization.Serializer{IOStream}, ::Bool) at /home/user/.julia/packages/MemPool/ZLU0k/src/io.jl:135
[2] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{Array{String,1}}}) at /home/user/.julia/packages/MemPool/ZLU0k/src/io.jl:29
[3] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:799
[4] deserialize(::Serialization.Serializer{IOStream}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:735
[5] #30 at ./none:0 [inlined]
[6] iterate at ./generator.jl:47 [inlined]
[7] collect_to!(::Array{Array{Base.UUID,1},1}, ::Base.Generator{UnitRange{Int64},JuliaDB.var"#30#32"{Serialization.Serializer{IOStream}}}, ::Int64, ::Int64) at ./array.jl:711
[8] collect_to_with_first!(::Array{Array{Base.UUID,1},1}, ::Array{Base.UUID,1}, ::Base.Generator{UnitRange{Int64},JuliaDB.var"#30#32"{Serialization.Serializer{IOStream}}}, ::Int64) at ./array.jl:689
[9] collect(::Base.Generator{UnitRange{Int64},JuliaDB.var"#30#32"{Serialization.Serializer{IOStream}}}) at ./array.jl:670
[10] mmread(::Type{StructArrays.StructArray{T,1,C,I} where I where C<:Union{Tuple, NamedTuple} where T}, ::Serialization.Serializer{IOStream}, ::Bool) at /home/user/.julia/packages/JuliaDB/7cG1k/src/serialize.jl:53
[11] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{StructArrays.StructArray{T,1,C,I} where I where C<:Union{Tuple, NamedTuple} where T}}) at /home/user/.julia/packages/MemPool/ZLU0k/src/io.jl:29
[12] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:799
[13] deserialize(::Serialization.Serializer{IOStream}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:735
[14] mmread(::Type{IndexedTable}, ::Serialization.Serializer{IOStream}, ::Bool) at /home/user/.julia/packages/JuliaDB/7cG1k/src/serialize.jl:84
[15] deserialize(::Serialization.Serializer{IOStream}, ::Type{MemPool.MMSer{IndexedTable}}) at /home/user/.julia/packages/MemPool/ZLU0k/src/io.jl:29
[16] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:799
[17] deserialize(::Serialization.Serializer{IOStream}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:735
[18] handle_deserialize(::Serialization.Serializer{IOStream}, ::Int32) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:838
[19] deserialize(::Serialization.Serializer{IOStream}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:735
[20] deserialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Serialization/src/Serialization.jl:722 [inlined]
[21] open(::typeof(Serialization.deserialize), ::String; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:298
[22] open at ./io.jl:296 [inlined]
[23] load(::String; procs::Array{Int64,1}) at /home/user/.julia/packages/JuliaDB/7cG1k/src/io.jl:182
[24] load at /home/user/.julia/packages/JuliaDB/7cG1k/src/io.jl:174 [inlined]
Example situation:
rows(t)
), MemPool thinks that this operation needs to free up 1 GB of space, starts evicting objects to disk
The problem is that MemPool is not accounting for the vectors it writes to disk.
Based on @tanmaykm's suggested fix:
This has a few problems:
The API that poolget provides, while simple, is unfortunately harmful to swap-to-disk and related use cases, as it does not indicate when the returned data is no longer in use. In a sense, this is by design: Dagger does not know whether values returned by poolget escape via user code, and how long their lifetime might be, so it cannot reasonably assert that the data is no longer being accessed at some point in time.
However, we should not be forced to live without such guarantees, as it makes any improvements to the storage system questionably beneficial. Thankfully, with the advent of improved escape analysis in Julia's compiler, it should be possible for Dagger to sometimes determine the lifetime of a piece of data, and so communicate this information to MemPool. This will take the form of at least one new public "unpin" function, which will pair with either poolget or a new "pin" function, to assert the end of the returned value's lifetime. At the point when memory is fully unpinned, MemPool will be free to take aggressive steps to delete the data and free up memory, instead of just hoping that the GC will do a good job. Of course, not all accesses may have a known lifetime, so we'll need to allow such calls to disable pinning and revert to GC-managed deallocation.
This mechanism can be beneficial for any kind of data which can be explicitly deleted, such as GPU arrays, or another user-defined type with an available memory free function.
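To illustrate the pairing, here is a minimal reference-counting sketch. Everything in it (PinStore, PinnedEntry, store!, pin!, unpin!) is hypothetical and does not exist in MemPool; dropping the entry on the last unpin merely stands in for "MemPool is now free to release the data".

```julia
# Hypothetical sketch; none of these names exist in MemPool.
mutable struct PinnedEntry
    data::Any
    pins::Int
end

struct PinStore
    lock::ReentrantLock
    entries::Dict{Int,PinnedEntry}
end
PinStore() = PinStore(ReentrantLock(), Dict{Int,PinnedEntry}())

# Register data under `id` with no outstanding pins.
function store!(s::PinStore, id::Int, data)
    lock(s.lock) do
        s.entries[id] = PinnedEntry(data, 0)
    end
    return id
end

# Return the data and record that a caller is using it.
function pin!(s::PinStore, id::Int)
    lock(s.lock) do
        e = s.entries[id]
        e.pins += 1
        return e.data
    end
end

# Record the end of one use; returns true if the entry was released.
function unpin!(s::PinStore, id::Int)
    lock(s.lock) do
        e = s.entries[id]
        e.pins > 0 || error("unpin! without a matching pin!")
        e.pins -= 1
        if e.pins == 0
            delete!(s.entries, id)  # eager release instead of waiting for the GC
            return true
        end
        return false
    end
end
```

Accesses with unknown lifetime would skip pin! and unpin! entirely and leave the entry to GC-managed deallocation, which this sketch does not model.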
Leaving this here as it's the first time I see this - it continued computing the result afterwards, but on one worker only
Super rare
@@@ STARTED: innerjoin_r_unique : 2021-12-30T14:01:24.962
From worker 2: Error in enqueued work:
From worker 2: peer 4 didn't connect to 2 within 59.999990940093994 secondsError in enqueued work:
From worker 2: On worker 4:
From worker 2: KeyError: key (4, 12) not found
From worker 2: Stacktrace:
From worker 2: [1] getindex
From worker 2: @ ./dict.jl:498 [inlined]
From worker 2: [2] canfree
From worker 2: @ ~/.julia/packages/MemPool/wlrUg/src/datastore.jl:72
From worker 2: [3] #36
From worker 2: @ ~/.julia/packages/MemPool/wlrUg/src/datastore.jl:209
From worker 2: [4] macro expansion
From worker 2: @ ~/.julia/packages/MemPool/wlrUg/src/lock.jl:42 [inlined]
From worker 2: [5] with_datastore_lock
From worker 2: @ ~/.julia/packages/MemPool/wlrUg/src/datastore.jl:218
From worker 2: [6] pooltransfer_recv
From worker 2: @ ~/.julia/packages/MemPool/wlrUg/src/datastore.jl:201
From worker 2: [7] #invokelatest#2
From worker 2: @ ./essentials.jl:731
From worker 2: [8] invokelatest
From worker 2: @ ./essentials.jl:729
From worker 2: [9] #120
From worker 2: @ /usr/local/julia/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:294
From worker 2: [10] run_work_thunk
From worker 2: @ /usr/local/julia/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:63
From worker 2: [11] run_work_thunk
From worker 2: @ /usr/local/julia/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:72
From worker 2: [12] #106
Reason is that ConcurrentUtils.jl depends on Try which uses an invalid file name for one of the documentation pages:
After upgrading to 0.0.7, using JuliaDB yields:
INFO: Recompiling stale cache file C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\lib\v0.6\MemPool.ji for module MemPool.
WARNING: Module Compat with uuid 367821964343373 is missing from the cache.
This may mean module Compat does not support precompilation but is imported by a module that does.
ERROR: LoadError: LoadError: Declaring __precompile__(false) is not allowed in files that are being precompiled.
Stacktrace:
[1] _require(::Symbol) at .\loading.jl:455
[2] require(::Symbol) at .\loading.jl:405
[3] _include_from_serialized(::String) at .\loading.jl:157
[4] _require_from_serialized(::Int64, ::Symbol, ::String, ::Bool) at .\loading.jl:200
[5] _require_search_from_serialized(::Int64, ::Symbol, ::String, ::Bool) at .\loading.jl:236
[6] _require(::Symbol) at .\loading.jl:441
[7] require(::Symbol) at .\loading.jl:405
[8] include_from_node1(::String) at .\loading.jl:576
[9] include(::String) at .\sysimg.jl:14
[10] include_from_node1(::String) at .\loading.jl:576
[11] include(::String) at .\sysimg.jl:14
[12] anonymous at .\<missing>:2
while loading C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\datastore.jl, in expression starting on line 235
while loading C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\MemPool.jl, in expression starting on line 50
At the moment, our storage layer (in StorageState) is non-transactional, and so it's not possible to ensure that a certain set of operations occur without other (incompatible or performance-degrading) operations occurring in-between. This can make it impossible for the SimpleRecencyAllocator (which provides our swap-to-disk functionality) to provide any guarantees that memory/disk usage actually falls within user-provided limits.
The obvious solution here is to add "transactions", where the set of operations to be performed in sequence are batched up and submitted as a single logical operation. Any transaction in progress can exclude other operations or transactions from occurring on the same piece(s) of data until the current transaction completes. Additionally, for performance, we can implement some logic to "merge" compatible operations across multiple transactions (or within the same transaction) when the end result will be identical to running the operations exclusively.
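A very reduced sketch of the batching idea follows. TxStore, Transaction, add! and submit! are hypothetical names, the store is a plain Dict rather than StorageState, and a single coarse lock stands in for per-data exclusion. Merging compatible operations across transactions is not modeled.

```julia
# Hypothetical sketch; this is NOT MemPool's StorageState.
struct TxStore
    lock::ReentrantLock
    data::Dict{Symbol,Int}
end
TxStore() = TxStore(ReentrantLock(), Dict{Symbol,Int}())

# A transaction is an ordered batch of operations on the store's data.
struct Transaction
    ops::Vector{Function}
end
Transaction() = Transaction(Function[])

# Queue an operation; nothing runs until the transaction is submitted.
add!(tx::Transaction, op) = (push!(tx.ops, op); tx)

# Run the whole batch while holding the lock, so no other submit! can interleave.
function submit!(s::TxStore, tx::Transaction)
    lock(s.lock) do
        for op in tx.ops
            op(s.data)
        end
    end
    return s
end

# Usage: two operations applied as one logical unit.
store = TxStore()
tx = Transaction()
add!(tx, d -> d[:a] = 1)
add!(tx, d -> d[:a] += 1)
submit!(store, tx)
```

A real implementation would want finer-grained exclusion (only the data touched by the transaction) so that unrelated operations are not serialized.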
I am not sure whether this issue belongs here. On Julia 1.0.3, JuliaDB 0.10, and MemPool 0.1.2 (or master) I saved a table:
Table with 429003 rows, 7 columns:
Columns:
# colname type
──────────────────────────────────────
1 name1 String
2 name2 String
3 id1 Int64
4 id2 Int64
5 date1 Union{Missing, Date}
6 date2 Union{Missing, Date}
7 date3 Union{Missing, Date}
and when I load it back in Julia crashes with:
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7fff76cf4300 -- memmove at C:\WINDOWS\System32\msvcrt.dll (unknown line)
in expression starting at no file:0
memmove at C:\WINDOWS\System32\msvcrt.dll (unknown line)
jl_pchar_to_string at /home/Administrator/buildbot/worker/package_win64/build/src\array.c:472
unsafe_string at .\strings\string.jl:53
unknown function (ip: 000000001276EBB0)
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
mmread at C:\Users\Max\.julia\packages\MemPool\Bw8DR\src\io.jl:135
deserialize at C:\Users\Max\.julia\packages\MemPool\Bw8DR\src\io.jl:27
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
handle_deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711 [inlined]
#30 at .\none:0 [inlined]
iterate at .\generator.jl:47 [inlined]
collect at .\array.jl:619
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
mmread at C:\Users\Max\.julia\packages\JuliaDB\R4e6y\src\serialize.jl:53
deserialize at C:\Users\Max\.julia\packages\MemPool\Bw8DR\src\io.jl:27
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
handle_deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
mmread at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
deserialize at C:\Users\Max\.julia\packages\MemPool\Bw8DR\src\io.jl:27
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
handle_deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:762
deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
handle_deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:801
deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:711
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184 [inlined]
jl_apply at /home/Administrator/buildbot/worker/package_win64/build/src\julia.h:1537 [inlined]
jl_invoke at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:56
deserialize at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Serialization\src\Serialization.jl:708 [inlined]
#open#294 at .\iostream.jl:369
jl_fptr_trampoline at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:1831
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184 [inlined]
jl_apply at /home/Administrator/buildbot/worker/package_win64/build/src\julia.h:1537 [inlined]
jl_invoke at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:56
open at .\iostream.jl:367 [inlined]
load at C:\Users\Max\.julia\packages\JuliaDB\R4e6y\src\io.jl:181
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
do_call at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:324
eval_value at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:430
eval_stmt_value at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:363 [inlined]
eval_body at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:678
jl_interpret_toplevel_thunk_callback at /home/Administrator/buildbot/worker/package_win64/build/src\interpreter.c:806
unknown function (ip: FFFFFFFFFFFFFFFE)
unknown function (ip: 0000000008BB720F)
unknown function (ip: FFFFFFFFFFFFFFFF)
jl_toplevel_eval_flex at /home/Administrator/buildbot/worker/package_win64/build/src\toplevel.c:805
jl_toplevel_eval_in at /home/Administrator/buildbot/worker/package_win64/build/src\builtins.c:622
eval at .\boot.jl:319
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
eval_user_input at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\REPL\src\REPL.jl:85
macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\REPL\src\REPL.jl:117 [inlined]
#28 at .\task.jl:259
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2184
jl_apply at /home/Administrator/buildbot/worker/package_win64/build/src\julia.h:1537 [inlined]
start_task at /home/Administrator/buildbot/worker/package_win64/build/src\task.c:268
Allocations: 38776166 (Pool: 38767273; Big: 8893); GC: 84
Line 168 in 8508088
In exit_hook, the default_dir() is removed. What happens if another directory is used, e.g., by passing diskpath to DiskCacheConfig, which in turn is passed to setup_global_device!? Does that custom diskpath remain, i.e., does it need to be deleted manually? And is this a bug or a feature?
Line 963 in 8508088
The call to reverse above assumes to_delete is in ascending order, which isn't necessarily the case. I didn't test thoroughly, so I don't know which combinations of to_mem and sra.policy cause an error, but I got an error about trying to delete an index that didn't exist when using an LRU policy (I don't know whether to_mem was true or not). The problem did not occur when I switched to an MRU policy. Sorry I don't have an MWE, though I can reliably reproduce the issue.
My problem went away when I replaced reverse(to_delete) with sort(to_delete; rev = true).
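A minimal sketch of why the ordering matters, assuming the indices are deleted one at a time from a vector. The helper delete_each! and the sample indices are hypothetical and are not taken from MemPool's source; they only illustrate the failure mode described above.

```julia
# Hypothetical helper (not MemPool code): delete indices one by one, in the given order.
function delete_each!(items::Vector, order)
    for idx in order
        deleteat!(items, idx)
    end
    return items
end

# Assumed example: indices that are NOT in ascending order.
to_delete = [4, 2, 5]

# reverse(to_delete) is [5, 2, 4]. After removing positions 5 and 2 the vector
# has only three elements left, so removing position 4 fails.
failed = try
    delete_each!(collect(1:5), reverse(to_delete))
    false
catch
    true
end

# Sorting in descending order removes the highest index first, so the
# remaining indices still refer to their original positions.
safe = delete_each!(collect(1:5), sort(to_delete; rev = true))
```

With an ascending to_delete, the two forms would visit the indices in the same order, which may be why the problem only shows up under some policies.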
It would be useful to be able to associate locality with a FileRef. Locality can be at the node level, naming one or more nodes if present.