Comments (6)
There's a little explanation for the current behavior in the first paragraph of this section. To be a little more explicit, I settled on this design as a way to solve a tricky interaction between Julia's GC and CUDA:
function foo()
    p = CUDArt.malloc(Float32, 8)  # let's say this points to 0x1234
    # Do some stuff with p
end
foo()
device_reset()
# Let's say p is still hanging around on the julia side, even though the memory has been freed on the CUDA side
device(0)
x = CUDArt.malloc(Float32, 8)  # this might also point to 0x1234---CUDA can reassign the same pointer across sessions
gc()          # oops, the memory behind x just got freed because p's finalizer ran
fill!(x, 4)   # segfault
There's an explicit test for this behavior here: https://github.com/JuliaGPU/CUDArt.jl/blob/35e9e4922eba7ed4bd3154b520e6716f345195c1/test/gc.jl#L43-L62
Hmm, I wonder if there might be another way to solve the same problem: perhaps each "session" could start with a unique version number (stored as some global), and device_reset increments the version number. The free function for a pointer could check that the current version number matches the version number at the time of allocation---if it doesn't, don't free the memory.
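A minimal sketch of what I have in mind (VersionedPtr, versioned_malloc, versioned_free, and cuda_session are hypothetical names, not existing CUDArt API; it also assumes CUDArt.free accepts whatever CUDArt.malloc returns):
const cuda_session = Ref(0)   # device_reset() would do cuda_session[] += 1

immutable VersionedPtr{P}
    ptr::P        # whatever CUDArt.malloc returned
    session::Int  # value of cuda_session[] at allocation time
end

versioned_malloc(T, len) = VersionedPtr(CUDArt.malloc(T, len), cuda_session[])

function versioned_free(vp::VersionedPtr)
    # Free only if we are still in the session that allocated the pointer;
    # after a device_reset the memory is already gone and the same address
    # may belong to a new, unrelated allocation.
    if vp.session == cuda_session[]
        CUDArt.free(vp.ptr)
    end
end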
Thoughts? An alternative is to figure out which calls in device_reset sometimes fail, and wrap them in a try/catch.
There's a little explanation for the current behavior in the first paragraph of this section. To be a little more explicit, I settled on this design as a way to solve a tricky interaction between Julia's GC and CUDA:
Hm, that's a tricky issue, I hadn't thought of that. Given that you never know when a device might be reset (in another thread, even?), I think this means that finalizers can't be made to work correctly.
Hmm, I wonder if there might be another way to solve the same problem: perhaps each "session" could start with a unique version number (stored as some global), and device_reset increments the version number. The free function for a pointer could check that the current version number matches the version number at the time of allocation---if it doesn't, don't free the memory.
This sounds complicated; it's like trying to keep track of the device context object inside libcudart. How would this handle multiple threads doing this? Can you guarantee that it wouldn't go wrong? I would like to propose something else.
Garbage collection is non-deterministic, so you can't really rely on finalizers running exactly at the end of a "compute region", which starts with initialization with cudaSetDevice and ends with cudaDeviceReset. But everything that is associated with the device context in between those two calls is only valid until the matching cudaDeviceReset, which can be called almost anywhere, even in another thread.
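To make that bracketing concrete, here is a rough illustration of a "compute region" (illustrative only; the device number and allocation size are arbitrary):
device(0)                        # calls cudaSetDevice: start of the compute region
a = CUDArt.malloc(Float32, 100)  # tied to device 0's context
# ... kernel launches, copies, etc. ...
device_reset()                   # calls cudaDeviceReset: end of the region; `a` is now
                                 # invalid, no matter which thread triggered the reset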
My proposal:
- No finalizers at all. I don't see how to make sure that the device context is the same one and the pointer is still valid.
- No global dictionary of allocated pointers. Kind of "hacky", and still not robust enough to be guaranteed to work.
cudaDeviceReset promises (see the cudaDeviceReset documentation) that it "Explicitly destroys and cleans up all resources associated with the current device in the current process."
- When device_reset is called, it should call cudaDeviceReset, ignoring any errors (which must have come from an earlier launch, because cudaDeviceReset promises to return cudaSuccess). This cleans up all resources.
- If a library user keeps a pointer around after a device reset, so that the pointer is invalidated, that is a user bug, not a library bug. The user should be aware that GPU memory is scarce anyway, and so should pay attention to allocations with or without gc.
- To facilitate temporary allocations with a known scope, make a new version of malloc (with variants for arrays and pitched arrays) like this:
CudaPitchedArray(Float32, 100, 100) do arr
    ...
end
where CudaArray(::Function, ...) allocates an array, runs the function, and then frees it. This is guaranteed to make the array's scope completely correct for most common uses. Plenty of other languages, like Haskell and Python (with the with statement), do it like this for scarce non-memory resources, like file handles. Otherwise the user can call free! on the array themselves, or let it live until device reset.
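Roughly, such a method could be implemented like this (a sketch; with_pitched_array is just a stand-in name for the proposed CudaPitchedArray(::Function, ...) method, and it assumes the existing CudaPitchedArray constructor and free):
# Sketch of the proposed do-block allocator (hypothetical helper name).
function with_pitched_array(f::Function, T::Type, dims::Integer...)
    arr = CudaPitchedArray(T, dims...)
    try
        return f(arr)   # run the user's code with the temporary array
    finally
        free(arr)       # always freed, even if f throws
    end
end

# Usage with do-block syntax:
# with_pitched_array(Float32, 100, 100) do arr
#     fill!(arr, 4.0f0)
# end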
Thoughts? An alternative is to figure out which calls in device_reset sometimes fail, and wrap them in a try/catch.
I think it's kind of brittle. Also, I'm not very happy with a wrapper library that occasionally passes weird invalid inputs for no good reason. You would also have to clear the error state with cudaGetLastError, so that the user doesn't see any of the failures.
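Concretely, the reset path proposed above could look something like this (a sketch; the wrapper name and the "libcudart" handle are illustrative, and the return codes are deliberately discarded):
# Sketch of a device_reset that ignores stale errors (not CUDArt's actual code).
function device_reset_sketch(dev::Integer)
    device(dev)  # make `dev` the current device
    # cudaDeviceReset itself succeeds; any error it reports is left over from an
    # earlier asynchronous launch, so the return value is deliberately ignored.
    ccall((:cudaDeviceReset, "libcudart"), Cint, ())
    # cudaGetLastError returns and clears the sticky error state, so the user
    # doesn't see stale failures afterwards.
    ccall((:cudaGetLastError, "libcudart"), Cint, ())
    nothing
end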
To facilitate temporary allocations with a known scope make a new version of malloc (with variants for arrays, pitched arrays) like this:
CudaPitchedArray(Float32, 100, 100) do arr
    ...
end
It's an interesting idea, and extends a pattern that is obviously already used heavily within CUDArt for initialization of devices and modules. I worry a little bit about the performance implications, since julia's anonymous functions are not speedy. But it certainly would be a simple and robust approach from the standpoint of the package. (Less convenient for the user, of course.)
Keep in mind that the current system runs the finalizers either upon gc() or upon device_reset(). To me that doesn't seem so terribly fragile. I guess I hadn't considered what would happen if the user calls device_reset() from another thread. But I also think that would break your solution too, right? (If that happened in the middle of executing that do block.)
In other words, I agree that context errors are nasty and need to be stomped out. But I'm a little reluctant to break the more "julian" behavior unless it's abundantly clear that the current system can't work. I'd be more convinced if, for example, I could replicate #18 and knew there wasn't an easy way to fix it. (Of course, other demonstrations of the problem would also be fine.)
Okay, here is an example where CUDArt is not reset correctly. I think this can be addressed by ignoring all errors in the finalizers in device_reset, because cudaFree and other such functions can return error codes from previous asynchronous functions.
function test4()
    try
        @show CUDArt.cuda_ptrs
        @show CUDArt.ptxdict
        devices(dev->true, nmax=1) do devlist
            device(devlist[1])
            md = CuModule("test.ptx", false)
            func = CuFunction(md, "test_kernel")
            try
                x = [CudaPitchedArray(Float32, 100, 100) for k in 1:10]
                launch(func, 1, 1, (x[1],))
            finally
                unload(md)
            end
        end
    finally
        @show CUDArt.cuda_ptrs
        @show CUDArt.ptxdict
    end
end
This will fail:
extern "C"
__global__
void test_kernel() {
    float* ptr = 0;
    *ptr = 1.f;
}
Outcome:
julia> Test.test4()
CUDArt.cuda_ptrs = Dict{Any,Int64}()
CUDArt.ptxdict = Dict{Any,Any}()
CUDArt.cuda_ptrs = Dict{Any,Int64}(Ptr{Void} @0x0000004200170800=>0,Ptr{Void} @0x000000420010c800=>0,Ptr{Void} @0x0000004200132000=>0,Ptr{Void} @0x000000420014b000=>0,Ptr{Void} @0x0000004200125800=>0,Ptr{Void} @0x000000420013e800=>0,Ptr{Void} @0x0000004200100000=>0,Ptr{Void} @0x0000004200164000=>0,Ptr{Void} @0x0000004200119000=>0,Ptr{Void} @0x0000004200157800=>0)
CUDArt.ptxdict = Dict{Any,Any}((0,"fill_pitched",UInt32)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007291b00),(0,"fill_pitched",Int8)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007535120),(0,"fill_contiguous",Int16)=>CUDArt.CuFunction(Ptr{Void} @0x000000000752ddc0),(0,"fill_contiguous",UInt64)=>CUDArt.CuFunction(Ptr{Void} @0x00000000074fedf0),(0,"fill_contiguous",UInt16)=>CUDArt.CuFunction(Ptr{Void} @0x00000000045e61c0),(0,"fill_contiguous",Int64)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007532c10),(0,"fill_pitched",UInt16)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007528b90),(0,"clock_block")=>CUDArt.CuFunction(Ptr{Void} @0x0000000007543b80),(0,"fill_pitched",UInt8)=>CUDArt.CuFunction(Ptr{Void} @0x000000000751ea90),(0,"fill_contiguous",Int8)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007537a60),(0,"fill_pitched",Int64)=>CUDArt.CuFunction(Ptr{Void} @0x000000000753eda0),(0,"fill_pitched",Int32)=>CUDArt.CuFunction(Ptr{Void} @0x000000000752b4d0),(0,"fill_contiguous",UInt32)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007522e20),(0,"fill_pitched",Float64)=>CUDArt.CuFunction(Ptr{Void} @0x00000000075302b0),(0,"fill_pitched",Int16)=>CUDArt.CuFunction(Ptr{Void} @0x000000000753c460),(0,"fill_contiguous",UInt8)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007524cd0),(0,"fill_pitched",Float32)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007404420),(0,"fill_contiguous",Int32)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007541690),(0,"fill_contiguous",Float64)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007539ee0),(0,"fill_contiguous",Float32)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007526b80),(0,"fill_pitched",UInt64)=>CUDArt.CuFunction(Ptr{Void} @0x0000000007520b30))
ERROR: Kernel launch failed
in checkdrv at /***/.julia/v0.4/CUDArt/src/module.jl:6
in close! at /***/.julia/v0.4/CUDArt/src/device.jl:93
in devices at /***/.julia/v0.4/CUDArt/src/device.jl:59
in devices at /***/.julia/v0.4/CUDArt/src/device.jl:49
in test4 at /***/test.jl:94