Comments (7)
Here's a breakdown of what's happening inside `Program.build`:
```
│ ├─ 11.333 Program.build  pyopencl/__init__.py:505
│ │  └─ 11.333 Program._build_and_catch_errors  pyopencl/__init__.py:554
│ │     └─ 11.333 <lambda>  pyopencl/__init__.py:536
│ │        └─ 11.333 create_built_program_from_source_cached  pyopencl/cache.py:489
│ │           └─ 11.333 _create_built_program_from_source_cached  pyopencl/cache.py:341
│ │              ├─ 11.186 PyCapsule.get_info  <built-in>
│ │              ├─ 0.145 _Program.program_build  pyopencl/__init__.py:735
│ │              │  └─ 0.145 PyCapsule._build  <built-in>
│ │              └─ 0.001 retrieve_from_cache  pyopencl/cache.py:265
│ │                 └─ 0.001 isdir  <frozen genericpath>:39
│ │                    └─ 0.001 stat  <built-in>
```
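For reference, a profile like the tree above can be collected without any extra tooling. (The trace format above looks like pyinstrument output; the stdlib's `cProfile` reports the same cumulative-time information as a flat table. The `program_build` function here is a hypothetical stand-in for the real `Program.build` call.)

```python
# Sketch: collecting a call profile of a build step using only the stdlib.
import cProfile
import io
import pstats
import time

def program_build():
    # Hypothetical stand-in for pyopencl's Program.build; replace with
    # cl.Program(ctx, src).build() to profile a real build.
    time.sleep(0.01)

pr = cProfile.Profile()
pr.enable()
program_build()
pr.disable()

# Render the top entries sorted by cumulative time, like the tree above.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```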
The slowdown appears to be coming from these calls. Timing the two separately, it looks like the second one specifically is to blame.
from pyopencl.
I think those `get_info` calls just trigger the actual build downstream (i.e., in pocl). Do they not show up in the uncached build (maybe in a different spot)?
Based on @matthiasdiener's comment and our discussion this morning, I made some more measurements, this time on the whole compile time. Specifically, I compared the first step time of grudge wave for:
- Caching enabled (i.e., not setting `PYOPENCL_NO_CACHE`). This is the path that calls `create_built_program_from_source_cached` and reads/writes the cache. Note: for this test I disabled cache reading to simulate a completely cold cache (and to eliminate cache reads resulting from cache writes in the same execution, which somehow does seem to happen).
- Caching disabled (setting `PYOPENCL_NO_CACHE=1`). This path just calls `prg.build(...)` directly.
If I understand correctly, the main time difference between these should come down to the cache writing time. Here's what I see (same setup as before, with rank-local cache dirs; also, I am manually applying the changes from #716, which don't seem to have made it to the version on conda yet):
The scaling is not good, but that could be due to DAG splat. Additionally, it seems as if the cache writing is taking a lot of time. However, if I add an (unused) call to `get_info(BINARIES)` in the non-cache version I see this:

which suggests that most of the time is coming from the `get_info` call, not the actual cache writing. Does this make sense? Is `get_info(BINARIES)` doing something inefficient?
> Is `get_info(BINARIES)` doing something inefficient?
It sure looks that way. It might require duplicate compilation in pocl? (I'm not sure where, but your second graph is enough for me.) Based on this, I think we should definitely turn off pyopencl's CL binary caching for pocl. PR?
It might also be worthwhile to understand what pocl is doing under the hood.
I think what happens is the following:
Example pyopencl code:
```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

rng = np.random.default_rng()
a = rng.random(50000, dtype=np.float32)
b = rng.random(50000, dtype=np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl_array.to_device(queue, a)
b_dev = cl_array.to_device(queue, b)
dest_dev = cl_array.empty_like(a_dev)

# The kernel body is repeated 1000 times to make compile time noticeable.
prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
                      __global const float *b,
                      __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] + b[gid];
    """ + "c[gid] = a[gid] + b[gid];"*1000 + "}").build()

knl = prg.sum  # Use this Kernel object for repeated calls
knl(queue, a.shape, None, a_dev.data, b_dev.data, dest_dev.data)

assert np.allclose(dest_dev.get(), a + b)
```
- Without `get_info(BINARIES)` (i.e., cache disabled): only 1 kernel gets compiled by pocl (via `pocl_llvm_codegen`):

  ```
  ./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/2000-1-1-goffs0-smallgrid/sum.so
  ```
- When `get_info(BINARIES)` is called (i.e., cache enabled), pocl also compiles a generic version via `pocl_driver_build_poclbinary`:

  ```
  ./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/2000-1-1-goffs0-smallgrid/sum.so
  ./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/0-0-0/sum.so
  ```
I haven't found a way to disable this behavior.
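One way to see this directly is to list the compiled kernel variants in pocl's kernel cache. This assumes `XDG_CACHE_HOME` was pointed at `./xdg-cache`, as in the paths above; with the extra poclbinary build, a generic `0-0-0` variant shows up next to the size-specific one:

```shell
# Sketch: list the kernel variants pocl has compiled into its kernel cache.
CACHE=./xdg-cache/pocl/kcache
mkdir -p "$CACHE"   # so the listing also works against a cold cache
find "$CACHE" -name '*.so' | sort
```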
Thanks for doing more digging here, @matthiasdiener! While we didn't decode that a "generic" kernel was being built, we did track down `pocl_driver_build_poclbinary` and concluded that it would likely trigger a compile and that, given @majosm's measurements, that compile was in addition to the "normal" from-source-for-execution build.
Important question: are all these conclusions still valid for the Nvidia target? They seem device-unspecific, but I don't know how a generic kernel would be different from a size-specific one in the GPU case.
At any rate, at least for CPU, we can probably save time by skipping pyopencl's binary cache if we're working with pocl.
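A minimal sketch of that workaround, using the `PYOPENCL_NO_CACHE` variable mentioned above. The helper name is hypothetical; it would be called with `ctx.devices[0].platform.name` before the first `Program.build()`:

```python
# Sketch: turn off pyopencl's binary cache when the CL platform is pocl.
import os

def disable_binary_cache_for_pocl(platform_name: str) -> bool:
    """Set PYOPENCL_NO_CACHE if the platform name looks like pocl."""
    name = platform_name.lower()
    if "pocl" in name or "portable computing language" in name:
        os.environ["PYOPENCL_NO_CACHE"] = "1"
        return True
    return False

print(disable_binary_cache_for_pocl("Portable Computing Language"))  # True
print(disable_binary_cache_for_pocl("NVIDIA CUDA"))                  # False
```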
Seems like the time spent in `get_info(BINARIES)` is much higher for CPUs than it is for GPUs. For combozzle on Lassen I'm seeing sub-millisecond times when running on the GPU, and up to 40 s when running on the CPU.