Low GPU usage on Windows · pbrt-v4 · 84 comments · closed

mmp commented on June 23, 2024

Low GPU usage on Windows

Comments (84)

pierremoreau commented on June 23, 2024

Indeed… currently generating a trace with Nsight Systems to take a closer look.

mmp commented on June 23, 2024

Step by step! I've pushed that fix to both branches.

pierremoreau commented on June 23, 2024

(BTW, I have merged the windows-gpu-rework branch to master, with some improvements to work around the performance hit it previously introduced on Linux. Let me know if I messed up anything on the Windows front...)

I tested the latest master on Windows, and it is still working fine; I am even seeing a 10–15% improvement over the windows-gpu-rework branch: I rendered the scene 3 times and ended up with results between 14.4 and 14.8 seconds, compared to the 16–17 seconds I reported yesterday on the windows-gpu-rework branch. 🎉

mmp commented on June 23, 2024

Here's the corresponding log on Linux on an RTX 2080:

GPU Kernel Profile:
  Generate Camera rays                               1024 launches    703.02 ms /   5.3% (avg  0.687, min  0.657, max   0.946)
  Generate ray samples - HaltonSampler               6144 launches    626.28 ms /   4.7% (avg  0.102, min  0.019, max   0.560)
  Reset queues before tracing rays                   6144 launches     33.01 ms /   0.2% (avg  0.005, min  0.004, max   0.008)
  Tracing closest hit rays                           6144 launches   4854.99 ms /  36.3% (avg  0.790, min  0.062, max  12.976)
  Handle emitters hit by indirect rays               6144 launches    148.44 ms /   1.1% (avg  0.024, min  0.016, max   0.047)
  ConductorMaterial + BxDF Eval (Basic tex)          5120 launches   1172.95 ms /   8.8% (avg  0.229, min  0.041, max   1.098)
  DiffuseMaterial + BxDF Eval (Basic tex)            5120 launches   4633.22 ms /  34.6% (avg  0.905, min  0.035, max   2.640)
  Tracing shadow rays                                5120 launches    483.30 ms /   3.6% (avg  0.094, min  0.038, max   0.380)
  Incorporate shadow ray contribution                5120 launches    194.88 ms /   1.5% (avg  0.038, min  0.016, max   0.071)
  Reset shadowRayQueue                               5120 launches     27.63 ms /   0.2% (avg  0.005, min  0.004, max   0.007)
  Handle medium transitions                          5120 launches     84.99 ms /   0.6% (avg  0.017, min  0.015, max   0.023)
  Update indirect ray stats                          5120 launches     28.26 ms /   0.2% (avg  0.006, min  0.003, max   0.008)
  Update Film                                        1024 launches    369.82 ms /   2.8% (avg  0.361, min  0.357, max   0.373)
  Other                                              2048 launches     10.82 ms /   0.1% (avg  0.005)

Total GPU time:  13371.61 ms

GPU Statistics:
    Camera rays                                                 718626816
    Indirect rays, depth 1                                      718039415
    Indirect rays, depth 2                                      411877638
    Indirect rays, depth 3                                       24306403
    Indirect rays, depth 4                                        7197480
    Indirect rays, depth 5                                        2514180
    Shadow rays, depth 0                                        359206403
    Shadow rays, depth 1                                        195455999
    Shadow rays, depth 2                                        107390027
    Shadow rays, depth 3                                          6470333
    Shadow rays, depth 4                                          2033811
[Other stats are all exactly the same.]

As a percentage of runtime, the kernels are all in the same range. And the number of rays traced is about the same.

It is interesting that the reported kernel times are much higher. I.e., if the issue were that the CPU isn't submitting enough work to keep the GPU busy and the GPU was idling, then presumably those times (which are measured by CUDA events before and after each launch) would still be low, but the overall run time would be high.
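
For reference, a minimal sketch of the event-based per-launch timing described above. This is an illustration under assumptions, not pbrt's actual profiling code, and SomeKernel is a hypothetical stand-in:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of pbrt's kernels.
__global__ void SomeKernel() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);      // enqueued into the stream just before the launch
    SomeKernel<<<914, 768>>>();  // launch dimensions as in Pierre's trace
    cudaEventRecord(stop);       // enqueued just after the launch

    cudaEventSynchronize(stop);  // block until the kernel and both events complete
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU time between the two events
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Because the elapsed time is taken between two events on the GPU's own timeline, CPU-side starvation would leave these per-kernel numbers small and inflate only the wall-clock time, which is exactly the distinction being drawn above.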

mmp commented on June 23, 2024

...and the issue doesn't seem to be that Windows is somehow running debug kernels. Here's Linux with a debug build (just 8 spp, so just compare the "avg" time for each kernel)--Windows is still 5-10x slower.

GPU Kernel Profile:
  Reset ray queue                                      16 launches     12.07 ms /   0.2% (avg  0.754, min  0.014, max  11.860)
  Generate Camera rays                                 16 launches    674.56 ms /   8.9% (avg 42.160, min 41.314, max  51.347)
  Generate ray samples - HaltonSampler                 96 launches    804.32 ms /  10.6% (avg  8.378, min  1.038, max  19.611)
  Reset queues before tracing rays                     96 launches     14.85 ms /   0.2% (avg  0.155, min  0.153, max   0.221)
  Tracing closest hit rays                             96 launches    128.72 ms /   1.7% (avg  1.341, min  0.173, max   3.507)
  Handle emitters hit by indirect rays                 96 launches    232.05 ms /   3.0% (avg  2.417, min  0.558, max   3.655)
  ConductorMaterial + BxDF Eval (Basic tex)            80 launches   1351.06 ms /  17.7% (avg 16.888, min  4.613, max  85.159)
  DiffuseMaterial + BxDF Eval (Basic tex)              80 launches   3985.23 ms /  52.3% (avg 49.815, min  4.581, max 106.935)
  Tracing shadow rays                                  80 launches     12.33 ms /   0.2% (avg  0.154, min  0.071, max   0.228)
  Incorporate shadow ray contribution                  80 launches    161.64 ms /   2.1% (avg  2.021, min  0.696, max   4.259)
  Handle medium transitions                            80 launches     56.19 ms /   0.7% (avg  0.702, min  0.693, max   0.712)
  Update Film                                          16 launches    183.48 ms /   2.4% (avg 11.467, min 11.417, max  11.580)
  Other                                               176 launches      3.51 ms /   0.0% (avg  0.020)

pierremoreau commented on June 23, 2024

Here is a zipped report from Nsight Systems (I hope others can open it just fine); I let it run about 20 secs, which seems to have been enough to do about 2.5 frames (judging by the repeating patterns in the trace).

Report 11.zip

pierremoreau commented on June 23, 2024

Some of the long-running kernels have the following in common: several hundred MB of local memory in total (though none on a per-thread basis), for example:

Local Memory Per Thread: 0 bytes
Local Memory Total: 441,188,352 bytes

I'll check with Nsight Compute, but it looks weird that those kernels have relatively few threads in total: for example, the one cited above for the local memory has a grid size of (914, 1, 1), a block size of (768, 1, 1), and 80 regs per thread.
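
For what it's worth, the static per-thread figures can also be queried through the CUDA runtime. A small sketch under assumptions; SomeKernel is a hypothetical placeholder, not an actual pbrt kernel name:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void SomeKernel() {}

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, SomeKernel);
    // numRegs and localSizeBytes are the compiler's static per-thread numbers,
    // corresponding to the "80 regs per thread" and "Local Memory Per Thread"
    // figures above; a (914, 1, 1) grid of (768, 1, 1) blocks is
    // 914 * 768 = 701,952 threads in total.
    printf("regs/thread: %d, local mem/thread: %zu bytes\n",
           attr.numRegs, attr.localSizeBytes);
    return 0;
}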

mmp commented on June 23, 2024

Success opening it!

One thing that's not present in Linux traces is the progress reporter stuff. What happens if you comment out the ProgressReporter declaration around line 350 of pathintegrator.cpp and then the calls to progress.Update() and progress.Done()?

I will post a linux trace for comparison in a sec (my first try was too big.)

pierremoreau commented on June 23, 2024

(my first try was too big.)

My first tries were way too big and never finished. I learned (after like 5+ tries) that I should start the profiling manually after several seconds and not automatically at the beginning; otherwise it just ended up with too many samples. Okay, I'll generate a new trace without the ProgressReporter.

mmp commented on June 23, 2024

Linux trace of 2s of a run.
report.zip

mmp commented on June 23, 2024

(It looks like you still have cudaDeviceSynchronize calls in there, but I'm not sure if it matters.)
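
As an aside, a sketch of why per-launch synchronization can matter; this shows the general pattern only and is not pbrt's code:

#include <cuda_runtime.h>

__global__ void Step(int i) {}

void SerializedLoop() {
    for (int i = 0; i < 1000; ++i) {
        Step<<<914, 768>>>(i);
        cudaDeviceSynchronize();  // the CPU blocks after every launch
    }
}

void PipelinedLoop() {
    for (int i = 0; i < 1000; ++i)
        Step<<<914, 768>>>(i);    // launches are asynchronous and queue up
    cudaDeviceSynchronize();      // a single wait at the end
}

In the serialized form, any per-launch CPU or driver overhead lands on the critical path, which is where an OS-specific difference in launch overhead would be amplified.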

pierremoreau commented on June 23, 2024

I was still on the master branch.

mmp commented on June 23, 2024

So I think this sums up the symptoms, though the cause isn't yet clear to me...

Screenshot from 2020-08-30 13-13-21

The top grey bar is the total time charged to an "Incorporate shadow ray contribution" launch. It's long. In the lower part, we can see that the GPU is basically idle, except for a short time when it is actually doing work.

There are two weird things: first, the _ZN4pbrt... thing, in blue, is taking a long time before it launches the kernel. Then, that cudaDeviceSynchronize() stalls for a long time. If you zoom in further to the point where it clears, there's this:

Screenshot from 2020-08-30 13-17-13

Which suggests it's stalled on TaskRunner(?!)

pierremoreau commented on June 23, 2024

I seem to get a 2x speedup with windows-gpu-rework as it now estimates a total time of about 3600 seconds. Now let's comment out the ProgressReporter on top of that.

mmp commented on June 23, 2024

Nice! Fingers crossed.

mmp commented on June 23, 2024

(And the ProgressReporter thing may well not make a difference--it is expected that that thread will sleep, periodically wait on a CUDA event to see if it's cleared, and then print some stuff. So it's expected to be blocked a lot. But if for some reason it's interfering with the main command stream...)
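
A sketch of that pattern, under assumptions rather than taken from pbrt's source: the progress thread polls a CUDA event with cudaEventQuery, which returns cudaSuccess once all work queued before the event has completed (and cudaErrorNotReady until then), so in principle the polling thread never touches the main command stream:

#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

__global__ void Busy() {}  // hypothetical placeholder workload

void ProgressThread(cudaEvent_t done) {
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        // Sleep well off the critical path, then poll again.
        std::this_thread::sleep_for(std::chrono::milliseconds(250));
        printf("still rendering...\n");
    }
    printf("done!\n");
}

int main() {
    cudaEvent_t done;
    cudaEventCreate(&done);
    Busy<<<1024, 256>>>();
    cudaEventRecord(done);  // the event "clears" once Busy has finished
    std::thread t(ProgressThread, done);
    t.join();
    cudaEventDestroy(done);
    return 0;
}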

pierremoreau commented on June 23, 2024

Looking a lot tighter! I will make one with the ProgressReporter enabled again but still on the new branch.

Report 12.zip

pierremoreau commented on June 23, 2024

The ProgressReporter does not seem to make any difference regarding timings.

richardmgoodin commented on June 23, 2024

Did you guys see my question about Windows emulating shared memory by copying large blocks of memory back and forth? Can you tell from the trace if that is happening?

pierremoreau commented on June 23, 2024

I saw the question; no idea, though I would expect those copies to show up in the trace.

richardmgoodin commented on June 23, 2024

We seem to be spending an inordinate amount of time in GenerateCameraRays. Is that expected?

mmp commented on June 23, 2024

Catching up on a few things...

@pierremoreau those block and grid sizes are fine, I believe: a grid of 914 blocks * 768 threads per block (914 * 768 = 701,952 threads) works out to about half the image's pixels, which makes sense, since it renders in two chunks of scanlines.

@richardmgoodin I don't think it's memory copies--those show up separately in the traces and seem to be <0.1% of the total, which is to be expected (basically just enough to copy parameters over to OptiX when it is kicked off). (Under CUDA() / stream 7 and CUDA / stream 15 in Pierre's latest trace; one of those is OptiX and one is all the rest of pbrt's kernels.) I was a little surprised by that--I had guessed that was going to be the problem...

It is definitely a ton of time in GenerateCameraRays, but all of the other kernels are seeing a similar proportional increase in time spent in them, so I don't think it's just that.

Will keep poking at Pierre's latest trace...

mmp commented on June 23, 2024

A lot of time seems to be spent waiting in GetProfilerEvents() (but that, too, may be expected...) The basic idea is that there is a small pool of 1k cudaEvent_t structs to put before/after each kernel launch for profiling. When it runs out of them, it waits for some to free up as earlier kernels complete. Normally this shouldn't slow down the GPU, since there should be plenty of work buffered up, but it's unclear what's happening here.

Could someone try changing the 1024 passed to the resize() call at line 64 of GetProfilerEvents() in launch.cpp to 1000000 and see if anything changes? That should be enough that it will never need to wait for them when rendering killeroos-gold.

(I don't have a lot of confidence in this theory, since the traces seem to show that the kernels are filling the machine, just running really slow, but it's worth a shot..)
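
To make the mechanism concrete, here is a rough sketch of such an event pool. The function names are invented for illustration, and the real GetProfilerEvents() in launch.cpp will differ in its details:

#include <cuda_runtime.h>
#include <deque>

struct EventPair { cudaEvent_t start, stop; };

static std::deque<EventPair> freeList;  // pre-created, not yet recorded
static std::deque<EventPair> inFlight;  // oldest recorded pair at the front

void InitProfilerEventPool(size_t n) {  // n = 1024 today; 1000000 is the test
    for (size_t i = 0; i < n; ++i) {
        EventPair p;
        cudaEventCreate(&p.start);
        cudaEventCreate(&p.stop);
        freeList.push_back(p);
    }
}

EventPair GetProfilerEventPair() {
    if (freeList.empty()) {
        // Pool exhausted: block until the oldest in-flight pair has retired
        // on the GPU, then recycle it. This is the wait in question.
        cudaEventSynchronize(inFlight.front().stop);
        freeList.push_back(inFlight.front());
        inFlight.pop_front();
    }
    EventPair p = freeList.front();
    freeList.pop_front();
    inFlight.push_back(p);
    return p;  // the caller records p.start / p.stop around the next launch
}

As long as plenty of launches are buffered ahead of the wait, this should not starve the GPU, which is why the theory above is tentative.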

mmp commented on June 23, 2024

Drat. 12.7s on Linux.

richardmgoodin commented on June 23, 2024

I keep asking myself what is different. The kernels are compiled identically with nvcc. We have two different pieces of hardware that are seeing about the same slowdown. We are not seeing slowdowns in memory traffic. What can the driver configure differently? Is a cache turned off somewhere? I am seeing a lot more compile warnings on Windows than I saw today when building for Linux, which was by and large clean. I can't imagine any compiler differences would account for things, since those are on the CPU side and it appears that our kernels are just taking quite a bit longer to run.

pierremoreau commented on June 23, 2024

Tested again on the windows-gpu-rework branch with eventPool.resize(1000000) and the ProgressReporter commented out; it is still taking about 3,400–3,500 seconds.

pierremoreau commented on June 23, 2024

I tested the windows-gpu-rework branch on the Linux boot, and I get an almost 10x speed-up compared to Windows (it took 417 seconds to render):

GPU Kernel Profile:
  Generate Camera rays                               1024 launches  34539.33 ms /   8.3% (avg 33.730, min 33.382, max  37.200)
  Generate ray samples - HaltonSampler               6144 launches  58965.61 ms /  14.1% (avg  9.597, min  3.464, max  18.651)
  Reset queues before tracing rays                   6144 launches   1317.25 ms /   0.3% (avg  0.214, min  0.206, max   0.370)
  Tracing closest hit rays                           6144 launches   5049.73 ms /   1.2% (avg  0.822, min  0.068, max  15.452)
  Handle emitters hit by indirect rays               6144 launches  21629.62 ms /   5.2% (avg  3.520, min  3.262, max   4.305)
  ConductorMaterial + BxDF Eval (Basic tex)          5120 launches  65866.45 ms /  15.8% (avg 12.865, min  4.630, max  60.426)
  DiffuseMaterial + BxDF Eval (Basic tex)            5120 launches 173395.52 ms /  41.5% (avg 33.866, min  4.435, max  74.719)
  Tracing shadow rays                                5120 launches    488.70 ms /   0.1% (avg  0.095, min  0.042, max   0.163)
  Incorporate shadow ray contribution                5120 launches  24207.30 ms /   5.8% (avg  4.728, min  3.393, max   7.132)
  Reset shadowRayQueue                               5120 launches    430.44 ms /   0.1% (avg  0.084, min  0.080, max   0.115)
  Handle medium transitions                          5120 launches  17618.89 ms /   4.2% (avg  3.441, min  3.401, max   3.728)
  Update Film                                        1024 launches  13703.16 ms /   3.3% (avg 13.382, min 13.259, max  13.697)
  Other                                              7168 launches    564.33 ms /   0.1% (avg  0.079)

Total GPU time: 417776.31 ms

GPU Statistics:
    Camera rays                                                 718626816
    Indirect rays, depth 1                                      718039417
    Indirect rays, depth 2                                      411877673
    Indirect rays, depth 3                                       24306355
    Indirect rays, depth 4                                        7197276
    Indirect rays, depth 5                                        2514007
    Shadow rays, depth 0                                        359206412
    Shadow rays, depth 1                                        195456010
    Shadow rays, depth 2                                        107390074
    Shadow rays, depth 3                                          6470485
    Shadow rays, depth 4                                          2033699

And on the master branch:

GPU Kernel Profile:
  Generate Camera rays                               1024 launches  29981.08 ms /   8.9% (avg 29.278, min 29.043, max  33.308)
  Generate ray samples - HaltonSampler               6144 launches  42534.34 ms /  12.6% (avg  6.923, min  0.940, max  15.806)
  Reset queues before tracing rays                   6144 launches    937.46 ms /   0.3% (avg  0.153, min  0.147, max   0.221)
  Tracing closest hit rays                           6144 launches   5157.17 ms /   1.5% (avg  0.839, min  0.070, max  19.147)
  Handle emitters hit by indirect rays               6144 launches  10640.78 ms /   3.1% (avg  1.732, min  0.567, max   3.028)
  ConductorMaterial + BxDF Eval (Basic tex)          5120 launches  58128.32 ms /  17.2% (avg 11.353, min  3.093, max  60.935)
  DiffuseMaterial + BxDF Eval (Basic tex)            5120 launches 165481.17 ms /  48.9% (avg 32.321, min  2.957, max  72.519)
  Tracing shadow rays                                5120 launches    487.04 ms /   0.1% (avg  0.095, min  0.043, max   0.168)
  Incorporate shadow ray contribution                5120 launches  10385.46 ms /   3.1% (avg  2.028, min  0.699, max   4.444)
  Handle medium transitions                          5120 launches   3667.10 ms /   1.1% (avg  0.716, min  0.700, max   0.968)
  Update Film                                        1024 launches  10805.38 ms /   3.2% (avg 10.552, min 10.447, max  10.819)
  Other                                             12288 launches    233.18 ms /   0.1% (avg  0.019)

Total GPU time: 338438.50 ms

GPU Statistics:
    Camera rays                                                 718626816
    Indirect rays, depth 1                                      718039417
    Indirect rays, depth 2                                      411877673
    Indirect rays, depth 3                                       24306355
    Indirect rays, depth 4                                        7197276
    Indirect rays, depth 5                                        2514007
    Shadow rays, depth 0                                        359206412
    Shadow rays, depth 1                                        195456010
    Shadow rays, depth 2                                        107390074
    Shadow rays, depth 3                                          6470485
    Shadow rays, depth 4                                          2033699

Still a 20x difference compared to your results. I am very confused how your 2080 is beating my 2080 Ti by 20x. I wonder if it's a CPU or storage difference.

richardmgoodin commented on June 23, 2024

I was just looking at the build log under Windows. Should nvcc be generating code for the following?

-gencode=arch=compute_52,code="sm_52,compute_52"

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code="sm_52,compute_52" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -v -g -D_MBCS -Xcompiler "/EHsc /W0 /nologo /Od /FdDebug\vc142.pdb /FS /Zi /RTC1 /MDd " -o Debug\CMakeCUDACompilerId.cu.obj "C:\cygwin64\home\goodin\pbrt-v4\build\CMakeFiles\3.18.2\CompilerIdCUDA\CMakeCUDACompilerId.cu"

pierremoreau commented on June 23, 2024

Weird; in the initial log you posted in #20 it did generate for SM 7.0 (correctly; you have a GV100, right?).

richardmgoodin commented on June 23, 2024

Yes, that's correct. I'm digging into the Linux side now to see if there is a difference (which I would expect).

richardmgoodin commented on June 23, 2024

Here's what I'm seeing on Linux:

/usr/local/cuda-11.0/bin/nvcc  $(CUDA_DEFINES) $(CUDA_INCLUDES) $(CUDA_FLAGS) -x cu -dc /home/goodin/pbrt-v4/src/pbrt/lights.cpp -o CMakeFiles/pbrt_lib.dir/src/pbrt/lights.cpp.o

CUDA_FLAGS = -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 --use_fast_math -lineinfo --maxrregcount 128 --gpu-architecture=sm_70 -O3 -DNDEBUG -D NDEBUG

CUDA_DEFINES = -DNVTX -DPBRT_BUILD_GPU_RENDERER -DPBRT_HAVE_MMAP -DPBRT_HAVE_POSIX_MEMALIGN -DPBRT_IS_LINUX -DPBRT_NOINLINE="attribute((noinline))" -DPTEX_STATIC

CUDA_INCLUDES = -I/home/goodin/NVIDIA-OptiX-SDK-7.1.0-linux64-x86_64/include -I/home/goodin/pbrt-v4/src -I/home/goodin/pbrt-v4/build -I/home/goodin/pbrt-v4/src/ext/openvdb/nanovdb

I don't know enough about CMake to understand why it is generating sm_52.

richardmgoodin commented on June 23, 2024

Are you seeing the same thing under Windows, or is my CUDA install hosed somehow?

pierremoreau commented on June 23, 2024

Look at: https://github.com/mmp/pbrt-v4/blob/master/CMakeLists.txt#L136-L161

Do you happen to also have a Maxwell card in your box?

richardmgoodin commented on June 23, 2024

No, just a single GV100

pierremoreau commented on June 23, 2024

I will reboot to Windows in a bit, but last I checked it properly detected the RTX 2080 Ti.

Try modifying https://github.com/mmp/pbrt-v4/blob/master/cmake/checkcuda.cu or stepping through it to understand why it emits the wrong SM capability.
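
For reference, a checkcuda-style helper can be as small as the sketch below; this is an assumption-laden illustration, and the real cmake/checkcuda.cu may differ. It queries the first device and prints the sm_XY string that CMake captures into ARCH:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return 1;  // no usable CUDA device
    // e.g. prints "sm_70" on a GV100 and "sm_75" on an RTX 2080 Ti
    printf("sm_%d%d\n", prop.major, prop.minor);
    return 0;
}

Stepping through something like this (or adding prints to it) should show whether the wrong device is being queried or the query itself is failing.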

pierremoreau commented on June 23, 2024

Alternatively, you could force ARCH in the CMake file to sm_70. But it would be nice to understand where the sm_52 is coming from.

pierremoreau commented on June 23, 2024

If you modify the main CMake file (just to add a print or something, for example) and re-run CMake, what does it say after "CUDA Architecture:"? Here is what I get on Linux:

-- Found CUDA: 
-- CUDA Architecture: sm_75

pierremoreau commented on June 23, 2024

I checked on Windows and I am getting sm_75 as expected.

richardmgoodin commented on June 23, 2024

What was happening is that early on I was hand-editing CMakeLists.txt to include OptiX. When I do this it generates sm_70 as expected. When I started running cmake-gui I would run the configure pass without OptiX set, edit it in the window, and then run generate. I'm rebuilding now with sm_70, but since you have always been running with sm_75 I don't think that was the issue.

Did you take a look at the nvcc compile flag differences other than the SM version, and is there anything jumping out that could cause the differences in kernel execution? It looks like there are quite a few differences. -maxrregcount jumps out to me.

richardmgoodin commented on June 23, 2024

Is "-G" debug in nvcc

pierremoreau commented on June 23, 2024

Is "-G" debug in nvcc

Correct, debug info AFAIR.

I'll have a look at the flags but they looked reasonable. Also they are the same on Linux and Windows AFAIR.

mmp commented on June 23, 2024

Regarding the sm_52 thing in #24 (comment), that looks like the "checkcuda" thing around line 128 of CMakeLists.txt, where it's just compiling a small program to run to check the GPU's capabilities, not the main build.

I did just notice this:

        if (CMAKE_BUILD_TYPE MATCHES Release)
          set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -lineinfo --maxrregcount 128")
        else()
          set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -G -g")
        endif ()

If you're doing a RelWithDebInfo build, I'm wondering if that's not working as expected... (RelWithDebInfo doesn't match "Release", so it would fall into the else branch and pick up -G.)

pierremoreau commented on June 23, 2024

I saw that part and changed it locally, and then thought "Oh, I think this is just adding debug symbols so it should be fine to leave as-is", so I then reverted it.

So I am re-compiling with it swapped to test against "Debug" instead. nvcc is warning about maxrregcount being redefined; I'm not sure where the -maxrregcount=0 is coming from.

richardmgoodin commented on June 23, 2024

Here is what I'm seeing from my Release build. Note the maxrregcount seems to contradict the CMake code above; also the -G is there. This was copied from my build output directly from VS2019.

22>C:\cygwin64\home\goodin\pbrt-v4\build>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.1.0\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src" -I"C:\cygwin64\home\goodin\pbrt-v4\build" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openvdb\nanovdb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\stb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Imath" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Half" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Iex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\OpenEXR\IlmImf" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\IlmBase\config" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\OpenEXR\config" -I"C:\Program Files (x86)\PBRT-V4\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\filesystem" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\ptex\src\ptex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\double-conversion" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -G --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 -Xcompiler="/EHsc -Ob2" -g -use_fast_math -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="Release"" -DWIN32 -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="Release"" -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\Release\pbrt_lib.pdb /FS /Zi /MD /GR" -o pbrt_lib.dir\Release\cameras.obj "C:\cygwin64\home\goodin\pbrt-v4\src\pbrt\cameras.cpp"

pierremoreau commented on June 23, 2024

Okay, so doing

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 22c8abc..797db35 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -143,10 +143,10 @@ if (CMAKE_CUDA_COMPILER)
                          OUTPUT_VARIABLE ARCH)

         set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --std=c++17")
-        if (CMAKE_BUILD_TYPE MATCHES Release)
-          set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -lineinfo --maxrregcount 128")
-        else()
+        if (CMAKE_BUILD_TYPE MATCHES Debug)
           set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -G -g")
+        else()
+          set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -lineinfo --maxrregcount 128")
         endif ()

and it looks like it's now rendering in about 1200 seconds. So progress!

pierremoreau commented on June 23, 2024

@mmp If you change --maxrregcount into -maxrregcount, CMake will detect it and avoid emitting its default -maxrregcount=0. It's a similar issue to https://gitlab.kitware.com/cmake/cmake/-/issues/20164.

richardmgoodin commented on June 23, 2024

What's weird is I was running a full Release build, so the change shouldn't affect me, right?

pierremoreau commented on June 23, 2024

Probably not, which would explain why you were still getting much better timings than I did, even though it was still slower than what Matt gets on Linux.

richardmgoodin commented on June 23, 2024

I just ran a rebuild of the ToT. No "-G", but -maxrregcount=0 is still there.

richardmgoodin commented on June 23, 2024

Oh, I see now. There are both -maxrregcount=0 and --maxrregcount 128.

pierremoreau commented on June 23, 2024

@richardmgoodin Did you apply the change I suggested here?

richardmgoodin commented on June 23, 2024

No, trying it now

pierremoreau commented on June 23, 2024

@mmp \o/ With the CMake tweaks, I now render the scene in 14.3 seconds on Linux! So now we only need to figure out where the two remaining orders of magnitude are coming from on Windows.

richardmgoodin commented on June 23, 2024

Is this in the windows-gpu-rework branch?

pierremoreau commented on June 23, 2024

The 14.3 seconds was on the master branch; I haven’t tried the CMake tweaks + the windows-gpu-rework branch on Linux.

pierremoreau commented on June 23, 2024

I take it back: I tested windows-gpu-rework + the CMake tweak on Windows with only the RTX 2080 Ti in the box, and it rendered in 16 seconds.

richardmgoodin commented on June 23, 2024

I don't see anything unexpected anymore but just for the record, here's what I'm seeing for a full release build.

C:\cygwin64\home\goodin\pbrt-v4\build>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.1.0\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src" -I"C:\cygwin64\home\goodin\pbrt-v4\build" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openvdb\nanovdb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\stb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Imath" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Half" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Iex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\OpenEXR\IlmImf" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\IlmBase\config" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\OpenEXR\config" -I"C:\Program Files (x86)\PBRT-V4\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\filesystem" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\ptex\src\ptex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\double-conversion" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\Release -maxrregcount=128 --machine 64 --compile -cudart static -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 -lineinfo -Xcompiler="/EHsc -Ob2" -use_fast_math -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="Release"" -DWIN32 -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="Release"" -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\Release\pbrt_lib.pdb /FS /Zi /MD /GR" -o pbrt_lib.dir\Release\subsurface.obj "C:\cygwin64\home\goodin\pbrt-v4\src\pbrt\gpu\subsurface.cpp"

richardmgoodin commented on June 23, 2024

killeroos-gold running a full Release build, no windows-gpu-rework: 121.4s.

mmp commented on June 23, 2024

I just pushed the -maxrregcount fix (to both branches).

So just to be sure I'm caught up: current status is that Pierre is seeing expected good perf on both Windows and Linux, but Richard is still 10x off on both?

(Edit: re-reading, it looks like Pierre was just reporting Linux above.)

There is the difference in RT cores, but IIRC Richard's slowdown was across all kernels.

pierremoreau commented on June 23, 2024

(Edit: re-reading, it looks like Pierre was just reporting Linux above.)

I reported on both: the first comment was about Linux indeed, but the later one does say Windows in it. 😛

It was probably the “I take it back” part at the beginning that was confusing; it was referring to this earlier comment about Windows results.

So to sum up, the scene is now rendering in ~14 seconds on Linux, and about 16–17 on Windows.

pierremoreau commented on June 23, 2024

but Richard is still 10x off on both?

The numbers Richard reported were from the master branch (IIRC), so without all the sync’ing improvements you made on the windows-gpu-rework branch. I got a 2x speedup from switching to the windows-gpu-rework branch, and that was in debug mode, so I wouldn’t be surprised if the effect is even more pronounced in release mode; I think Richard should be getting close to our results now.

richardmgoodin commented on June 23, 2024

I'm building the windows-gpu-rework code now. I'm still getting 123.5s. I find it really suspicious that I'm getting the same number. I need to step away for another commitment for a couple of hours, but when I get back I'll be a little more rigorous about verifying what is going on. Just to confirm: to check out the branch, I just add "--branch windows-gpu-rework" and the code pulled is only from the branch?

pierremoreau commented on June 23, 2024

Just to confirm to check out the branch I just add "--branch windows-gpu-rework" and the code pulled is only from the branch?

The checked out code should be coming from that branch, correct. You can easily switch between branches by using git checkout $branchName or git switch $branchName, so you do not need to clone every time you want to test a different branch.

richardmgoodin commented on June 23, 2024

I just pulled top of tree and built under Linux. Killeroos-gold is running at 17.9s. So it doesn't seem to be a GV100 issue.

richardmgoodin commented on June 23, 2024

Windows top of tree, RelWithDebInfo: killeroos-gold 121.3s. I also got a build failure when building with a new pull that affected imgtool and cyhair2pbrt; cleaning the build and rebuilding resolved the failure. Here's the nvcc line:

23>C:\cygwin64\home\goodin\pbrt-v4\build>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.1.0\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src" -I"C:\cygwin64\home\goodin\pbrt-v4\build" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openvdb\nanovdb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\stb" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Imath" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Half" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\IlmBase\Iex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\openexr\OpenEXR\IlmImf" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\IlmBase\config" -I"C:\cygwin64\home\goodin\pbrt-v4\build\src\ext\openexr\OpenEXR\config" -I"C:\Program Files (x86)\PBRT-V4\include" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\filesystem" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\ptex\src\ptex" -I"C:\cygwin64\home\goodin\pbrt-v4\src\ext\double-conversion" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\RelWithDebInfo -maxrregcount=128 --machine 64 --compile -cudart static -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 -lineinfo -Xcompiler="/EHsc -Zi -Ob1" -use_fast_math -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="RelWithDebInfo"" -DWIN32 -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DNVTX -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="RelWithDebInfo"" -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\RelWithDebInfo\pbrt_lib.pdb /FS /Zi /MD /GR" -o pbrt_lib.dir\RelWithDebInfo\cameras.obj "C:\cygwin64\home\goodin\pbrt-v4\src\pbrt\cameras.cpp"

mmp commented on June 23, 2024

Very strange: I'm not seeing an -O3 being passed to nvcc there... (Or --gpu-architecture=...)

richardmgoodin commented on June 23, 2024

I've never seen "--gpu-architecture" in the Windows build. Windows looks like it is passing /O3, but does nvcc understand the "/" prefix args?

richardmgoodin commented on June 23, 2024

I'm looking at CMakeLists.txt. It used to have "-O3" in a very old version, but it isn't there now.

richardmgoodin commented on June 23, 2024

Look around line 149 where we made the change earlier. I don't see -O3 there. I do see "-std=c++17". At line 189 I see both "c++17" and "-O3". I don't know enough about CMake to know which it is using or why things are defined twice.

mmp commented on June 23, 2024

(BTW, I have merged the windows-gpu-rework branch to master, with some improvements to work around the performance hit it previously introduced on Linux. Let me know if I messed up anything on the Windows front...)

richardmgoodin commented on June 23, 2024

I added -O3 to the Release build; no improvement. The image looks correct with your changes. Here's my environment:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:48_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28540450_0

"C:/ProgramData/NVIDIA Corporation/OptiX SDK 7.1.0"

driver version 451.98

richardmgoodin commented on June 23, 2024

Here's what CMake is finding
configuration:
Selecting Windows SDK version 10.0.18362.0 to target Windows 10.0.19041.
The CXX compiler identification is MSVC 19.27.29111.0
The C compiler identification is MSVC 19.27.29111.0
Detecting CXX compiler ABI info
Detecting CXX compiler ABI info - done
Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Professional/VC/Tools/MSVC/14.27.29110/bin/Hostx64/x64/cl.exe - skipped
Detecting CXX compile features
Detecting CXX compile features - done
Detecting C compiler ABI info
Detecting C compiler ABI info - done
Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Professional/VC/Tools/MSVC/14.27.29110/bin/Hostx64/x64/cl.exe - skipped
Detecting C compile features
Detecting C compile features - done
Found Git: C:/cygwin64/bin/git.exe (found version "2.28.0")
Looking for pthread.h
Looking for pthread.h - not found
Found Threads: TRUE
Found ZLIB: C:/Program Files (x86)/PBRT-V4/lib/zlibstatic.lib (found version "1.2.8")
Configure ILMBASE Version: 2.5.3 Lib API: 25.0.2
CMake Warning (dev) at src/ext/openexr/IlmBase/config/IlmBaseSetup.cmake:56 (option):
Policy CMP0077 is not set: option() honors normal variables. Run "cmake
--help-policy CMP0077" for policy details. Use the cmake_policy command to
set the policy and suppress this warning.

For compatibility with older versions of CMake, option is clearing the
normal variable 'BUILD_SHARED_LIBS'.
Call Stack (most recent call first):
src/ext/openexr/IlmBase/CMakeLists.txt:35 (include)
This warning is for project developers. Use -Wno-dev to suppress it.

Looking for include file ucontext.h
Looking for include file ucontext.h - not found
-- WARNING pkg-config generation disabled
Configure OpenEXR Version: 2.5.3 Lib API: 25.0.2
Performing Test OPENEXR_IMF_HAVE_SYSCONF_NPROCESSORS_ONLN
Performing Test OPENEXR_IMF_HAVE_SYSCONF_NPROCESSORS_ONLN - Failed
Performing Test OPENEXR_IMF_HAVE_GCC_INLINE_ASM_AVX
Performing Test OPENEXR_IMF_HAVE_GCC_INLINE_ASM_AVX - Failed
clang-format not found.
Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
Looking for a CUDA compiler
Looking for a CUDA compiler - C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/bin/nvcc.exe
Found CUDA: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0 (found version "11.0")
Found CUDA: 11.0
The CUDA compiler identification is NVIDIA 11.0.194
Detecting CUDA compiler ABI info
Detecting CUDA compiler ABI info - done
Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/bin/nvcc.exe - skipped
Detecting CUDA compile features
Detecting CUDA compile features - done
checkcuda.cu

Creating library C:\cygwin64\home\goodin\pbrt-v4\build\checkcuda.lib and object C:\cygwin64\home\goodin\pbrt-v4\build\checkcuda.exp

CUDA Architecture: sm_70
Performing Test COMPILER_SUPPORTS_MARCH_NATIVE
Performing Test COMPILER_SUPPORTS_MARCH_NATIVE - Failed
Performing Test HAVE_MMAP
Performing Test HAVE_MMAP - Failed
Performing Test HAS_INTRIN_H
Performing Test HAS_INTRIN_H - Success
Unable to find -lprofiler
Performing Test HAVE_DECLSPEC_NOINLINE
Performing Test HAVE_DECLSPEC_NOINLINE - Success
Performing Test HAVE_ATTRIBUTE_NOINLINE
Performing Test HAVE_ATTRIBUTE_NOINLINE - Failed
Performing Test HAVE__ALIGNED_MALLOC
Performing Test HAVE__ALIGNED_MALLOC - Success
Performing Test HAVE_POSIX_MEMALIGN
Performing Test HAVE_POSIX_MEMALIGN - Failed
Performing Test INT64_IS_OWN_TYPE
Performing Test INT64_IS_OWN_TYPE - Failed
Configuring done
generation:
Selecting Windows SDK version 10.0.18362.0 to target Windows 10.0.19041.
Configure ILMBASE Version: 2.5.3 Lib API: 25.0.2
-- WARNING pkg-config generation disabled
Configure OpenEXR Version: 2.5.3 Lib API: 25.0.2
clang-format not found.
Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
Found CUDA: 11.0
checkcuda.cu

Creating library C:\cygwin64\home\goodin\pbrt-v4\build\checkcuda.lib and object C:\cygwin64\home\goodin\pbrt-v4\build\checkcuda.exp

CUDA Architecture: sm_70
Unable to find -lprofiler
Configuring done

richardmgoodin commented on June 23, 2024

Maybe I'm going back to the same well too much but I compared nvcc lines between the Linux and Windows build. I'm seeing two differences:

Windows:
-maxrregcount=128
Linux:
-maxrregcount 128

Windows:
-gencode=arch=compute_70,code="sm_70,compute_70"
Linux:
--gpu-architecture=sm_70

Everything else is the same. I don't know enough CMake to change things over in the Windows build to go further. I only see line 153 in CMakeLists.txt, which contains:

    set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --std=c++17")
    if (CMAKE_BUILD_TYPE MATCHES Debug)
      set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -G -g")
    else()
      set (CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --use_fast_math -lineinfo -maxrregcount 128 -O3")
    endif ()

Which I think is correct, but not what I'm seeing in the build log. I'm assuming -gencode=arch=compute_70,code="sm_70,compute_70" and "--gpu-architecture=sm_70" are equivalent. I see the "-gpu-architecture" line at line 160, but I don't see it in the build log either.

richardmgoodin commented on June 23, 2024

Pierre, what version of CMake are you running? I'm running:

$ cmake --version
cmake version 3.18.2
CMake suite maintained and supported by Kitware (kitware.com/cmake).

pierremoreau commented on June 23, 2024

Pierre, what version of CMake are you running.

I am also running CMake 3.18.2 on Windows.

Look around line 149 where we made the change earlier. I don't see -O3 there. I do see "-std=c++17". At line 189 I see both "c++17" and "-O3". I don't know enough about CMake to know which it is using or why things are defined twice.

The double definition is probably coming from one set being directed to nvcc while the other one is being forwarded to the host compiler for compiling the host code; this is something I wanted to look at regarding the CMake configuration and see if it could be improved.

From one of the command line examples you posted, the argument to -Xcompiler is being forwarded to cl.exe so nvcc won't be seeing the /O2 since it is meant for the host compiler: -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\RelWithDebInfo\pbrt_lib.pdb /FS /Zi /MD /GR"

From nvcc --help, for the --gpu-architecture entry:

For example, 'nvcc --gpu-architecture=sm_50' is equivalent to 'nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,compute_50'.

and from the --generate-code (-gencode) entry:

In fact, --gpu-architecture=<arch> --gpu-code=<code>, ... is equivalent to --generate-code arch=<arch>,code=<code>,....

so these differences in the command lines you were seeing between Linux and Windows should not matter:

Windows:
-gencode=arch=compute_70,code="sm_70,compute_70"
Linux:
--gpu-architecture=sm_70

Very strange: I'm not seeing an -O3 being passed to nvcc there... (Or --gpu-architecture=...)

and

I've never seen "--gpu-architecture" in the Windows build.

It is passed in; it's even the very first argument passed to nvcc, which is why we all missed it, because we were looking towards the end 😉 Have another look at the command line.

richardmgoodin commented on June 23, 2024

So Pierre, on your command line you are also seeing "-maxrregcount=128"? I ask because in all the nvcc docs I can find it supposedly should be "-maxrregcount 128" or "--maxrregcount=128". Can you tell me how to get CMake to generate "-maxrregcount 128" instead of "-maxrregcount=128"?

pierremoreau commented on June 23, 2024

Here is what one of the command lines looks like:

D:\Builds\pbrt-v4>"D:\Program Files\NVIDIA Corporation\CUDA\v11.0\Toolkit\bin\nvcc.exe" -gencode=arch=compute_75,code="sm_75,compute_75" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64" -x cu -rdc=true -I"D:\Program Files\NVIDIA Corporation\CUDA\v11.0\Toolkit\include" -I"D:\Program Files\NVIDIA Corporation\OptiX\v7.1.0\include" -I"D:\Softwares\pbrt-v4\src" -I"D:\Builds\pbrt-v4" -I"D:\Softwares\pbrt-v4\src\ext\openvdb\nanovdb" -I"D:\Softwares\pbrt-v4\src\ext" -I"D:\Softwares\pbrt-v4\src\ext\stb" -I"D:\Softwares\pbrt-v4\src\ext\openexr\IlmBase\Imath" -I"D:\Softwares\pbrt-v4\src\ext\openexr\IlmBase\Half" -I"D:\Softwares\pbrt-v4\src\ext\openexr\IlmBase\Iex" -I"D:\Softwares\pbrt-v4\src\ext\openexr\OpenEXR\IlmImf" -I"D:\Builds\pbrt-v4\src\ext\openexr\IlmBase\config" -I"D:\Builds\pbrt-v4\src\ext\openexr\OpenEXR\config" -I"D:\Softwares\pbrt-v4\src\ext\zlib" -I"D:\Builds\pbrt-v4\src\ext\zlib" -I"D:\Softwares\pbrt-v4\src\ext\filesystem" -I"D:\Softwares\pbrt-v4\src\ext\ptex\src\ptex" -I"D:\Softwares\pbrt-v4\src\ext\double-conversion" -I"D:\Program Files\NVIDIA Corporation\CUDA\v11.0\Toolkit\include" --keep-dir x64\RelWithDebInfo -maxrregcount=128 --machine 64 --compile -cudart static -Xcudafe --diag_suppress=partial_override -Xcudafe --diag_suppress=virtual_function_decl_hidden -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=declared_but_not_referenced -Xcudafe --diag_suppress=implicit_return_from_non_void_function --expt-relaxed-constexpr --extended-lambda -Xnvlink -suppress-stack-size-warning --std=c++17 -lineinfo -Xcompiler="/EHsc -Zi -Ob1" -use_fast_math -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DNVTX -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="RelWithDebInfo"" -DWIN32 -D_WINDOWS -DNDEBUG -D_CRT_SECURE_NO_WARNINGS -DNVTX -DPBRT_IS_MSVC -DPBRT_BUILD_GPU_RENDERER -DPBRT_HAS_INTRIN_H -DPBRT_IS_WINDOWS -DNOMINMAX -D"PBRT_NOINLINE=__declspec(noinline)" -DPBRT_HAVE__ALIGNED_MALLOC -DPTEX_STATIC -D"CMAKE_INTDIR="RelWithDebInfo"" -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdpbrt_lib.dir\RelWithDebInfo\pbrt_lib.pdb /FS /Zi /MD /GR" -o pbrt_lib.dir\RelWithDebInfo\samples.obj "D:\Softwares\pbrt-v4\src\pbrt\gpu\samples.cpp"

Can you tell me how to get CMake to generate "-maxrregcount 128" instead of "-maxrregcount=128"?

I think you would need to patch CMake: if CMake were not adding that default -maxrregcount=0, you could just pass in -maxrregcount 128 as you would other arguments. I don't think it matters in the end, because nvcc does not seem to complain about an unknown argument, and when I looked at the different generated kernels, none were using more than 128 regs.

richardmgoodin commented on June 23, 2024

So, I just deleted and reinstalled my entire NVIDIA software stack. RelWithDebInfo, killeroos-gold: 17.4s. I think we are done with this issue.

pierremoreau commented on June 23, 2024

Glad to hear you got it running well, but it's unfortunate you needed to reinstall everything to achieve that!

mmp commented on June 23, 2024

Yaay!
