Low CPU usage? (tractseg, 20 comments, closed)

mic-dkfz commented on August 19, 2024
Low cpu usage?

from tractseg.

Comments (20)

wasserth commented on August 19, 2024

It seems like using Python 3 + PyTorch 1.0 (instead of Python 2.7 + PyTorch 0.4) increases the CPU usage significantly. Runtime therefore decreases by a factor of at least 3x.

wasserth commented on August 19, 2024

Are you using a GPU?
The MRtrix CSD extraction does not use all cores all the time; it seems not all of its steps are optimized for multiprocessing.
Regarding TractSeg: it also contains several steps that only run on a single core. And somehow PyTorch also does not use all CPUs properly when running on CPU (instead of GPU).
The only thing that should lead to higher CPU usage is the tracking. However, even that will not sit at 800% the whole time, as saving and loading are single-core (and quite a few files have to be loaded for each tract).

thijsdhollander commented on August 19, 2024

@jdtournier @bjeurissen, both CSD and MSMT-CSD run fully multi-threaded (almost) the entire time, right?

jdtournier commented on August 19, 2024

both CSD and MSMT-CSD run fully multi-threaded (almost) the entire time, right?

By default, yes - assuming no other settings have been applied that affect the number of threads (the -nthreads option, the NumberOfThreads config file entry, or the MRTRIX_NTHREADS environment variable). The same goes for tckgen (although less so with the -seed_dynamic option). All of these should scale pretty linearly with CPU core count, and typically have no trouble achieving 100% CPU utilisation.
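As an illustrative sketch (not from the thread), the three thread-count controls listed above could be set up like this; the dwi2fod invocation and file names are placeholders, and MRtrix3 itself is not executed here:

```python
# Sketch of the three MRtrix3 thread-count controls; the dwi2fod
# arguments and file names are placeholders, and MRtrix3 is not invoked.
import os

# 1. Per-command option:
cmd = ["dwi2fod", "csd", "dwi.mif", "wm_response.txt", "wm_fod.mif",
       "-nthreads", "8"]

# 2. Environment variable, inherited by any MRtrix3 child process:
env = dict(os.environ, MRTRIX_NTHREADS="8")

# 3. Config file entry (MRtrix3 reads this from ~/.mrtrix.conf):
config_line = "NumberOfThreads: 8"

# subprocess.run(cmd, env=env, check=True)  # would require an MRtrix3 install
print(cmd[-2:], env["MRTRIX_NTHREADS"], config_line)
```

The command-line option takes precedence over the environment variable, which in turn overrides the config file entry.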

Looking at the graph, I can't really tell which process corresponds to which colour (commands are truncated, I can't tell the difference between tckgen and python create fgClassified on CPU usage, and tckgen doesn't feature on RAM usage). Which one corresponds to the dwi2fod call? But overall, given that almost all the commands seem to plateau around the 180% mark, with no fluctuations about that (which you might expect if external factors were causing the slowdown), it looks like they're running with reduced thread counts. You should be able to pick that up by looking at the -debug output of any of these commands (it'll report how many threads it launched), or via a command like htop, which can give you a thread-level breakdown.

wasserth commented on August 19, 2024

By CSD peak extraction I meant: dwi2response tournier + dwi2fod + sh2peaks.
dwi2fod and sh2peaks are using all CPUs most of the time, but dwi2response tournier runs at only a small CPU load for most of its runtime.
@soichih I double-checked my code. --nr_cpus is used correctly and is correctly passed on to MRtrix.

soichih commented on August 19, 2024

All, thanks for helping me troubleshoot this problem.

The graph was taken on a CE without a GPU.

I've run another job and this time I see a clearer peak for tckgen. I believe tckgen is running with all 8 cores I am specifying, but it runs so quickly that it wasn't registering in the CPU usage graph earlier.

[screenshot from 2018-12-11 10:40:06]

Like @wasserth says, I believe the issue is with PyTorch not being able to use all available CPUs (odd!)

jdtournier commented on August 19, 2024

But dwi2response tournier is running most of the time on only a small cpu load.

I might be getting the wrong end of the stick here; it depends on how this shows up in your monitoring. But since dwi2response is a script that calls other MRtrix3 commands and does very little itself, it's likely that the CPU usage involved in its invocation shows up under the various executables involved - primarily dwi2fod, fod2fixel, fixel2voxel, and amp2response. Would that explain it?
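One way to check this (a sketch, not from the thread) is to list the children of the running dwi2response script; our own PID stands in for the script's PID here, so with the real PID you would see the spawned MRtrix3 binaries and their CPU usage:

```python
# List the child processes of a given PID together with their CPU usage
# (GNU ps on Linux). os.getpid() is a placeholder for the dwi2response
# script's PID; with the real PID, dwi2fod, fod2fixel, etc. would appear.
import os
import subprocess

pid = os.getpid()  # placeholder: substitute the dwi2response script's PID
out = subprocess.run(
    ["ps", "-o", "pid=,pcpu=,comm=", "--ppid", str(pid)],
    capture_output=True, text=True,
).stdout
print(out)  # here, the ps process itself is our only child
```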

wasserth commented on August 19, 2024

I looked at the overall CPU usage when running dwi2response and most of the time it was rather low. I did not check in detail which commands were invoked by dwi2response and how much CPU they were using.

jdtournier commented on August 19, 2024

OK, what you should see is high CPU usage initially for dwi2fod, followed by fod2fixel and fixel2voxel, then subsequently lots of very rapid runs of these same commands back to back as it iterates over smaller ROIs. Basically the first iteration is whole-brain, the remaining ones run over much smaller masks. I think the lower CPU usage during these fast iterations is not unexpected.

We could potentially improve this, but it would require turning the script into its own binary executable, and the gains are likely to be minimal. If CPU usage is not high, then each stage is already so fast that the overhead and delays of command invocation are starting to become significant, which implies the script as a whole is fast and unlikely to be a bottleneck in any reasonable workflow (in my experience, the later iterations are sub-second). If that's not your experience, then let us know and we can investigate.

The only other potential culprit here is amp2response, which has been reported to be slow in some circumstances, and is single-threaded (the bulk of it is one big matrix solve). Again, there may be ways to improve on this, but we'd need evidence that it's worth doing...

wasserth commented on August 19, 2024

I think the speed of dwi2response is fine. The main cause of @soichih's low CPU usage seems to be PyTorch not properly using all cores.

jdtournier commented on August 19, 2024

👍 good to hear!

soichih commented on August 19, 2024

amp2response is running slow, and it looks like it's running on a single thread.

[screenshot from 2018-12-14 11:16:46]

I see that the -nthreads option is not set on the command line.

hayashis@gpu1-pestillilab:~ 130 ps -ef | grep amp2response
hayashis 16914 14735 99 10:25 pts/6    00:51:56 /mrtrix3/bin/amp2response dwi.mif gm_mask.mif dirs.mif gm.txt -shells 5,998,1998,2996 -isotropic

I think dwi2response runs this command, and I don't see the -nthreads option set there either.

hayashis@gpu1-pestillilab:~ $ ps -ef | grep dwi2
hayashis 14735  8877  0 10:24 pts/6    00:00:00 python /mrtrix3/bin/dwi2response msmt_5tt ./tractseg_output/Diffusion_MNI.nii.gz ./tractseg_output/5TT.mif ./tractseg_output/RF_WM.txt ./tractseg_output/RF_GM.txt ./tractseg_output/RF_CSF.txt -voxels ./tractseg_output/RF_voxels.mif -fslgrad ./tractseg_output/Diffusion_MNI.bvecs ./tractseg_output/Diffusion_MNI.bvals -mask ./tractseg_output/nodif_brain_mask.nii.gz

TractSeg/mrtrix.py should be setting -nthreads, but it looks like it's not working somehow.

By the way, I am setting --nr_cpus option on TractSeg.

/usr/bin/python /usr/local/bin/TractSeg -i dwi.nii.gz --raw_diffusion_input --csd_type csd_msmt_5tt --output_type tract_segmentation --keep_intermediate_files --postprocess --nr_cpus 8 -o . --preprocess

TractSeg seems to be ignoring it?

soichih commented on August 19, 2024

Sorry to muddy this thread, but... does PyTorch use OpenMP? If so, you might need to set the OMP_NUM_THREADS environment variable to make it run on multiple threads?

pytorch/pytorch#3021 (comment)
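For what it's worth, a minimal sketch of the ordering constraint: OpenMP reads OMP_NUM_THREADS when the runtime is initialised, so it has to be in the environment before PyTorch is imported (the torch import below is commented out and purely hypothetical):

```python
import os

# OMP_NUM_THREADS must be set before the OpenMP runtime is loaded, i.e.
# before `import torch`; setting it after the import has no effect.
os.environ["OMP_NUM_THREADS"] = "8"

# import torch  # hypothetical: torch would now start 8 OpenMP threads
print(os.environ["OMP_NUM_THREADS"])
```

In practice the variable is usually exported in the shell that launches the Python process rather than set from within it.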

jdtournier commented on August 19, 2024

amp2response is running slow, and it looks like it's running on a single thread.

It is single-threaded - see my previous comment about it. If it's problematic, I'm pretty sure we can fix that, but it'll be a while before I find the time to look into it...

soichih commented on August 19, 2024

@jdtournier Ah, sorry I missed that.. thanks for the info! I was reading this doc > https://mrtrix.readthedocs.io/en/latest/reference/commands/amp2response.html

I wonder why it lists the -nthreads option though? Copy/paste error?

jdtournier commented on August 19, 2024

I wonder why it lists the -nthreads option though?

It's one of our standard options, common across all apps. It does actually kick in for various operations other than the main processing itself (e.g. the data preload, if it's deemed necessary). I agree it's a bit misleading in this context; we've discussed options to fix that, but it's a low-priority issue. Besides, I might fix it by making the command properly multi-threaded at some point anyway...

wasserth commented on August 19, 2024

When I run TractSeg -i Diffusion.nii.gz -o test --raw_diffusion_input --csd_type csd_msmt_5tt --nr_cpus 8 I get the following output:

(ts_env) jakob@jakob-ubuntu ~/dev/bsp$ ps -ef | grep dwi2
jakob    16838  8763  0 11:24 pts/5    00:00:00 sh -c dwi2response msmt_5tt Diffusion.nii.gz test/tractseg_output/5TT.mif test/tractseg_output/RF_WM.txt test/tractseg_output/RF_GM.txt test/tractseg_output/RF_CSF.txt -voxels test/tractseg_output/RF_voxels.mif -fslgrad Diffusion.bvecs Diffusion.bvals -mask test/tractseg_output/nodif_brain_mask.nii.gz -nthreads 8

So it is passing nr_cpus on to MRtrix correctly. Are you maybe using an older version of TractSeg?

Regarding PyTorch: it uses OpenMP, and I am setting the thread count via torch.set_num_threads, which internally sets the OpenMP number of threads. Not sure why utilisation is still so low then.
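A minimal check of that mechanism (assuming PyTorch is installed) could look like:

```python
import torch

# set_num_threads configures PyTorch's intra-op (OpenMP) thread pool,
# which is what matters for CPU inference; it should be called before
# any heavy tensor work. get_num_threads reports the current setting.
torch.set_num_threads(8)
print(torch.get_num_threads())
```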

soichih commented on August 19, 2024

@wasserth I am using https://github.com/MIC-DKFZ/TractSeg/archive/v1.7.1.zip

I just tested a script similar to the one posted here > pytorch/pytorch#3146

I had to increase the batch size to this

NUM_INPUTS = 1000
NUM_OUTPUTS = 1000
BATCH_SIZE = 128

.. but I was seeing a high CPU usage (600% for 8 threads) as expected.

Are you running PyTorch differently from the test code in some fundamental way? Are you sure torch.set_num_threads is getting called? It could also be that the batch size needs to be increased?
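For reference, a self-contained sketch along those lines (the constants mirror the ones quoted above, but this is not the actual script from pytorch/pytorch#3146, and it assumes PyTorch is installed):

```python
import torch

# Constants as quoted in the thread; the rest of this script is a guess
# at a comparable workload, not the original benchmark.
NUM_INPUTS = 1000
NUM_OUTPUTS = 1000
BATCH_SIZE = 128

torch.set_num_threads(8)

# A single large linear layer driven in a loop; watching htop while this
# runs shows how many cores the underlying matmuls actually use.
model = torch.nn.Linear(NUM_INPUTS, NUM_OUTPUTS)
x = torch.randn(BATCH_SIZE, NUM_INPUTS)
with torch.no_grad():
    for _ in range(50):
        y = model(x)
print(y.shape)
```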

wasserth commented on August 19, 2024

torch.set_num_threads is correctly called. But I will test what happens if I increase the batch size.

wasserth commented on August 19, 2024

I tried it with a higher batch size. This increased the runtime by around 30%, and CPU usage is still way below 100%. A further downside of increasing the batch size is higher memory usage (around 30 GB of RAM).
