Comments (8)

dkuegler commented on September 3, 2024

It seems the surface processing job has not started. Can you share the log files of the surface jobs? You'll find them in $outputdir/slurm/logs/surf_*

Also, I have started providing some FAQs in https://github.com/Deep-MI/FastSurfer/blob/dev/doc/scripts/slurm.md.

from fastsurfer.

hpardoe commented on September 3, 2024

The surf_* log files are attached. I tried bumping up the time for the surface jobs to 1440 minutes via the following srun_fastsurfer.sh command, but it didn't help. I also limited the run to twenty subjects, but got the "sub_XXXX was terminated externally" message for each subject when running the code snippet provided in the FAQ.

./srun_fastsurfer.sh --partition seg=m3g --partition surf=comp --sd $outputdir --data $datadir --singularity_image /home/hpardoe/vn36/fastsurfer/fastsurfer-gpu.sif --work $workdir --fs_license /usr/local/freesurfer/7.4.0/license.txt --subject_list /home/hpardoe/vn36/fastsurfer/subjects/subject_list_20240407_split20.partaa --3T --mem surf=20 --mem seg=20 --time seg=20 --time surf=1440

surf_35329197_1.log
surf_35329197_2.log

dkuegler commented on September 3, 2024

So it seems the script does not select the correct subjects for processing. We had some issues in the srun_fastsurfer.sh and brun_fastsurfer.sh scripts in 2.2.0 (stable), but I thought those issues were solved in dev.

I am not really sure which version you are using. It seems you might not be using the 2.2.0 stable version of the brun_fastsurfer.sh script.
Your log file has:

[...]/fastsurfer/work/scripts/brun_fastsurfer.sh: line 330: (task_id - 1) * "20" / task_count: syntax error: operand expected (error token is ""20" / task_count")

This indicates that the script does not properly identify which subjects it should process.

But if I compare this with brun_fastsurfer.sh in 2.2.0 (your stated version), that call is not on line 330 but on line 329: https://github.com/Deep-MI/FastSurfer/blob/v2.2.0/brun_fastsurfer.sh#L329.

Importantly, brun_fastsurfer.sh and srun_fastsurfer.sh do not need to be the same version as the backbone fastsurfer/fastsurfer image, so here is my recommendation:

Check out FastSurfer (branch dev) with git and call srun_fastsurfer.sh from there (this might already be what you are doing). This will use the dev version of brun_fastsurfer.sh and srun_fastsurfer.sh; specify the singularity image of fastsurfer 2.2.0 to still get the benefit of the validated fastsurfer version.

Next, add the --debug flag to srun_fastsurfer.sh -- this will put extra output into the log files (especially those files you attached).

There should also be log files like surf_35329197_1_*.log. Those are the log files from the individual fastsurfer processes, and they might have extra information at the end on why the process was terminated externally. 1440 minutes (24 h) should really be enough for the surface pipeline, so the process may be getting killed because of something else.

hpardoe commented on September 3, 2024

Oops, you're correct, I provided the wrong version number. If I leave $fastsurferdir out of the singularity command used to get the version number, it gives me the wrong one:

singularity exec --nv --no-home -B $datadir:/data -B $outputdir:/output -B /usr/local/freesurfer/7.4.0:/fs_license /home/hpardoe/vn36/fastsurfer/fastsurfer-gpu.sif /fastsurfer/run_fastsurfer.sh --version
2.2.0+9f37d02 (wrong)

singularity exec --nv --no-home -B $datadir:/data -B $outputdir:/output -B $fastsurferdir:/fastsurfer -B /usr/local/freesurfer/7.4.0:/fs_license /home/hpardoe/vn36/fastsurfer/fastsurfer-gpu.sif /fastsurfer/run_fastsurfer.sh --version
outputs 2.3.0-dev+0000000 (correct)

I've added the --debug flag like this:
./srun_fastsurfer.sh --partition seg=m3g --partition surf=comp --sd $outputdir --data $datadir --singularity_image /home/hpardoe/vn36/fastsurfer/fastsurfer-gpu.sif --work $workdir --fs_license /usr/local/freesurfer/7.4.0/license.txt --subject_list /home/hpardoe/vn36/fastsurfer/subjects/aep_subject_list_20240407_split20.partaa --3T --mem surf=20 --mem seg=20 --time seg=20 --time surf=1440 --debug

The new log files are attached; however, I can't find any extra surf_XXXXXXXX_1_*.log style files in the $outputdir/slurm/logs directory or in the $outputdir/subject_XXX/scripts directories. I've added slurm-submit_35364091_35364092.log in case that's helpful.

The contents of $outputdir/slurm/logs directory are:
cleanup_35364093.log
seg_35364091_1.log
seg_35364091_2.log
slurm-submit_35364091_35364092.log
surf_35364092_1.log
surf_35364092_2.log

I'm also unclear on how to specify the singularity image of fastsurfer 2.2.0 as per your instructions; I built fastsurfer-gpu.sif using the following command, not sure if this is appropriate or not:
singularity build fastsurfer-gpu.sif docker://deepmi/fastsurfer:latest

slurm-submit_35364091_35364092.log
surf_35364092_1.log
surf_35364092_2.log

dkuegler commented on September 3, 2024

singularity exec --nv --no-home -B $datadir:/data -B $outputdir:/output -B $fastsurferdir:/fastsurfer -B /usr/local/freesurfer/7.4.0:/fs_license /home/hpardoe/vn36/fastsurfer/fastsurfer-gpu.sif /fastsurfer/run_fastsurfer.sh --version

outputs 2.3.0-dev+0000000 (correct)

Here, you are mounting the checked out version into the container.

I think the source of the confusion is that you do in fact have two versions of fastsurfer.

  1. The version you checked out with git, which is installed in $fastsurferdir. That is 2.3.0-dev; the 0000000 just means the script cannot figure out the git hash. It also means I do not really know which version of the script you are using, since this depends on the date you checked out fastsurfer. -B $fastsurferdir:/fastsurfer mounts this version into the container.
  2. The 2.2.0 stable version inside the image (the official, released docker image; using it is best practice).

This is all correct and fine, and I would recommend you keep using what you called the "wrong" way, i.e. without mounting $fastsurferdir. That uses the stable (validated) FastSurfer version for your analysis:

singularity exec --nv --no-home -B $datadir:/data -B $outputdir:/output -B /usr/local/freesurfer/7.4.0:/fs_license /home/hpardoe/vn36/fastsurfer/fastsurfer-gpu.sif /fastsurfer/run_fastsurfer.sh --version

Back to the error

[...]/work/scripts/brun_fastsurfer.sh: line 330: (task_id - 1) * "20" / task_count: syntax error: operand expected (error token is ""20" / task_count")

At this point, I am positive this issue is down to bash versions (https://www.cyberciti.biz/faq/how-do-i-check-my-bash-version/). The error occurs in the `/usr/bin/bash` shell, but I cannot reproduce it. I have changed the brun_fastsurfer.sh script a bit in #511, so you can check whether that fixes your issue. To do so, check out the branch `fix/older-bash-srun-fastsurfer` (keep the existing singularity image).

Other than that, you could report your bash version, as that seems to be the cause of this error (https://www.cyberciti.biz/faq/how-do-i-check-my-bash-version/).
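
One way to collect the bash version for such a report is sketched below. The helper names `bash_version_line` and `bash_major` are made up for this illustration and are not part of FastSurfer. On a cluster, the login node and the compute nodes may not have the same bash, so it is probably most informative to run this inside a job on the partition in question.

```shell
# Hypothetical helpers, not part of FastSurfer.

# Print the first line of `bash --version` for the bash found on PATH.
bash_version_line() {
  bash --version | head -n 1
}

# Print only the major version number of that bash (for example 4 or 5).
bash_major() {
  bash -c 'echo "${BASH_VERSINFO[0]}"'
}

# Report the node name together with the bash version.
echo "bash on $(uname -n): $(bash_version_line)"
if [ "$(bash_major)" -lt 5 ]; then
  echo "bash major version is below 5" >&2
fi
```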

Some other considerations:
Why does it work for 10 subjects but fail for more?
> The srun_fastsurfer.sh script groups the processing into batches of at most 16 subjects, so that work can be spread over multiple nodes and surface processing can start before the last segmentation has finished. In your case, 20 > 16, so two batches are needed, and spreading the subjects equally over those two batches gives 10 per batch.
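
The batch arithmetic described here can be sketched as follows. This is only an illustration of the numbers in this comment, not the code used in srun_fastsurfer.sh; the function name `batch_layout` and the rounding-up behaviour for subject counts that do not divide evenly are assumptions.

```shell
# Hypothetical helper, not taken from srun_fastsurfer.sh.
# Usage: batch_layout NUM_SUBJECTS [MAX_PER_BATCH]
# Prints "<number of batches> <subjects per batch>".
batch_layout() {
  num_subjects=$1
  max_per_batch=${2:-16}   # batch limit of 16 as stated in the thread
  if [ "$num_subjects" -le 0 ]; then
    echo "0 0"
    return 0
  fi
  # Smallest number of batches such that no batch exceeds the limit.
  num_batches=$(( (num_subjects + max_per_batch - 1) / max_per_batch ))
  # Spread the subjects evenly over the batches (rounded up, an assumption).
  per_batch=$(( (num_subjects + num_batches - 1) / num_batches ))
  echo "$num_batches $per_batch"
}

batch_layout 20   # prints "2 10": two batches of 10
batch_layout 10   # prints "1 10": a single batch
```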

hpardoe commented on September 3, 2024

#511 has fixed the issue! Thank you for your assistance.

In case it's useful:
HPC login node bash version is GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)

And if I do

cd $outputdir/slurm/logs
grep bash *

Two bash versions are listed, one for the surf log files and one for the seg log files:
surf_35433905_10.log:Running in shell /usr/bin/bash: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
seg_35433904_9.log:Running in shell /usr/bin/bash: GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)

dkuegler commented on September 3, 2024

Thanks!
That confirms it, then: in surf, the brun_fastsurfer.sh script runs on the host (bash 4), while in seg it runs inside the docker/singularity container (bash 5, from Ubuntu 22.04). I only tested the script with bash 5, and it seems bash 4 is a bit picky about the syntax in $(()) expressions.
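
For illustration only: the error token in the log (`"20" / task_count`) shows that literal double quotes reached the arithmetic evaluator. A version-independent way to write that kind of calculation is to validate the inputs first and keep quote characters out of the `$(( ))` expression altogether. The sketch below is not the code from brun_fastsurfer.sh or from #511; the function name `task_range` and the half-open index convention are assumptions.

```shell
# Hypothetical helper, not the actual brun_fastsurfer.sh code.
# Usage: task_range TASK_ID TASK_COUNT NUM_SUBJECTS
# Prints "<start> <end>", the half-open range of 0-based subject indices
# that the 1-based task TASK_ID should process.
task_range() {
  # Reject anything that is not a plain non-negative integer,
  # for example a value that still carries quote characters.
  for value in "$1" "$2" "$3"; do
    case "$value" in
      ''|*[!0-9]*) echo "not a number: $value" >&2; return 1 ;;
    esac
  done
  task_id=$1
  task_count=$2
  num_subjects=$3
  if [ "$task_count" -le 0 ]; then
    echo "task_count must be positive" >&2
    return 1
  fi
  # No quotes inside the arithmetic expression.
  start=$(( (task_id - 1) * num_subjects / task_count ))
  end=$(( task_id * num_subjects / task_count ))
  echo "$start $end"
}

task_range 1 2 20   # prints "0 10"
task_range 2 2 20   # prints "10 20"
```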

m-reuter commented on September 3, 2024

Thanks! I merged #511 and think this can be closed now.
