lsf's Introduction

Snakemake LSF profile

License: MIT

📢 NOTICE: We are seeking volunteers to maintain this repository as the current maintainers no longer use LSF. See this issue. 📢

Snakemake profile for running jobs on an LSF cluster.


Install

Dependencies

This profile is deployed using Cookiecutter. If you do not have cookiecutter installed, you can install it with pip or mamba by running:

pip install --user cookiecutter
# or
mamba create -n cookiecutter -c conda-forge cookiecutter
mamba activate cookiecutter

If neither of these methods suits you, then visit the installation documentation for other options.

Profile

Download and set up the profile on your cluster:

# create configuration directory that snakemake searches for profiles
profile_dir="${HOME}/.config/snakemake"
mkdir -p "$profile_dir"
# use cookiecutter to create the profile in the config directory
template="gh:Snakemake-Profiles/lsf"
cookiecutter --output-dir "$profile_dir" "$template"

You will then be prompted to set some default parameters.

LSF_UNIT_FOR_LIMITS

Default: KB
Valid options: KB, MB, GB, TB, PB, EB, ZB

⚠️IMPORTANT⚠️: This must be set to the same value as LSF_UNIT_FOR_LIMITS on your cluster. This value is stored in your cluster's lsf.conf file, which is generally located at ${LSF_ENVDIR}/lsf.conf. So the easiest way to get this value is to run the following:

grep '^LSF_UNIT_FOR_LIMITS' ${LSF_ENVDIR}/lsf.conf

You should get something along the lines of LSF_UNIT_FOR_LIMITS=MB. If this command doesn't work, get in touch with your cluster administrator to find out the value.

As mentioned above, this is a very important parameter: it sets the scaling units used for resource limits. If this value is MB on your cluster, then a memory limit passed with -M 1000 is taken to mean 1000 megabytes. Since snakemake allows you to set the memory for a rule with the resources: mem_mb parameter, the profile needs to know whether that value has to be converted into other units when submitting jobs. See here for further information.
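
To make the conversion concrete, here is a minimal sketch of the kind of scaling involved. This is illustrative only and not the profile's actual code; the helper name and the unit table are assumptions.

# Illustrative sketch only - not the profile's real implementation.
# Converts snakemake's mem_mb into the unit LSF expects for bsub -M,
# based on the cluster's LSF_UNIT_FOR_LIMITS setting.
UNIT_SCALE_FROM_MB = {
    "KB": 1024,        # 1 MB = 1024 KB
    "MB": 1,
    "GB": 1 / 1024,
    "TB": 1 / 1024 ** 2,
}

def mem_mb_to_cluster_units(mem_mb: int, lsf_unit_for_limits: str = "KB") -> int:
    """Return the value to pass to bsub -M for a rule requesting mem_mb megabytes."""
    scale = UNIT_SCALE_FROM_MB[lsf_unit_for_limits]
    return max(1, round(mem_mb * scale))

# A rule with resources: mem_mb=4000 on a cluster where LSF_UNIT_FOR_LIMITS=KB
# would be submitted with roughly -M 4096000.
print(mem_mb_to_cluster_units(4000, "KB"))  # prints 4096000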

UNKWN_behaviour

Default: wait
Valid options: wait, kill

When LSF returns a job status of UNKWN, do you want to wait for the host the job is running on to become contactable again (i.e. consider the job running), or kill it as outlined here?

ZOMBI_behaviour

Default: ignore
Valid options: ignore, kill

When LSF returns a job status of ZOMBI, do you want to ignore it (i.e. not clean it up) or kill it as outlined here? Regardless of the option chosen, the job is considered failed.

latency_wait

Default: 5

This sets the default --latency-wait/--output-wait/-w parameter in snakemake.
From the snakemake --help menu

  --latency-wait SECONDS, --output-wait SECONDS, -w SECONDS
                        Wait given seconds if an output file of a job is not
                        present after the job finished. This helps if your
                        filesystem suffers from latency (default 5).

use_conda

Default: False
Valid options: False, True

This sets the default --use-conda parameter in snakemake.
From the snakemake --help menu

  --use-conda           If defined in the rule, run job in a conda
                        environment. If this flag is not set, the conda
                        directive is ignored.

use_singularity

Default: False
Valid options: False, True

This sets the default --use-singularity parameter in snakemake.
From the snakemake --help menu

  --use-singularity     If defined in the rule, run job within a singularity
                        container. If this flag is not set, the singularity
                        directive is ignored.

restart_times

Default: 0

This sets the default --restart-times parameter in snakemake.
From the snakemake --help menu

  --restart-times RESTART_TIMES
                        Number of times to restart failing jobs (defaults to
                        0).

print_shell_commands

Default: False
Valid options: False, True

This sets the default --printshellcmds/-p parameter in snakemake.
From the snakemake --help menu

  --printshellcmds, -p  Print out the shell commands that will be executed.

jobs

Default: 500

This sets the default --cores/--jobs/-j parameter in snakemake.
From the snakemake --help menu

  --cores [N], --jobs [N], -j [N]
                        Use at most N cores in parallel. If N is omitted or
                        'all', the limit is set to the number of available
                        cores.

In the context of a cluster, -j denotes the number of jobs submitted to the cluster at the same time.

default_mem_mb

Default: 1024

This sets the default memory, in megabytes, for a rule being submitted to the cluster without mem_mb set under resources.

See below for how to overwrite this in a rule.

default_cluster_logdir

Default: "logs/cluster"

This sets the directory under which cluster log files are written. The path is relative to the working directory of the pipeline. If it does not exist, it will be created.

The log files for a given rule are organised into sub-directories. This is to avoid having potentially thousands of files in one directory, as this can cause file system issues.
If you want to find the log files for a rule called foo with wildcards sample=a,ext=fq, the standard output will be located at logs/cluster/foo/sample=a,ext=fq/jobid<jobid>_<uuid>.out and the standard error at the same path with extension .err.
<jobid> is the internal jobid used by snakemake and is the same across multiple attempts at running the same rule.
<uuid> is a random, hyphen-separated identifier that is specific to each attempt at running a rule. So if a rule fails and is restarted, the uuid will be different.
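
For example, if the internal jobid happened to be 12 (purely illustrative), the logs for that rule could be listed with:

ls logs/cluster/foo/sample=a,ext=fq/
# jobid12_<uuid>.out  jobid12_<uuid>.err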

The reason for such a seemingly complex log-naming scheme is explained in Known Issues. However, you can override the name of the log files for a specific rule by following the instructions below.

default_queue

Default: None

The default queue on the cluster to submit jobs to. If left unset, then the default on your cluster will be used.
The bsub parameter that this controls is -q.

default_project

Default: None

The default project on the cluster to submit jobs with. If left unset, then the default on your cluster will be used.

The bsub parameter that this controls is -P.

max_status_checks_per_second

Default: 10

This sets the default --max-status-checks-per-second parameter in snakemake.
From the snakemake --help menu

  --max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND
                        Maximal number of job status checks per second,
                        default is 10, fractions allowed.

max_jobs_per_second

Default: 10

This sets the default --max-jobs-per-second parameter in snakemake.
From the snakemake --help menu

  --max-jobs-per-second MAX_JOBS_PER_SECOND
                        Maximal number of cluster/drmaa jobs per second,
                        default is 10, fractions allowed.

max_status_checks

Default: 1

How many times to check the status of a job.

wait_between_tries

Default: 0.001

How many seconds to wait until checking the status of a job again (if max_status_checks is greater than 1).

profile_name

Default: lsf

The name to use for this profile. The profile directory is created with this name, i.e. $HOME/.config/snakemake/<profile_name>.
This is also the value you pass to snakemake --profile <profile_name>.
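
After cookiecutter finishes, the profile directory contains the profile's configuration and its submission/status scripts. As a rough guide only (the exact file list depends on the profile version):

ls "${HOME}/.config/snakemake/lsf/"
# config.yaml  lsf_submit.py  lsf_status.py  OSLayer.py  ...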

Usage

Once setup is complete, you can run snakemake with the cluster profile using the --profile flag. For example, if the profile name was lsf, then you can run:

snakemake --profile lsf [snakemake options]

and pass any other valid snakemake options.
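
For example, to override the profile's concurrent job limit and keep going when independent jobs fail (the option values here are only illustrative):

snakemake --profile lsf --jobs 100 --keep-going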

Standard rule-specific cluster resource settings

The following resources can be specified within a rule:

  threads: <INT> - the number of threads needed by the rule (passed to bsub as -n).
  resources: mem_mb=<INT> - the memory, in megabytes, required by the rule (converted according to LSF_UNIT_FOR_LIMITS when the job is submitted).

NOTE: these settings will override the profile defaults. An example rule is shown below.
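
A minimal sketch of a rule using these standard settings (the rule name, file names and values are illustrative):

rule heavy_job:
    input: "in.txt"
    output: "out.txt"
    threads: 4
    resources:
        mem_mb=16000
    shell:
        "sort --parallel={threads} -o {output} {input}"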

Non-standard rule-specific cluster resource settings

Since the deprecation of cluster configuration files, the ability to specify per-rule cluster settings is snakemake-profile-specific.

Per-rule configuration must be placed in a file called lsf.yaml, located in the working directory of the pipeline. If you set workdir manually within your workflow, the config file has to be in that directory.

NOTE: these settings are only valid for this profile and are not guaranteed to be valid on non-LSF cluster systems.

All settings are given with the rule name as the key, and the additional cluster settings as a string (scalar) or list (sequence).

Examples

Snakefile

rule foo:
    input: "foo.txt"
    output: "bar.txt"
    shell:
        "grep 'bar' {input} > {output}"

rule bar:
    input: "bar.txt"
    output: "file.out"
    shell:
        "echo blah > {output}"

lsf.yaml

__default__:
  - "-P project2"
  - "-W 1:05"

foo:
  - "-P gpu"
  - "-gpu 'gpu resources'"

In this example, we specify a default (__default__) project (-P) and runtime limit (-W) that will apply to all rules.
We then override the project and, additionally, specify GPU resources for the rule foo.

For those interested in the details, this will lead to a submission command for foo that looks something like:

$ bsub [options] -P project2 -W 1:05 -P gpu -gpu 'gpu resources' ...

Although -P is provided twice, LSF uses the last instance.

__default__: "-P project2 -W 1:05"

foo: "-P gpu -gpu 'gpu resources'"

The above is also a valid form of the previous example but not recommended.

Quote-escaping

Some LSF commands require multiple levels of quote-escaping.
For example, to exclude a node whose name contains non-alphabetic characters from job submission (docs): bsub -R "select[hname!='node-name']".

You can specify this in lsf.yaml as:

__default__:
    - "-R \"select[hname!='node-name']\""

Known Issues

If you are running very large snakemake pipelines, or many workflow management systems are submitting and checking jobs on the same cluster at the same time, we have seen cases where retrieval of the job state from LSF returns an empty status. This causes problems, as we then do not know whether the job has passed or failed. In these circumstances, the status-checker falls back to looking at the job's log file to see whether it is complete or still running; this is the reason for the seemingly complex log file naming scheme. As the status-checker uses tail to get the status, status checking will slow down if the job's standard output log file is very large. If you run into these problems and the tail fallback is not feasible, the first suggestion would be to reduce --max-status-checks-per-second and see if this helps.
Please raise an issue if you experience this, and the log file check doesn't seem to work.

Contributing

Please refer to CONTRIBUTING.md.

lsf's People

Contributors

befh, bricoletc, dlaehnemann, jaicher, mbhall88, nbmueller


lsf's Issues

`lsf.yaml` settings for specific rules are not used

I've tried using lsf.yaml to define rule-specific options multiple times over the last year, but it has never worked. Unfortunately, it is always faster to just edit my Snakefile, but I thought I'd post an issue this time because it really would be more appropriate, vis-à-vis reusability, to have these settings in the YAML. In the most recent attempt, I am trying

__default__:
  - "-W 3:59"

foo: 
  - "-W 24:00"

but all my foo jobs still get created with the default settings. Has anyone else had success using this?

Setup CI tests

For testing, the ideal scenario would be to have a CI setup that executes an ad-hoc LSF cluster (e.g. via docker) to which we would automatically submit a few toy jobs. I am not sure whether such container images are available though.

From Snakemake-Profiles/doc#12 (comment)

Error with Line 207 on lsf_status.py

I am receiving the following error:

Resuming incomplete job 9 with external jobid '3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) '.
Traceback (most recent call last):
  File ".config/snakemake/lsf/lsf_status.py", line 207, in <module>
    jobid = int(split_args[0])
ValueError: invalid literal for int() with base 10: '3.8.13'
WorkflowError:
Failed to obtain job status. See above for error message.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-05-04T193619.896311.snakemake.log

SnakeMake version: 7.5.0

It may be related to issue #45.

Make queue configurable per rule and attempt

Sometimes we want to try running a job with 100 GB of RAM on the 1st attempt, 300 GB on the 2nd and 1 TB on the 3rd. The third attempt might need to be submitted to a big-memory queue, while the first two could go to the standard queue. The user should be able to specify which queue the profile should use at each attempt.
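
For reference, attempt-dependent memory can already be expressed with a callable resource in a plain Snakefile (the rule below is an illustrative sketch and the command is a placeholder); what this issue asks for is an equivalent mechanism for the queue:

rule big_job:
    output: "result.txt"
    resources:
        # 100 GB, 300 GB and 1000 GB on attempts 1, 2 and 3 (requires --restart-times 2)
        mem_mb=lambda wildcards, attempt: [100_000, 300_000, 1_000_000][attempt - 1]
    shell:
        "some_memory_hungry_tool > {output}"  # placeholder command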

Allow configuration of per-rule cluster settings

Issue

Profiles are a better general solution for cluster configuration. However, I still feel there is a need to allow for configuring special cluster settings for specific rules.
For example, on my cluster, if I want to use GPUs for a job I need to add a collection of parameters to the bsub call that are not relevant to normal jobs. It would be great if there was a way of specifying this.

Given that the cluster configuration is now deprecated it seems like a better way of doing per-rule configuration is to have a profile-specific way as each cluster will likely have different ways of configuring the "same" thing - e.g. GPU usage.

Proposal

Allow for an LSF-specific config file that effectively has much the same functionality as the deprecated cluster configuration YAML file. To ensure that it is obvious that this file is LSF-specific it will be required to be named lsf.yaml. Then, within the job submission process of this profile, we will look for the presence of this file. If there are specific settings for the current rule being submitted, then those settings will be applied.

To make this mirror LSF as much as possible, I will endeavour to make the configuration mirror the LSF bsub options as closely as possible, and of course, produce thorough documentation on usage.

EDIT: I think the best way to allow for the full suite of bsub options is to just require the user to provide strings for the commands, otherwise I would end up implementing an entire API for bsub (which I am not keen on). So it would look something like

__default__: 
  - "-P project"
  - "-q queue"

my_rule:
  - "-P gpu"
  - "-m gpu-host"

other_rule: "-q special-queue -gpu 'num=2'"

Which allows either a list or a single string. I prefer the list as it is "neater" but I will support both.

@johanneskoester do you have any problems with this or any points you would like to raise/discuss?

LSF profile is broken for newest 7.1.1 (2022-03-07) snakemake version

Error:

Submitted group job 29137fca-fc8a-5511-88d3-2149b00a8b5e with external jobid '5412741 logs/cluster/group_1/unique/jobid29137fca_07818911-9e6a-460c-a46e-b0f3f0bdb675.out'.
Traceback (most recent call last):
  File "/homes/leandro/.config/snakemake/lsf/lsf_status.py", line 201, in <module>
    jobid = int(sys.argv[1])
ValueError: invalid literal for int() with base 10: '5412741 logs/cluster/group_1/unique/jobid29137fca_07818911-9e6a-460c-a46e-b0f3f0bdb675.out'
WorkflowError:
Failed to obtain job status. See above for error message.

The reason is that on snakemake 7.1.1 the job id is now quoted. The LSF profile works on snakemake 7.1.0, though. Will provide a fix soon.

Error submitting jobscript, bsub returns exit code 255

Hey!
Not quite sure if this is the right place for my issue, as I suspect it's more of a cluster issue than a problem with the profile. But maybe someone can still help.
I'm getting randomly failing job submissions on pipelines that usually work fine. The tracebacks are something along the lines of:

Traceback (most recent call last):
  File "/homes/lukasw/.config/snakemake/lsf_short/lsf_submit.py", line 230, in submit
    external_job_id = self._submit_cmd_and_get_external_job_id()
  File "/homes/lukasw/.config/snakemake/lsf_short/lsf_submit.py", line 216, in _submit_cmd_and_get_external_job_id
    output_stream, error_stream = OSLayer.run_process(self.submit_cmd)
  File "/homes/lukasw/.config/snakemake/lsf_short/OSLayer.py", line 40, in run_process
    completed_process = subprocess.run(
  File "[..]/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'bsub -M 1000 -n 1 -R 'select[mem>1000] rusage[mem=1000] span[hosts=1]' -o "logs/cluster/[..]/jobid140_5dad74cf-fda6-41db-9965-ac8a0c18ec25.out" -e "logs/cluster/[..]/jobid140_5dad74cf-fda6-41db-9965-ac8a0c18ec25.err" -J "[..]" -q short [..]/.snakemake/tmp.4tc5r3ou/snakejob.core_metaquast.140.sh' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/homes/lukasw/.config/snakemake/lsf_short/lsf_submit.py", line 259, in <module>
    lsf_submit.submit()
  File "/homes/lukasw/.config/snakemake/lsf_short/lsf_submit.py", line 236, in submit
    raise BsubInvocationError(error)
__main__.BsubInvocationError: Command 'bsub -M 1000 -n 1 -R 'select[mem>1000] rusage[mem=1000] span[hosts=1]' -o "logs/cluster/[..]/jobid140_5dad74cf-fda6-41db-9965-ac8a0c18ec25.out" -e "logs/cluster/[..]/jobid140_5dad74cf-fda6-41db-9965-ac8a0c18ec25.err" -J "[..]" -q short [..]/.snakemake/tmp.4tc5r3ou/snakejob.core_metaquast.140.sh' returned non-zero exit status 255.
Error submitting jobscript (exit code 1):

So bsub returns exit code 255, which leads the profile to raise a BsubInvocationError. Since this issue appears sporadically, I am wondering if this could be caused by file system latency (which is quite high on this system). I.e. the jobscript is not available yet at the time of trying to run bsub, would that make sense? Any ideas how I might try and debug this?

Lastly, could the profile help mitigate this by e.g. waiting a few seconds between creating the jobscript and submitting it, or by offering an option to retry submission after a few seconds instead of raising an error at the first failure to submit?
Cheers!

New maintainers needed

I no longer use LSF and @leoisl's institute will soon move to Slurm. As such, we would like to hand over maintenance of this repository to others who are actively using LSF. Please comment below if you are willing to take on this role.

Is it possible to use "immediate-submit" and "job dependencies"?

Hi,

Thank you for this wonderful profile. I was wondering whether it is possible to set "immediate-submit: true" in config.yaml file and still use job dependencies setting for LSF profile.

This kind of setting worked well in the PBS profile but seems to trouble the LSF profile, as the lsf-submit.py file cannot deal with the -w option of LSF.

How can I use this profile so that my snakemake script submits multiple job commands with the specified dependency options?

Use workflow file name in log directory names

Hi,

I find myself running several workflows with the same rule name, leading to the log directory with that rule name holding log files for multiple workflows.

Currently I just rename rules to avoid that, but what do you think of prefixing the log dir name with the workflow file name?

Specifying non-standard rule-specific cluster resource settings

Hello,

In order to specify e.g. a time limit (which can be needed to select an appropriate queue) for a certain rule, one needs to create lsf.yaml (see also #7 and #13). This file needs to be in the working directory of the pipeline.

This can be difficult or slightly annoying to handle, as lsf.yaml needs to be moved manually to the correct working directory before the pipeline can be executed.

Is there a better way of doing this that I am missing?
Or asked differently, what is the difference between standard and non-standard resources?

Snakemake writes the wrong error to the output file, jobs fail for no reason

Hello,
thanks for this plug-in first and foremost.

I noticed an issue with my workflow and could not create a minimal non-working example. The logs of the jobs and the error messages do not line up:

Take job 5767 for example:

[Mon Sep 28 14:14:25 2020]
rule mash_sketch:
    input: data/species/genomes/GCA_900290415.1.fna
    output: data/species/sketch/GCA_900290415.1.msh
    jobid: 5767
    wildcards: gca=GCA_900290415.1
    resources: mem_mb=4000

Submitted job 5767 with external jobid '9873015 logs/cluster/mash_sketch/gca=GCA_900290415.1/jobid5767_7192d265-413d-4ae5-8d21-1ac836805741.out'.

The log file "logs/cluster/mash_sketch/gca=GCA_900290415.1/jobid5767_7192d265-413d-4ae5-8d21-1ac836805741.err"
is for a different rule "predict proteins":

Building DAG of jobs...
Traceback (most recent call last):
  File "[...]/miniconda3/envs/drep_euk/lib/python3.7/site-packages/snakemake/__init__.py", line 709, in snakemake
    keepincomplete=keep_incomplete,
  File "[...]/miniconda3/envs/drep_euk/lib/python3.7/site-packages/snakemake/workflow.py", line 670, in execute
    dag.init()
  File "[...]/miniconda3/envs/drep_euk/lib/python3.7/site-packages/snakemake/dag.py", line 177, in init
    job = self.update(self.file2jobs(file), file=file, progress=progress)
  File "[...]/miniconda3/envs/drep_euk/lib/python3.7/site-packages/snakemake/dag.py", line 715, in update
    progress=progress,
  File "[...]/miniconda3/envs/drep_euk/lib/python3.7/site-packages/snakemake/dag.py", line 792, in update_
    file.inventory()
  File "[...]/miniconda3/envs/drep_euk/lib/python3.7/site-packages/snakemake/io.py", line 210, in inventory
    self._local_inventory(cache)
  File "[...]/miniconda3/envs/drep_euk/lib/python3.7/site-packages/snakemake/io.py", line 224, in _local_inventory
    with os.scandir(path) as scan:
FileNotFoundError: [Errno 2] No such file or directory: 'data/species/gmes/compute/GCA_014235955.1'

The out log file is correct and shows there was an issue with the job, but not what it was:


------------------------------------------------------------
Sender: LSF System <lsf@hx-noah-11-07>
Subject: Job 9873015: <mash_sketch.gca=GCA_900290415.1> in cluster <EBI> Exited

Job <mash_sketch.gca=GCA_900290415.1> was submitted from host <hx-noah-39-01> by user <$USER> in cluster <EBI> at Mon Sep 28 14:14:25 2020
Job was executed on host(s) <hx-noah-11-07>, in queue <research-rh74>, as user <$USER> in cluster <EBI> at Mon Sep 28 14:14:26 2020
</homes/$USER> was used as the home directory.
</hps/research/$GROUP/$USER/projects/drep/snakemake> was used as the working directory.
Started at Mon Sep 28 14:14:26 2020
Terminated at Mon Sep 28 14:15:25 2020
Results reported at Mon Sep 28 14:15:25 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/hps/research/$GROUP/$USER/projects/drep/snakemake/.snakemake/tmp.hrw102kd/snakejob.mash_sketch.5767.sh
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   1.63 sec.
    Max Memory :                                 61 MB
    Average Memory :                             58.50 MB
    Total Requested Memory :                     4000.00 MB
    Delta Memory :                               3939.00 MB
    Max Swap :                                   468 MB
    Max Processes :                              4
    Max Threads :                                5
    Run time :                                   59 sec.
    Turnaround time :                            60 sec.

The output (if any) is above this job summary.

I don't know what is getting messed up, but the log files (always just the err file) do not line up with jobs, so debugging is complicated, and I suspect this messes with snakemake's job success/fail evaluation.

Change `max_status_checks` to a number >1

lsf_status.py is sometimes unable to retrieve the status of a running job and fails with

[Predicted exception] Error calling bjobs: Command 'bjobs -o 'stat' -noheader 212651182' returned non-zero exit status 255.

By default, this is tried once and then makes Snakemake crash. Changing the default value from 1 to, e.g., 10 in the line

max_status_checks: int = 1,

fixed this issue for me.

Is there a reason why this is set to 1 by default although the retry mechanism exists? How about setting it to a higher number, or making this configurable via the cookiecutter template?

Better handling when the default cluster log directory does not exist

When the default cluster log directory does not exist, LSF won't be able to redirect the error and output streams to the log directory, causing it to fall back to the default behaviour, which is to e-mail the output of these streams (at least on the EBI cluster). I guess it would be nice to either i) print a warning about this, ii) error out saying that the log directory does not exist, or iii) create the log directory automatically. Not sure which would be the best option, but I think ii) or iii) are fine, since I guess it will be common for users to forget to create the cluster log dir.

Or maybe there is a better option?

Thanks!

Unable to set queue per rule

Hi!

I was trying to set queue for a rule by adding

    cluster: 
        queue='long'

but snakemake (5.22.1) gave me a SyntaxError Unexpected keyword cluster in rule definition (test.smk, line 4)

I also tried editing lsf.yaml file by adding test: '-q long', but the job was submitted to the default queue.

As I understand it, the problem is with this piece of code:
https://github.com/Snakemake-Profiles/snakemake-lsf/blob/2e6f23cbea58bb07bde5eff873be6bc87f2a4018/%7B%7Bcookiecutter.profile_name%7D%7D/lsf_submit.py#L63-L64

And I was able to hack it by replacing cluster with params in lsf-submit.py and adding this to the rule:

    params: 
        queue='long'

Then the job was submitted to the long queue as expected.

Add end-to-end tests with a LSF container?

We have lots of small unit tests where we mock LSF behaviour to ensure our code behaves as we expect. This is really nice, as it is very easy to add new features or change the behaviour of existing features without having to test in a real LSF system and check if it works. With mocking, testing new code is very easy.

However, we can have issues if we incorrectly mock LSF behaviour. Also, having some end-to-end tests in a real LSF system (e.g. in a container) could also be valuable. The only other snakemake profile that has tests is slurm, and their tests are these end-to-end tests in a slurm container: https://github.com/Snakemake-Profiles/slurm/tree/master/tests

End-to-end tests in a container with proper setup and proper tests do take time and effort (way more complicated than using mock). I am unsure if we should add these tests, and what is the priority.

Handling UNKWN and ZOMBI status

It seems that when a job is submitted or running in an unreachable host (e.g. host was reachable when job was submitted to it, and while it was executing, it became unreachable), its status becomes UNKWN: https://www.ibm.com/support/pages/how-requeue-lsf-jobs-unknwn-status-when-remote-execution-host-becomes-unreachable
The current profile will consider that an UNKWN job is still running, so it will not try to kill it. I actually don't know if it is better to wait for this job to eventually change its status, or to simply kill the job directly and try to resubmit it.
Killing an UNKWN job should be done with bkill -r: https://www.ibm.com/support/pages/how-requeue-lsf-jobs-unknwn-status-when-remote-execution-host-becomes-unreachable . Simply bkill won't do anything. This will take the job from UNKWN to ZOMBI, and then EXIT. Currently, the pipeline does not know the ZOMBI status:
[Predicted exception] Unknown job status: 'ZOMBI'
which will cause it to try some more times to get the status, and then eventually give up and check the log.

There are several approaches we can handle UNKWN and ZOMBI:

  1. Add ZOMBI to STATUS_TABLE and put as RUNNING (UNKWN is already like this), as eventually the ZOMBI job will become EXIT, and then we recognise it failed;
  2. When we see UNKWN, we bkill -r it. When we see ZOMBI, we say that the job FAILED;

Option 1 seems to need manual intervention though... An UNKWN job might return to a valid state if the execution host becomes reachable again (I think execution hosts become unreachable when there is an actual issue with the host, and thus they need to be rebooted anyway? So the job is lost anyway?). So the user might want to wait for the UNKWN job to return to a valid state, or bkill -r it, and then it becomes ZOMBI and EXIT.

Option 2 is more automatic, but requires more development and is more aggressive: as soon as we see UNKWN, we bkill -r it and resubmit it. I prefer option 2, as if the execution host became unreachable, I usually prefer to kill the job and submit to a healthy host than waiting for an unknown period of time to maybe it become reachable again.

In any case, Option 1 is already more or less what is implemented. The user has to manually kill these jobs, and the ZOMBI state is not recognised, but if everything fails we eventually go look at the LSF log. So this is not an urgent issue, but maybe something nice to fix at some point.

PS: there is a more annoying case where some jobs had the status RUN for almost a day, and not a single line was executed. I think it might be related to this issue, but somehow LSF did not manage to tag these jobs as UNKWN. We can retrieve how much computing time a job has used with bjobs -l:

Sun Aug 30 16:29:56: Resource usage collected.
                     The CPU time used is 71 seconds.

I am sure there is a better way that would allow us to query just the CPU time.

It would be nice to deal with this as well, as my pipelines actually got stuck, and I thought they were just taking long, but actually nothing was being run... It seems to me that this issue happens when the execution host somehow can't execute anything. It might also be solvable with an LSF pre-exec command (on the hypothesis that if the execution host can't execute anything, it won't be able to execute a simple echo), or with this constant resource-usage querying.

Cluster cancel doesn't work with current job ID

Because we emit the log path and the job ID together, setting --cluster-cancel to bkill fails with:

Terminating processes on user request, this might take some time.
Traceback (most recent call last):
  File "/homes/mbhall88/.config/snakemake/lsf/lsf_status.py", line 232, in <module>
    print(lsf_status_checker.get_status())
  File "/homes/mbhall88/.config/snakemake/lsf/lsf_status.py", line 156, in get_status
    status = self._query_status_using_bjobs()
  File "/homes/mbhall88/.config/snakemake/lsf/lsf_status.py", line 99, in _query_status_using_bjobs
    output_stream, error_stream = OSLayer.run_process(self.bjobs_query_cmd)
  File "/homes/mbhall88/.config/snakemake/lsf/OSLayer.py", line 40, in run_process
    completed_process = subprocess.run(
  File "/hps/software/users/iqbal/mbhall/miniconda3/envs/who-correspondence/lib/python3.10/subprocess.py", line 503, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/hps/software/users/iqbal/mbhall/miniconda3/envs/who-correspondence/lib/python3.10/subprocess.py", line 1149, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/hps/software/users/iqbal/mbhall/miniconda3/envs/who-correspondence/lib/python3.10/subprocess.py", line 2000, in _communicate
    ready = selector.select(timeout)
  File "/hps/software/users/iqbal/mbhall/miniconda3/envs/who-correspondence/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
3399781 logs/cluster/download_data/proj=PRJNA650381.sample=SAMN27765622.run=SRR18914164/jobid258_ff07f69d-f434-4b59-99f1-3f65c066bf08.out: Illegal job ID.
1 out of 1 calls to --cluster-cancel failed.  This is safe to ignore in most cases.

I guess we just can't emit the log path with the job ID.

@leoisl maybe we could emit the log path to stderr?

Ability to specify time limit (-W)

Hello, it would be nice to be able to set the job time limit (-W).
It seems like this would require adding an entry to a rule's resources, but it's not clear to me that there is a standard key for this in snakemake.

Complex shell quote escaping fails

Hello,

I was trying to instruct snakemake to ignore a specific lsf host which is faulty, so I wrote this in my lsf.yaml:

__default__:
  - "-R \"select[hname!='hl-codon-32-02']\""

(This gets parsed as {'__default__': ['-R "select[hname!=\'hl-codon-32-02\']"']} by pyyaml)

The reason I do this is that if the hostname to ignore has special characters, the shell command needs to look like -R "select[hname!='hname']" (see IBM docs : If you need to include a hyphen (-) or other non-alphabetic characters within the string, enclose the text in single quotation marks, for example, bsub -R "select[hname!='host06-x12']")

However, this gets passed at submission time as -R select[hname!='hl-codon-32-02'], but lsf cannot submit that (try bsub -Is -R select[hname!='hl-codon-32-02'] bash, for example).

Phew, long story short: I have found a simple solution to this and am putting in a PR with unit tests.

BsubInvocation Error

I am trying to select some specific nodes on lsf cluster by making a new cluster.json file:

{
    "__default__": {
        "queue": "normal",
        "nodes": "lz-gpu lf-gpu ln-gpu lx-gpu ly-gpu ll-gpu",
        "extra": "",
        "threads": 4  # Default thread count, can be overridden per rule
    }
}

But I am getting this error:

File "/home/ryanr2/data/anaconda3/envs/sopa/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'bsub -M 200 -n 1 -R 'select[mem>200] rusage[mem=200] span[hosts=1]' -W 300 -o "logs/cluster/patch_segmentation_baysor/index=53/jobid55_6637d0e7-670b-4ba2-ad88-3580dba86228.out" -e "logs/cluster/patch_segmentation_baysor/index=53/jobid55_6637d0e7-670b-4ba2-ad88-3580dba86228.err" -J "patch_segmentation_baysor.index=53" -q normal /lila/data/chanjlab/ryanr2/sopa/workflow/.snakemake/tmp.5z0azw4k/snakejob.patch_segmentation_baysor.55.sh' returned non-zero exit status 255.

Any idea?

Add default mem_mb, disk_mb, tmpdir

snakemake accepts a --default-resources option that allows us to specify the default mem_mb, disk_mb and tmpdir if a rule does not specify these directives. We can ask for these values in the cookiecutter, and in config.yaml they can be specified using the following format:

default-resources: ["mem_mb=7777", "disk_mb=8888", "tmpdir='/hps/nobackup/research/zi/leandro/snakemake_test/temp'"]

Adding this to my config.yaml, and submitting this dummy job, which does not specify any of these directives:

rule make_file:
    output:
        "new_file.txt"
    shell:
        "touch {output}"

I got the following log:

[Tue Apr 19 11:39:19 2022]
rule make_file:
    output: new_file.txt
    jobid: 1
    resources: mem_mb=7777, disk_mb=8888, tmpdir=/hps/nobackup/research/zi/leandro/snakemake_test/temp

Job script is:

#!/bin/sh
# properties = {"type": "single", "rule": "make_file", "local": false, "input": [], "output": ["new_file.txt"], "wildcards": {}, "params": {}, "log": [], "threads": 1, "resources": {"mem_mb": 7777, "disk_mb": 8888, "tmpdir": "/hps/nobackup/research/zi/leandro/snakemake_test/temp"}, "jobid": 1, "cluster": {}}
cd /hps/nobackup/research/zi/leandro/snakemake_test && /hps/nobackup/research/zi/leandro/miniconda3/bin/python -m snakemake --snakefile '/hps/nobackup/research/zi/leandro/snakemake_test/Snakefile' 'new_file.txt' --allowed-rules 'make_file' --cores 'all' --attempt 1 --force-use-threads  --wait-for-files '/hps/nobackup/research/zi/leandro/snakemake_test/.snakemake/tmp.ftj3136o' --force --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --skip-script-cleanup  --conda-frontend 'mamba' --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 10 --scheduler 'greedy' --scheduler-solver-path '/hps/nobackup/research/zi/leandro/miniconda3/bin' --default-resources 'mem_mb=7777' 'disk_mb=8888' "tmpdir='/hps/nobackup/research/zi/leandro/snakemake_test/temp'" --mode 2 && exit 0 || exit 1

if we instead specify mem_mb in the job:

rule make_file:
    output:
        "new_file.txt"
    resources: mem_mb=100
    shell:
        "touch {output}"

we get this log:

[Tue Apr 19 11:49:00 2022]
rule make_file:
    output: new_file.txt
    jobid: 1
    resources: mem_mb=100, disk_mb=8888, tmpdir=/hps/nobackup/research/zi/leandro/snakemake_test/temp

with this job script:

#!/bin/sh
# properties = {"type": "single", "rule": "make_file", "local": false, "input": [], "output": ["new_file.txt"], "wildcards": {}, "params": {}, "log": [], "threads": 1, "resources": {"mem_mb": 100, "disk_mb": 8888, "tmpdir": "/hps/nobackup/research/zi/leandro/snakemake_test/temp"}, "jobid": 1, "cluster": {}}
cd /hps/nobackup/research/zi/leandro/snakemake_test && /hps/nobackup/research/zi/leandro/miniconda3/bin/python -m snakemake --snakefile '/hps/nobackup/research/zi/leandro/snakemake_test/Snakefile' 'new_file.txt' --allowed-rules 'make_file' --cores 'all' --attempt 1 --force-use-threads  --wait-for-files '/hps/nobackup/research/zi/leandro/snakemake_test/.snakemake/tmp.iq82hnq_' --force --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --skip-script-cleanup  --conda-frontend 'mamba' --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 10 --scheduler 'greedy' --scheduler-solver-path '/hps/nobackup/research/zi/leandro/miniconda3/bin' --default-resources 'mem_mb=7777' 'disk_mb=8888' "tmpdir='/hps/nobackup/research/zi/leandro/snakemake_test/temp'" --mode 2 && exit 0 || exit 1

so it indeed works

resources: mem_mb= specification doesn't change memory requirement for LSF job submission

Hello,

I am not sure whether there's something wrong in my Snakefile setup: When I use

            resources:
                mem_mb=32, time_min = 300
The time limit is correctly passed to LSF (bsub) jobs, but the memory configuration still uses the memory limit set in the snakemake profile.

Is there some way to change the memory requirement for the run without changing the snakemake profile? If there's a quick way to edit the profile to make the change, that would be a valuable solution for me as well (right now using cookiecutter to create a new profile works for me, but it seems that directly editing the profile could be quicker/more direct?).

Thanks a lot

Isaac

Add member

@johanneskoester would you be willing to add @leoisl as a contributor on this repository? He is a software engineer in my group and has been helping improve the profile. It would be great to be able to request PR reviews and have him be able to deal with issues too.

Does not work with Snakemake v8

This profile is incompatible with Snakemake version 8 because the --cluster command line arguments were deprecated (more information here).

Here is this same issue being discussed on the slurm profile Github.

I understand this package is no longer being actively maintained, but I thought I would leave a note in case it's helpful to someone in the future.

Snakemake version 7 still works fine with this profile (more specifically, I'm using v7.32.4).

Singularity availability on LSF

I work on a cluster where I need to do module load singularity to add singularity to the path. I must do this within every new job, not once at the start -- module load singularity && bsub "singularity" fails. Based on my experience with a solution similar to what you have here, I suspect that this setup won't allow --use-singularity. I predict it will fail with the same sort of error, which will manifest like this:

Building DAG of jobs...
WorkflowError:
The singularity command has to be available in order to use singularity integration. 

My current workaround is to prepend module load singularity to the variable you call jobscript in the wrapper (line 98 of lsf-submit.py).

rule-specific config parsed as a dict, not as a list

Thanks for putting together the config docs! It looks like things diverged at one point, and the rule-specific yaml file gets parsed as a dict rather than a list. Sorry, I should have saved the error message!

I was getting errors until looking at the sge profile and seeing how they specified their keys. I think the example in the lsf README

__default__:
  - "-P project2"
  - "-W 1:05"

foo:
  - "-P gpu"
  - "-gpu 'gpu resources'"

should become something like

__default__:
  P: "project2"
  W: "1:05"

foo:
  P: gpu
  gpu: 'gpu resources'

Explicitly specifying memory units breaks the profile in older versions of LSF

I use LSF on two different clusters. In one, this is the LSF version:

$ lsid
IBM Spectrum LSF Standard 10.1.0.6, May 25 2018

Explicitly specifying memory units in this LSF version works just fine, e.g. this command works fine:

bsub -M 1000MB -n 1 -R 'select[mem>1000MB] rusage[mem=1000MB] span[hosts=1]' -o ls.out -e ls.err -J ls_job ls

In the other cluster, the version is a bit older:

$ lsid
IBM Spectrum LSF Standard 10.1.0.0, Jul 08 2016

Explicitly specifying memory units in this cluster fails:

$ bsub -M 1000MB -n 1 -R 'select[mem>1000MB] rusage[mem=1000MB] span[hosts=1]' -o ls.out -e ls.err -J ls_job ls
1000MB: MEMLIMT value should be a positive integer. Job not submitted.

It works if MB is removed:

$ bsub -M 1000 -n 1 -R 'select[mem>1000] rusage[mem=1000] span[hosts=1]' -o ls.out -e ls.err -J ls_job ls
Job <9491039> is submitted to default queue <standard>.

I did not track which update of LSF between these two versions enabled memory units to be specified in -M.

I wonder if:

  1. Should we support older versions of LSF? Or should we require users to have an updated version? Supporting only the most recent version has the advantage of using new features, like this one, but it also means the profile will work only for a subset of users. Supporting older versions (e.g. 10.1.0.0) means that a much larger fraction of users can use this profile, but we have to code around the missing features.
  2. We should choose which version to support, set up a docker image with the chosen version, and run some real end-to-end tests with trivial pipelines to ensure the profile works. The testing framework we have now is fine, but we are mocking the behaviour of LSF. I am hoping LSF is backwards compatible, so we don't actually need to test every version after the chosen one.

Better naming of cluster log files

I don’t like the format the names of the cluster log files have been changed to. It makes it impossible to figure out what log file relates to what job without digging into the snakemake stderr log (which is one of the main things I hate about nextflow).

Current implementation

self.logdir / "{jobid}_{random_string}.err".format(jobid=self.jobid, random_string=self.random_string)

Proposal

self.logdir / self.rule_name / self.wildcards_str / "jobid{jobid}_{random_string}.err".format(jobid=self.jobid, random_string=self.random_string)

Contrasting both implementations

# current
'logdir/2_random.out'
# proposed
'logdir/search_fasta_on_index/i=0/jobid2_random.out'

There are two major advantages I see to the new naming scheme.

  1. It is easier to find the log file for a specific job without having to search for its jobid in the snakemake log.
  2. For large pipelines that produce tens or hundreds of thousands of jobs, this will prevent there being potentially 200,000 log files in one directory. Which I guess might send the cluster into meltdown 😅
