Coder Social home page Coder Social logo

cactus_test's Introduction

Disclaimer

!! ----------------------------------------------------------------- !!
This implementation was intended for my own needs to run cactus on a HPC via slurm.
It is NOT extensivly tested, so please use and adapt with caution.
!! ----------------------------------------------------------------- !!

Fixed Settings

Note, that all cactus jobs use --maskMode none in this approach, so all input genomes are required to be repeatmasked already.

Background

This is a snakemake adapataion (Mölder et al. 2021) of a stepwise execution of a progressive cactus alignment (Armstrong et al. 2020) to create a hierarchical alignment of multiple genome sequences (a .hal file, Hickey, G. et al. 2013).

The workflow is based on the official cactus documentation and the tuorial Cactus on the FASRC Cluster by the Harvard Faculty of Arts and Sciences. A key adaptation here was to switch from a filesystem image overlay to a directory overlay to allow simultaneous access to the overlay directory by processes running in parallel (preventing a can't open ./jobStore.img for writing, currently in use by another process error).

The features of the adaptation are:

  • management of jobs on HPC trough snakemake (eg. grid enginge / slurm scheduling)
  • execution within the cactus container for through apptainer

The main idea is to pull apart the individual commands created by cactus-prepare and store them within individual .sh files. These are ordered hierarchically by the cactus rounds, allowing snakemake to work through the individual rounds successively, while parallelizing jobs within each round:

sh/ 
└── cactus
    ├── round_1
    │   ├── job_1
    │   │   ├── step_1.sh
    │   │   └── step_2.sh
    │   └── job_1.sh
    ├── round_2
    │   ├── job_1
    │   │   ├── step_1.sh
    │   │   ├── step_2.sh
    │   │   └── step_3.sh
    │   ├── job_1.sh
    │   ├── job_2
    │   │   ├── step_1.sh
    │   │   ├── step_2.sh
    │   │   └── step_3.sh
    │   └── job_2.sh
    ├── round_3
    │   ├── job_1
    │   │   ├── step_1.sh
    │   │   ├── step_2.sh
    │   │   └── step_3.sh
    │   └── job_1.sh
    ├── round_4
    │   ├── job_1
    │   │   ├── step_1.sh
    │   │   ├── step_2.sh
    │   │   └── step_3.sh
    │   └── job_1.sh
    └── round_5
        ├── job_1
        │   ├── step_1.sh
        │   ├── step_2.sh
        │   └── step_3.sh
        └── job_1.sh

Preparations

Install apptainer

The snakemake pipeline uses apptainer to run cactus within the official container (docker://quay.io/comparative-genomics-toolkit/cactus:v2.5.1, specified within config.yml).

For this to work apptainer (or singularity) needs to be installed on your system.

You can check that the pices are in place by running:

apptainer shell docker://quay.io/comparative-genomics-toolkit/cactus:v2.5.1

If sucessfully opens a shell within the cactus container, you should be set (you can leave the shell by typing exit).

Container check (on first test, this will take a bit longer and report the download of the container):

$ apptainer shell docker://quay.io/comparative-genomics-toolkit/cactus:v2.5.1
INFO:    Using cached SIF image
Apptainer> which cactus
/home/cactus/cactus_env/bin/cactus
Apptainer> exit
$

Conda environemnt for R

The R scripts that chops down the cactus instructions requries R > 4.1. If you want to be independent of the locally installed R version, you can use the provided conda environment for the R processes.

To install the required conda environment run:

conda env create -f workflow/envs/r_base.yml

Then, when calling snakemake use the --use-conda flag, eg:

snakemake -c 1 --use-conda cactus_prep

Running the pipeline

The pipeline execution is a two-step process, as the number of 'rounds' of the cactus alignment can not be determined before the compilation of the cactus instructions.

Therefore, to run the pipeline, first call snakemake <options> cactus_prep to create the file "results/cactus/job_inventory.tsv" that lists the number of jobs within each cactus round and which is needed for the rules within cactus_stepwise.smk.

Then, run snakemake <options> cactus_stepwise for the actual alignment.

A minimal example of a complete run would be:

snakemake -c 1 --use-conda cactus_prep
snakemake -c 1 --use-singularity cactus_stepwise

For the example data, this should provide the final hal file (results/cactus/mammal_set.hal), as well as check of this file (results/cactus/mammal_set_check.check)

cat results/cactus/mammal_set_check.check 
#> hal v2.2
#> ((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;
#> 
#> GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
#> Anc0, 2, 564579, 1, 0, 7936
#> Anc1, 2, 584785, 1, 9796, 41365
#> simHuman_chr6, 0, 597871, 1, 41960, 0
#> mr, 2, 610030, 1, 37996, 46842
#> simMouse_chr6, 0, 636262, 1, 46393, 0
#> simRat_chr6, 0, 647215, 1, 46074, 0
#> Anc2, 2, 576113, 1, 8881, 42841
#> simCow_chr6, 0, 602619, 1, 40218, 0
#> simDog_chr6, 0, 593897, 1, 40949, 0

Troubleshooting

On occasions where a job is being killed (eg. by slurm), I was able to pick it up again after a little manual manipulation of the job script of the failed task:

The idea is to append the cactus command with --restart. So if for example the job running sh/cactus/round_2/job_1/step_1.sh failed, the original script would need updating as follows:

# original step_1.sh
cactus-blast /tmp/js/2 results/cactus/mammal_set.txt /tmp/steps-output/Anc2.cigar --root Anc2  --maxCores 4
# update step_1.sh
cactus-blast /tmp/js/2 results/cactus/mammal_set.txt /tmp/steps-output/Anc2.cigar --root Anc2  --maxCores 4 --restart

After this manual manipulation, snakemake should be able to resume (and hopefully run the job to completion).

References

Armstrong, J. et al. (2020) Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature, 587(7833), 246–251.

Hickey, G. et al. (2013) HAL: a hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics, 29(10), 1341–1342.

Kurtzer, G. M., Sochat, V., & Bauer, M. W. (2017) Singularity: Scientific containers for mobility of compute. PLOS ONE, 12(5), e0177459.

Mölder, F. et al. (2021) Sustainable data analysis with Snakemake (10:33). F1000Research.

cactus_test's People

Contributors

k-hench avatar

Watchers

 avatar

Forkers

rshuhuachen

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.