Snakemake Workflow: SegDupAnnotation2

Overview

A Snakemake workflow for counting gene duplications given PacBio reads, a genome assembly, and a gene model.
It is the successor to SegDupAnnotation.

โ— Warning: This workflow is under active development and should not yet be assumed to be a final or portable tool.

Usage

The usage options below are listed in order of ease of use and portability.
Internet access is required for Snakemake to download the Docker image and conda environments.

Singularity

  • Install Snakemake and Singularity.
  • Clone the GitHub repository.
  • Create a configuration file per the config/README.md specifications.
    • If necessary, use the 'override_mem' and 'override_num_cores' parameters.
  • From the repository root, run the following, which automatically downloads the Docker image and runs it within Singularity:
    • snakemake -c 1 -j 250 --use-singularity --use-conda --singularity-args " --bind <path to an input file>[,<path to another input file>] "
    • Remember to bind the paths of the input files listed in your config file.
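
The bind argument in the command above is a comma-separated list of input paths. As a minimal sketch, the string could be assembled from a list of paths like this. The helper name `build_bind_arg` and the example paths are hypothetical and not part of the workflow.

```python
def build_bind_arg(input_paths):
    """Build the value for --singularity-args from a list of input paths.

    Hypothetical helper: it only joins the paths with commas, skipping
    duplicates, and keeps the surrounding spaces used in the command above.
    """
    unique = []
    for path in input_paths:
        if path not in unique:  # keep first-seen order, drop repeats
            unique.append(path)
    return " --bind " + ",".join(unique) + " "


# Example paths are placeholders; use the input files named in your config.
bind_arg = build_bind_arg(["/data/reads.bam", "/data/asm.fasta"])
print(f'snakemake -c 1 -j 250 --use-singularity --use-conda --singularity-args "{bind_arg}"')
```
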

Bare Metal with Conda

  • Install Snakemake and Mamba.
  • Clone the GitHub repository.
  • Create a configuration file per the config/README.md specifications.
  • From the repository root, run the following, which automatically downloads and uses the pre-defined conda environments:
    • snakemake -c 1 -j 250 --use-conda

Bare Metal with Conda and SLURM

  • Install Snakemake and Mamba.
  • Clone the GitHub repository.
  • Create a configuration file per the config/README.md specifications.
  • From the repository root, run one of the following, which automatically downloads and uses the pre-defined conda environments:
    • snakemake -c 1 -j 250 --use-conda --slurm --default-resources slurm_account=<your SLURM account> slurm_partition=<your SLURM partition>, or
    • snakemake -c 1 -j 250 --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=<your SLURM account> --partition=<your SLURM partition>"
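
The `{resources.…}` placeholders in the `--cluster` string are filled in per job from each rule's resources. The sketch below only illustrates that substitution using Python's own string formatting. Snakemake performs the substitution itself, and the function name and resource values here are made up for the example.

```python
from types import SimpleNamespace


def expand_cluster_template(template, **resources):
    """Fill {resources.<name>} placeholders with the given values.

    Illustration only: this mimics the substitution with str.format and is
    not Snakemake's implementation.
    """
    return template.format(resources=SimpleNamespace(**resources))


TEMPLATE = (
    "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB "
    "--time={resources.runtime}"
)

# Hypothetical resource values for a single job.
print(expand_cluster_template(TEMPLATE, cpus_per_task=4, mem_mb=16000, runtime=120))
# prints: sbatch -c 4 --mem=16000MB --time=120
```
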

Bare Metal

  • Ensure all dependencies are installed on the system. For a list, see 'workflow/envs'.
  • Clone the GitHub repository.
  • Create a configuration file per the config/README.md specifications.
  • From the repository root, run:
    • snakemake -c 1 -j 250

My Bare Metal with SLURM commands

  • conda activate sda
  • Then I run:
    • snakemake -c 1 -j 250 -k --slurm --default-resources slurm_account=mchaisso_100 slurm_partition=qcb, or
    • snakemake -c 1 -j 250 -k --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=mchaisso_100 --partition=qcb --output=slurm-logs/slurm-%j.out"

Salient Output File Specifications

results/G01_dups_<isoform_grouping_type>.bed

Column        Description
#chr          Chromosome (or contig) of the gene copy's position in the assembly.
start         Start of the gene copy's position in the assembly.
end           End of the gene copy's position in the assembly.
gene          Gene name.
orig_chr      Chromosome (or contig) of the position of the gene copy's original copy in the assembly.
orig_start    Start of the original copy's position in the assembly.
orig_end      End of the original copy's position in the assembly.
strand        Strand the gene copy is on: '0' for 'Original', '1' for reverse.
haplotype     The haplotype of the hit. This requires 'haplotype1' or 'haplotype2' to be explicitly stated in the chromosome/scaffold/contig name.
p_identity    Similarity to the 'Original' gene, calculated as #matches / (#matches + #mismatches + #insertion_events + #deletion_events).
p_accuracy    Similarity to the 'Original' gene, calculated as #matches / (#matches + #mismatches + #insertions + #deletions).
identity      Notes whether the copy is the 'Original' or a resolved 'Copy'.
depth         Mean gene depth divided by mean assembly depth.
depth_stdev   Standard deviation of gene depth over mean assembly depth, calculated using 100 bp bins.
copy_num      Rounded depth value.
depth_by_vcf  Copy number as determined by the HMM's VCF output.
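
The two similarity columns differ only in how indels enter the denominator. The sketch below applies the two formulas as written, on the assumption (not stated in the source) that "events" counts each contiguous insertion or deletion once, while #insertions and #deletions count individual bases. The function names and the example counts are illustrative.

```python
def p_identity(matches, mismatches, insertion_events, deletion_events):
    """#matches / (#matches + #mismatches + #insertion_events + #deletion_events)."""
    return matches / (matches + mismatches + insertion_events + deletion_events)


def p_accuracy(matches, mismatches, insertions, deletions):
    """#matches / (#matches + #mismatches + #insertions + #deletions)."""
    return matches / (matches + mismatches + insertions + deletions)


# Made-up alignment: 990 matches, 5 mismatches, one 5 bp insertion, no deletions.
# Under the assumption above, that is 1 insertion event but 5 inserted bases.
identity = p_identity(990, 5, insertion_events=1, deletion_events=0)
accuracy = p_accuracy(990, 5, insertions=5, deletions=0)
print(f"p_identity={identity:.4f} p_accuracy={accuracy:.4f}")
```

Under this reading, a long indel lowers p_accuracy more than p_identity, because it counts once as an event but once per base otherwise.
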

results/G03_per_gene_counts_<isoform_grouping_type>.tsv

Column                Description
gene                  Gene name.
depthNormalizedToAsm  Mean of depth across all gene copies divided by mean assembly depth.
depth                 Sum of depth across all gene copies.
measuredCov           The depth value rounded to the nearest whole number.
copyCount             Total number of gene copies (resolved plus collapsed).
resolvedCount         Number of resolved gene copies.
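
Because copyCount is the sum of resolved and collapsed copies, the number of collapsed copies per gene can be derived as copyCount minus resolvedCount. The sketch below does this with the standard `csv` module. It assumes the file is tab-delimited with a header row that uses the column names listed above. The function name and the sample rows are made up.

```python
import csv
import io


def collapsed_counts(tsv_text):
    """Return {gene: copyCount - resolvedCount} from per-gene counts text.

    Assumes a header row with 'gene', 'copyCount' and 'resolvedCount' columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["gene"]: int(row["copyCount"]) - int(row["resolvedCount"])
        for row in reader
    }


# Made-up rows for illustration; real input would be read from the results file.
sample = "gene\tcopyCount\tresolvedCount\nGENE_A\t5\t3\nGENE_B\t2\t2\n"
print(collapsed_counts(sample))
```
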

results/G04_summary_stats_<isoform_grouping_type>.tsv
Contains tab-delimited summary statistics.

Tips

If you rerun the pipeline with the same input assembly and reads, retain and copy over the files A01_assembly.fasta and A04_assembly.bam, since aligning the reads to the assembly is the longest step.
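
A minimal sketch of carrying those two files over to a fresh run directory is shown below. It assumes both files sit directly in a results directory, which the source does not state, and the function name and directory arguments are hypothetical. `shutil.copy2` is used so that modification times are preserved, since Snakemake may take timestamps into account when deciding what to rerun.

```python
import shutil
from pathlib import Path

# File names taken from the tip above; their location is an assumption.
REUSABLE = ("A01_assembly.fasta", "A04_assembly.bam")


def carry_over(old_results, new_results, names=REUSABLE):
    """Copy reusable files from an old results directory into a new one.

    Returns the names that were actually copied; missing files are skipped.
    """
    old_results, new_results = Path(old_results), Path(new_results)
    new_results.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = old_results / name
        if src.is_file():
            shutil.copy2(src, new_results / name)  # keeps modification time
            copied.append(name)
    return copied
```

For example, `carry_over("old_run/results", "new_run/results")` would copy whichever of the two files exist in the old directory.
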
