
k-mer similarity analysis pipeline

Home Page: https://nf-co.re/kmermaid

License: MIT License

Dockerfile 0.61% HTML 2.16% Python 11.56% Nextflow 85.67%
nf-core workflow pipeline nextflow kmer k-mer kmer-frequency-count kmer-counting

kmermaid's Introduction

nf-core/kmermaid

Compare DNA/RNA/protein sequences on k-mer content.


Introduction

Workflow overview

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It ships with Docker containers, making installation straightforward and results highly reproducible.

Quick Start

i. Install nextflow

ii. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort; see docs)

iii. Download the pipeline and test it on a minimal dataset with a single command

nextflow run nf-core/kmermaid -profile test,<docker/singularity/conda/institute>

Please check nf-core/configs to see if a custom config file for running nf-core pipelines already exists for your institute. If so, you can simply use -profile <institute> in your command. This will enable either Docker or Singularity and set the appropriate execution settings for your local compute environment.

iv. Start running your own analysis!

nextflow run nf-core/kmermaid -profile <docker/singularity/conda/institute> --input '*_R{1,2}.fastq.gz' --genome GRCh37

See usage docs for all of the available options when running the pipeline.

Documentation

The nf-core/kmermaid pipeline comes with documentation about the pipeline, found in the docs/ directory:

  1. Installation
  2. Pipeline configuration
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Usage

With a samples.csv file

nextflow run nf-core/kmermaid --outdir s3://bucket/sub-bucket --samples samples.csv

With R1, R2 read pairs

nextflow run nf-core/kmermaid --outdir s3://olgabot-maca/nf-kmer-similarity/ \
  --read_pairs 's3://bucket/sub-bucket/*{R1,R2}*.fastq.gz,s3://bucket/sub-bucket2/*{1,2}.fastq.gz'

With SRA ids

nextflow run nf-core/kmermaid --outdir s3://bucket/sub-bucket --sra SRP016501

With fasta files

nextflow run nf-core/kmermaid --outdir s3://bucket/sub-bucket \
  --fastas '*.fasta'

With bam file

nextflow run nf-core/kmermaid  --outdir s3://bucket/sub-bucket \
  --bam 'possorted_genome_bam.bam'

With split kmer

nextflow run nf-core/kmermaid --outdir s3://bucket/sub-bucket --samples samples.csv --split_kmer --subsample 1000

Credits

nf-core/kmermaid was originally written by Olga Botvinnik.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on Slack (you can join with this invite).

Citation

If you use nf-core/kmermaid for your analysis, please cite it using the following doi: 10.5281/zenodo.4143940

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
ReadCube: Full Access Link

kmermaid's People

Contributors

ewels, maxulysse, nf-core-bot, olgabot, phoenixaja, pranathivemuri, snafees


kmermaid's Issues

No TEMPLATE branch

Hello!

I just pushed a new release of nf-core/tools with the template, but kmermaid failed to sync because it doesn't have a TEMPLATE branch (see nf-core/tools#709 (comment)).

Normally this branch is created when the pipeline repo is "vanilla" with no modifications from the nf-core/tools template. The TEMPLATE branch needs to have a common history with a commit in the main branch which has this state.

It looks like this pipeline wasn't created with nf-core create, so this is a little trickier: it needs a manual merge. This process is described in https://nf-co.re/developers/sync#manual-synchronisation

Once you've done this once, all future automated PRs should work fine.

Shout if you have any problems!

Phil

Add option to skip compare

Lately, I have been completely ignoring the sourmash compare step/results so let's add an option to skip it!

Reasons for skipping:

  • Too many samples to compare so it'll take too long/run out of memory anyway
  • Won't actually use the compare result

Add `sourmash index` option

While I'm not doing a lot of sourmash compare (see #75) I may still want to make an SBT index.

Reasons:

  • SBT indices make it easy to add a single new sample; comparison matrices do not
  • SBT indices may (eventually..) be localized and fed into a KNN graph algorithm using tools like UMAP (sourmash-bio/sourmash#925)

BUG: `--peptide_fasta` is only for translation

Before adding khtools extract-coding/sencha translate, older versions of kmermaid allowed --peptide_fasta to be a set of protein FASTA files that could then be sketched directly via sourmash. Now, --peptide_fasta is used only for sencha translate, and there is no option to input already-translated protein fasta files directly. This is a problem for me because I have ~132 protein fasta files that took 3 days to process, at up to ~12 hours (720m+) per fasta.


I'd like to input these files directly into sketch making so I don't have to go through the translation step again.

Add option to remove hashes, e.g. ribosomal seqs

Is your feature request related to a problem? Please describe

For single-cell data, many genes change for technical rather than biological reasons; many of these are ribosomal genes or genes associated with single-cell dissociation.

Describe the solution you'd like

Add a fasta of sequences to remove, and/or add the option to provide a precomputed signature of hashes to remove
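The proposed removal step is, at its core, a set difference over hashes. A minimal sketch, assuming hashes are plain integers and ignoring sourmash's actual signature objects (all names here are illustrative, not kmermaid's API):

```python
# Hypothetical sketch of the proposed hash-removal step: given a cell's
# k-mer hashes and a precomputed blocklist of hashes (e.g. derived from a
# ribosomal-sequence signature), keep only hashes not on the blocklist.

def remove_hashes(cell_hashes, blocklist_hashes):
    """Return the cell's hashes with blocklisted hashes removed."""
    return set(cell_hashes) - set(blocklist_hashes)

cell = {11, 22, 33, 44}
ribosomal = {22, 44, 99}
filtered = remove_hashes(cell, ribosomal)
# filtered == {11, 33}
```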


Separate cell barcode counting and filtering

Is your feature request related to a problem? Please describe

Currently, if --tenx_min_umi_per_cell is changed, then count_umi_per_cell restarts from zero, even though all the UMIs are already counted; only the filtering actually needs to change. The count_umi_per_cell process takes ~5-10 hours per bam file, so it would be ideal not to have to rerun it.

Describe the solution you'd like

Set the min umi per cell in count_umi_per_cell to 0, then separately use Nextflow to filter the per-barcode UMI counts using the params.tenx_min_umi_per_cell value.
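The filtering half of this split could be a simple threshold pass over the per-barcode counts. A hypothetical sketch, assuming a two-column barcode,n_umi CSV (the layout and names are assumptions, not kmermaid's actual output format):

```python
import csv
import io

# Illustrative filter for the proposed split: count UMIs once with no
# threshold, then filter the per-barcode counts separately, so changing
# the threshold never re-triggers the expensive counting step.

def filter_barcodes(umi_csv_text, min_umi_per_cell):
    """Yield barcodes whose UMI count meets the threshold."""
    reader = csv.reader(io.StringIO(umi_csv_text))
    for barcode, n_umi in reader:
        if int(n_umi) >= min_umi_per_cell:
            yield barcode

counts = "AAAC,1500\nAAAG,20\nAAAT,900\n"
good = list(filter_barcodes(counts, 100))
# good == ["AAAC", "AAAT"]
```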


Add hash2kmer.py code to get potential genes and sequences for each k-mer hash

Is your feature request related to a problem? Please describe

Currently, there is no automated way to go back from each k-mer hash to the sequence that created it. There is code to do this, like here, but it is not yet integrated into kmermaid. Since there are often questions about which k-mers contribute to a cell type call, having the underlying sequence and read associated with each k-mer would be very useful.

Describe the solution you'd like

Add hash2kmer as a process, performing it on each cell's signatures individually. For bam input, keep aligned and unaligned separate.

Describe alternatives you've considered

The way I do this right now is in a hacky Jupyter notebook.

Only allow PRs to master from dev

To follow good development practice, we should only allow PRs to master from dev; all other PRs should target dev. This can be accomplished by adding these lines to .travis.yml:

before_install:
  # PRs to master are only ok if coming from dev branch
  - '[ $TRAVIS_PULL_REQUEST = "false" ] || [ $TRAVIS_BRANCH != "master" ] || ([ $TRAVIS_PULL_REQUEST_SLUG = $TRAVIS_REPO_SLUG ] && ([ $TRAVIS_PULL_REQUEST_BRANCH = "dev" ] || [ $TRAVIS_PULL_REQUEST_BRANCH = "patch" ]))'

Update to sourmash 4.0

Is your feature request related to a problem? Please describe

The newest Sourmash 4.0 release is much faster (many operations moved from Python to Rust), and includes a new sourmash sketch command that allows for separately making DNA and protein sketches. This is super helpful as the parameters for protein sketches vs DNA sketches are different. Additionally, it now has native support for amino acid k-mer sizes! E.g. I'd like to do:

  • DNA nucleotide k-mer size=21
  • Protein amino acid k-mer size=10
  • Dayhoff amino acid k-mer size=17

Right now, this has to be done all in one command, and all alphabets get k-merized at each k-size, which doesn't really make sense. Dayhoff with nucleotide k=21 has far too low information content to be usable. While DNA has 4^21 options at ksize=21, Dayhoff is an amino acid alphabet, so the effective k-mer size is 21/3 = 7, and 6^7 << 4^21 does not carry enough information to distinguish between cell types. It's basically random at that point.

https://github.com/dib-lab/sourmash/blob/5e66db91e62353de2b79f23cd198ef6f5c5544d1/doc/sourmash-sketch.md
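The alphabet arithmetic above can be checked directly (alphabet sizes: DNA 4, protein 20, Dayhoff 6; one amino acid per codon):

```python
# Back-of-the-envelope check of the information-content argument:
# compare the number of possible k-mers per alphabet when the
# nucleotide k-mer size is 21 (i.e. 7 amino acids after translation).

DNA, PROTEIN, DAYHOFF = 4, 20, 6  # alphabet sizes
nuc_k = 21
aa_k = nuc_k // 3  # one amino acid per codon -> 7

dna_space = DNA ** nuc_k         # 4^21 = 4,398,046,511,104
dayhoff_space = DAYHOFF ** aa_k  # 6^7  = 279,936

print(dna_space, dayhoff_space)
```

The Dayhoff space at this k is over 15 million times smaller than the DNA space, which is why longer Dayhoff k-mers (e.g. k=17) are needed for comparable specificity.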

Describe the solution you'd like

Add Sourmash 4.0: https://anaconda.org/bioconda/sourmash
(released ~1 week ago)

Describe alternatives you've considered

Could stay with current Sourmash but this is the future!!


TEMPLATE needs cleaning

Sorry about this, one last issue with your TEMPLATE branch..

Creating it in #92 did work and sure enough, you are now getting automated sync PRs such as #138

However, the TEMPLATE branch should only contain commits with the pure template and not any commits with a developed pipeline on it. Typically this means that there should only be a handful of commits on that branch (eg. rnaseq TEMPLATE has 13 commits). However it seems that your TEMPLATE has most of the development history in it with 1309 commits: https://github.com/nf-core/kmermaid/commits/TEMPLATE

This doesn't prevent the sync from working, but we discovered after a lot of trial and error that it makes the automated sync PRs super difficult to merge. Because git sees that you used to have the pipeline code in TEMPLATE and then deleted it to replace it with a vanilla nf-core create output, it consistently wants to delete most of your pipeline in each sync PR - not great 😓

The solution is hopefully not too bad - assuming that your TEMPLATE branch commit history does go back far enough to have vanilla commits, we can just git reset --hard <sha> to that commit and then force push, wiping all of the commit history from the branch. Then we can re-do the nf-core sync to bring TEMPLATE up to date.

Once cleaned, this problem should hopefully not happen again. You can see in #138 that we changed the automated sync procedure to create a new branch each time and form a PR from that instead of directly from TEMPLATE. This means that you can fix the merge conflicts in that PR without pushing the dev history back into TEMPLATE.

I hope this makes sense; shout if you have any questions or would like some help 👍🏻

Separate outputs by program

Right now, all the sourmash compare outputs are flat in results but should be in results/sourmash/compare

KMERMAID: Convert all parameter docs to JSON schema

Hi!

this is not necessarily an issue with the pipeline, but in order to streamline the documentation group next week for the hackathon, I'm opening issues in all repositories / pipeline repos that might need this update to switch from parameter docs to auto-generated documentation based on the JSON schema.

This will then supersede any further parameter documentation, thus making things a bit easier :-)

If this doesn't apply (anymore), please close the issue. Otherwise, I'm hoping to have some helping hands on this next week in the documentation team on Slack https://nfcore.slack.com/archives/C01QPMKBYNR

Upgrade tqdm to 4.29.1 for tqdm.auto

As mentioned here: nteract/papermill#287

Otherwise get this error (from using #48 )

Caused by:
  Process `count_umis_per_cell (ANTOINE_LIVER)` terminated with an error exit status (1)

Command executed:

  count_umis_per_cell.py \
      --reads ANTOINE_LIVER__aligned.fastq.gz \
      --min-umi-per-cell 5001 \
      --cell-barcode-pattern '(CB|CB):Z:([ACGT]+)(\-1)?' \
      --molecular-barcode-pattern '(UB|XB):Z:([ACGT]+)' \
      --umis-per-cell ANTOINE_LIVER__n_umi_per_cell.csv \
      --good-barcodes ANTOINE_LIVER__barcodes.tsv

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/home/olga/.nextflow/assets/nf-core/kmermaid/bin/count_umis_per_cell.py", line 9, in <module>
      from tqdm.auto import tqdm
  ModuleNotFoundError: No module named 'tqdm.auto'

Add --scaled option for sketches

Right now there's only a "flat rate" option to scale the sequences by a flat number of hashes, but there should be an option to use --scaled which gets e.g. every 1/1000 hashes instead of the same number for all. This accounts for sequencing depth.

Personally, I like using log2 scaling because I think life is log2 scaled and it's much easier to increase to a "natural" amount that's larger. But I know others (@bluegenes included) prefer a little more control, e.g. --scaled 500 or --scaled 1. So we'll probably need to support both.

Possible parameter names:

  • --scaled_sketch
  • --scaled_sketch_log2

.. there's probably better options

While we're at it, should there be a non-log2 option for --log2sketchsize? And maybe one separated by underscores?
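For contrast, the two schemes differ in what stays constant: a fixed/log2 sketch size fixes the number of hashes regardless of sequencing depth, while --scaled retains roughly 1/scaled of all distinct hashes, so the sketch grows with sequence complexity. A toy sketch of the expected sketch sizes (function names are illustrative):

```python
# Expected number of hashes under the two sketch-size schemes discussed
# above. These helpers are illustrative, not kmermaid/sourmash API.

def expected_hashes_fixed(log2_sketch_size):
    """Fixed-size sketch: the same hash count at any depth."""
    return 2 ** log2_sketch_size

def expected_hashes_scaled(n_distinct_kmers, scaled):
    """Scaled sketch: roughly 1/scaled of all distinct k-mer hashes kept."""
    return n_distinct_kmers // scaled

print(expected_hashes_fixed(10))                # 1024 hashes, any depth
print(expected_hashes_scaled(5_000_000, 1000))  # ~5000 hashes
```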

No space left on device for bam

@pranathivemuri Let me know if this is the right place to file this, or whether this should be in https://github.com/czbiohub/bam2fasta/issues/

Using bam2fasta in this pipeline, the docker image runs out of space:

[0;35m[nf-core/kmermaid] Pipeline completed with errors
ERROR ~ Error executing process > 'sourmash_compute_sketch_bam (possorted_genome_bam_molecule-dna_ksize-21_log2sketchsize-10)'

Caused by:
  Process `sourmash_compute_sketch_bam (possorted_genome_bam_molecule-dna_ksize-21_log2sketchsize-10)` terminated with an error exit status (1)

Command executed:

  sourmash compute \
    --ksize 21 \
    --dna \
     \
    --num-hashes $((2**10)) \
    --processes 32 \
    --count-valid-reads 0 \
    --line-count 350 \
     \
     \
    --save-fastas . \
     \
    --output possorted_genome_bam_molecule-dna_ksize-21_log2sketchsize-10.sig \
    --input-is-10x possorted_genome_bam.bam
  find . -type f -name "*.fasta" | while read src; do if [[ $src == *"|"* ]]; then mv "$src" $(echo "$src" | tr "|" "_"); fi done

Command exit status:
  1

Command output:
  (empty)

Command error:
  385018218it [5:24:57, 19305.21it/s]
  385020154it [5:24:57, 18632.53it/s]
  385022079it [5:24:57, 18707.53it/s]
  385023956it [5:24:57, 18336.06it/s]
  385025940it [5:24:57, 18724.69it/s]
  385027819it [5:24:57, 18663.82it/s]
  385029802it [5:24:57, 18631.38it/s]
  385031877it [5:24:57, 19141.33it/s]
  385033797it [5:24:57, 18604.88it/s]
  385035760it [5:24:58, 18900.91it/s]
  385037657it [5:24:58, 18336.40it/s]
  385039630it [5:24:58, 18558.35it/s]
  385041705it [5:24:58, 19165.53it/s]
  385043631it [5:24:58, 18932.62it/s]
  385045532it [5:24:58, 18821.15it/s]
  385047420it [5:24:58, 18601.51it/s]
  385049457it [5:24:58, 19039.14it/s]
  385051193it [5:24:58, 19747.42it/s]
  multiprocess.pool.RemoteTraceback:
  """
  Traceback (most recent call last):
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
      result = (True, func(*args, **kwds))
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/multiprocess/pool.py", line 44, in mapstar
      return list(map(*args))
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/bam2fasta/cli.py", line 289, in <lambda>
      lambda x: func(x.encode('utf-8')),
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/bam2fasta/tenx_utils.py", line 213, in bam_to_temp_fasta
      filenames = list(set(write_cell_sequences(cell_sequences, delimiter)))
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/bam2fasta/tenx_utils.py", line 249, in write_cell_sequences
      with open(filename, "a") as f:
  OSError: [Errno 28] No space left on device: '/tmp/tmpoxumgy9g/CGCCAAGAGAGTAATC-1.fasta'
  """

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/bin/sourmash", line 8, in <module>
      sys.exit(main())
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/sourmash/__main__.py", line 83, in main
      cmd(sys.argv[2:])
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/sourmash/command_compute.py", line 309, in compute
      fastas = bam2fasta_cli.convert(bam_to_fasta_args)
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/bam2fasta/cli.py", line 290, in convert
      filenames, chunksize=chunksize))))
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/multiprocess/pool.py", line 320, in <genexpr>
      return (item for chunk in result for item in chunk)
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/multiprocess/pool.py", line 735, in next
      raise value
  OSError: [Errno 28] No space left on device: '/tmp/tmpoxumgy9g/CGCCAAGAGAGTAATC-1.fasta'

Work dir:
  /home/olga/pureScratch/nextflow-intermediates/1c/b4f769b576791ee0f5a78e44ad72f5

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`

 -- Check '.nextflow.log' file for details

Looking at this error (specifically the solution on GitHub here), it can be resolved by setting docker.temp = "auto", which creates a fresh temporary directory for each container, as described in the Nextflow Docker scope documentation.

Add option to make per-cell bam files

Currently, if one wants to count reads with differential hashes in genes, one needs to grep/search the ENTIRE 22-gigabyte channel bam file for one single cell (out of ~700,000), which is extremely inefficient. So let's do this work up front: after filtering for the good barcodes, add the option to create per-cell bam files, which are useful for nf-predictorthologs.

script:
barcode_pattern = "CB:Z:${cell_barcode}-1|XC:Z:${cell_barcode}" 
"""
samtools view ${channel_bam} \\
  | rg --threads ${task.cpus} '${barcode_pattern}' - \\
  | cat ${header_sam} - \\
  | samtools view -Sb > ${cell_barcode_bam}
"""

@lekhakaranam may be a good feature to add after the template merge (#93 )

Pipeline has 2 releases, but default branch is dev


Add sortMeRNA for removing ribosomal reads

Doing differential hash expression on the k-mer sketches generated from kmermaid results in a bunch of ribosomal genes and other sequencing-depth-dependent things. So let's add the option to remove ribosomal reads.
