
k-mer similarity analysis pipeline

Home Page: https://nf-co.re/kmermaid

License: MIT License

Dockerfile 0.61% HTML 2.16% Python 11.56% Nextflow 85.67%
nf-core workflow pipeline nextflow kmer k-mer kmer-frequency-count kmer-counting

kmermaid's Introduction

nf-core/kmermaid

Compare DNA/RNA/protein sequences on k-mer content.


Introduction

Workflow overview

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It ships with Docker containers, making installation straightforward and results highly reproducible.

Quick Start

i. Install nextflow

ii. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort; see docs)

iii. Download the pipeline and test it on a minimal dataset with a single command

nextflow run nf-core/kmermaid -profile test,<docker/singularity/conda/institute>

Please check nf-core/configs to see if a custom config file for running nf-core pipelines already exists for your institute. If so, you can simply use -profile <institute> in your command. This will enable either Docker or Singularity and set the appropriate execution settings for your local compute environment.

iv. Start running your own analysis!

nextflow run nf-core/kmermaid -profile <docker/singularity/conda/institute> --input '*_R{1,2}.fastq.gz' --genome GRCh37

See usage docs for all of the available options when running the pipeline.

Documentation

The nf-core/kmermaid pipeline comes with documentation about the pipeline, found in the docs/ directory:

  1. Installation
  2. Pipeline configuration
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Usage

With a samples.csv file

nextflow run nf-core/kmermaid --outdir s3://bucket/sub-bucket --samples samples.csv

With R1, R2 read pairs

nextflow run nf-core/kmermaid --outdir s3://olgabot-maca/nf-kmer-similarity/ \
  --read_pairs 's3://bucket/sub-bucket/*{R1,R2}*.fastq.gz,s3://bucket/sub-bucket2/*{1,2}.fastq.gz'

With SRA ids

nextflow run nf-core/kmermaid --outdir s3://bucket/sub-bucket --sra SRP016501

With fasta files

nextflow run nf-core/kmermaid --outdir s3://bucket/sub-bucket \
  --fastas '*.fasta'

With bam file

nextflow run nf-core/kmermaid  --outdir s3://bucket/sub-bucket \
  --bam 'possorted_genome_bam.bam'

With split kmer

nextflow run nf-core/kmermaid --outdir s3://bucket/sub-bucket --samples samples.csv --split_kmer --subsample 1000

Credits

nf-core/kmermaid was originally written by Olga Botvinnik.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on Slack (you can join with this invite).

Citation

If you use nf-core/kmermaid for your analysis, please cite it using the following doi: 10.5281/zenodo.4143940

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
ReadCube: Full Access Link

kmermaid's People

Contributors

ewels, maxulysse, nf-core-bot, olgabot, phoenixaja, pranathivemuri, snafees


kmermaid's Issues

No TEMPLATE branch

Hello!

I just pushed a new release of nf-core/tools with the template, but kmermaid failed to sync because it doesn't have a TEMPLATE branch (see nf-core/tools#709 (comment)).

Normally this branch is created when the pipeline repo is "vanilla" with no modifications from the nf-core/tools template. The TEMPLATE branch needs to have a common history with a commit in the main branch which has this state.

It looks like this pipeline wasn't created with nf-core create, so this is a little trickier: it needs a manual merge. This process is described in https://nf-co.re/developers/sync#manual-synchronisation

Once you've done this once, all future automated PRs should work fine.

Shout if you have any problems!

Phil

Add option to skip compare

Lately, I have been completely ignoring the sourmash compare step/results so let's add an option to skip it!

Reasons for skipping:

  • Too many samples to compare so it'll take too long/run out of memory anyway
  • Won't actually use the compare result

Add `sourmash index` option

While I'm not doing a lot of sourmash compare (see #75) I may still want to make an SBT index.

Reasons:

  • SBT indices make it easy to add a single new sample; comparison matrices do not
  • SBT indices may (eventually..) be localized and fed into a KNN graph algorithm using tools like UMAP (sourmash-bio/sourmash#925)

BUG: `--peptide_fasta` is only for translation

Before adding khtools extract-coding/sencha translate, older versions of kmermaid allowed --peptide_fasta to be a set of protein FASTA files that could then be sketched directly via sourmash. Now, --peptide_fasta is used only for sencha translate, and there is no option to input already-translated protein fasta files directly. This is a problem for me because I have ~132 protein fasta files that took 3 days to process, at up to ~12 hours (720m+) per fasta.


I'd like to input these files directly into sketch making so I don't have to go through the translation step again.

Add option to remove hashes, e.g. ribosomal seqs

Is your feature request related to a problem? Please describe

For single-cell data, many genes change for technical rather than biological reasons; many of these are ribosomal genes or genes associated with single-cell dissociation.

Describe the solution you'd like

Add a fasta of sequences to remove, and/or add the option to provide a precomputed signature of hashes to remove
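The proposed removal step is, at its core, a set difference over hashes. A minimal sketch, assuming hashes are plain integers and ignoring sourmash's actual signature objects (all names here are illustrative, not kmermaid's API):

```python
# Hypothetical sketch of the proposed hash-removal step: given a cell's
# k-mer hashes and a precomputed blocklist of hashes (e.g. derived from a
# ribosomal-sequence signature), keep only hashes not on the blocklist.

def remove_hashes(cell_hashes, blocklist_hashes):
    """Return the cell's hashes with blocklisted hashes removed."""
    return set(cell_hashes) - set(blocklist_hashes)

cell = {11, 22, 33, 44}
ribosomal = {22, 44, 99}
filtered = remove_hashes(cell, ribosomal)
# filtered == {11, 33}
```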


Separate cell barcode counting and filtering

Is your feature request related to a problem? Please describe

Currently, if --tenx_min_umi_per_cell is changed, then count_umi_per_cell restarts from zero, even though all the UMIs are already counted; only the filtering actually needs to change. The count_umi_per_cell process takes ~5-10 hours per bam file, so it would be ideal not to have to rerun it.

Describe the solution you'd like

Set the min umi per cell in count_umi_per_cell to 0, then separately use Nextflow to filter the per-barcode UMI counts using the params.tenx_min_umi_per_cell value.
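The filtering half of this split could be a simple threshold pass over the per-barcode counts. A hypothetical sketch, assuming a two-column barcode,n_umi CSV (the layout and names are assumptions, not kmermaid's actual output format):

```python
import csv
import io

# Illustrative filter for the proposed split: count UMIs once with no
# threshold, then filter the per-barcode counts separately, so changing
# the threshold never re-triggers the expensive counting step.

def filter_barcodes(umi_csv_text, min_umi_per_cell):
    """Yield barcodes whose UMI count meets the threshold."""
    reader = csv.reader(io.StringIO(umi_csv_text))
    for barcode, n_umi in reader:
        if int(n_umi) >= min_umi_per_cell:
            yield barcode

counts = "AAAC,1500\nAAAG,20\nAAAT,900\n"
good = list(filter_barcodes(counts, 100))
# good == ["AAAC", "AAAT"]
```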


Add hash2kmer.py code to get potential genes and sequences for each k-mer hash

Is your feature request related to a problem? Please describe

Currently, there is no automated way to go back from each k-mer hash to the sequence that created it. There is code to do this, like here, but it is not yet integrated into kmermaid. Since there are often questions about which k-mers contribute to a cell type call, having the underlying sequence and read associated with each k-mer would be very useful.

Describe the solution you'd like

Add hash2kmer as a process, performing it on each cell's signatures individually. For bam input, keep aligned and unaligned separate.

Describe alternatives you've considered

The way I do this right now is in a hacky Jupyter notebook.

Only allow PRs to master from dev

To follow good development practice, we should only allow PRs to master from dev; all other PRs should target dev. This can be accomplished by adding these lines to .travis.yml:

before_install:
  # PRs to master are only ok if coming from dev branch
  - '[ $TRAVIS_PULL_REQUEST = "false" ] || [ $TRAVIS_BRANCH != "master" ] || ([ $TRAVIS_PULL_REQUEST_SLUG = $TRAVIS_REPO_SLUG ] && ([ $TRAVIS_PULL_REQUEST_BRANCH = "dev" ] || [ $TRAVIS_PULL_REQUEST_BRANCH = "patch" ]))'

Update to sourmash 4.0

Is your feature request related to a problem? Please describe

The newest Sourmash 4.0 release is much faster (many operations moved from Python to Rust), and includes a new sourmash sketch command that allows for separately making DNA and protein sketches. This is super helpful as the parameters for protein sketches vs DNA sketches are different. Additionally, it now has native support for amino acid k-mer sizes! E.g. I'd like to do:

  • DNA nucleotide k-mer size=21
  • Protein amino acid k-mer size=10
  • Dayhoff amino acid k-mer size=17

Right now, this has to be done all in one command, and all alphabets get k-merized at each k-size, which doesn't really make sense. Dayhoff with nucleotide k=21 has far too low information content to be usable. While DNA has 4^21 options at ksize=21, Dayhoff is an amino acid alphabet, so the effective k-mer size is 21/3 = 7, and 6^7 << 4^21 does not carry enough information to distinguish between cell types. It's basically random at that point.

https://github.com/dib-lab/sourmash/blob/5e66db91e62353de2b79f23cd198ef6f5c5544d1/doc/sourmash-sketch.md
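The alphabet arithmetic above can be checked directly (alphabet sizes: DNA 4, protein 20, Dayhoff 6; one amino acid per codon):

```python
# Back-of-the-envelope check of the information-content argument:
# compare the number of possible k-mers per alphabet when the
# nucleotide k-mer size is 21 (i.e. 7 amino acids after translation).

DNA, PROTEIN, DAYHOFF = 4, 20, 6  # alphabet sizes
nuc_k = 21
aa_k = nuc_k // 3  # one amino acid per codon -> 7

dna_space = DNA ** nuc_k         # 4^21 = 4,398,046,511,104
dayhoff_space = DAYHOFF ** aa_k  # 6^7  = 279,936

print(dna_space, dayhoff_space)
```

The Dayhoff space at this k is over 15 million times smaller than the DNA space, which is why longer Dayhoff k-mers (e.g. k=17) are needed for comparable specificity.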

Describe the solution you'd like

Add Sourmash 4.0: https://anaconda.org/bioconda/sourmash
(released ~1 week ago)

Describe alternatives you've considered

Could stay with current Sourmash but this is the future!!


TEMPLATE needs cleaning

Sorry about this, one last issue with your TEMPLATE branch..

Creating it in #92 did work and sure enough, you are now getting automated sync PRs such as #138

However, the TEMPLATE branch should only contain commits with the pure template and not any commits with a developed pipeline on it. Typically this means that there should only be a handful of commits on that branch (eg. rnaseq TEMPLATE has 13 commits). However it seems that your TEMPLATE has most of the development history in it with 1309 commits: https://github.com/nf-core/kmermaid/commits/TEMPLATE

This doesn't prevent the sync from working, but we discovered after a lot of trial and error that it makes the automated sync PRs super difficult to merge. Because git sees that you used to have the pipeline code in TEMPLATE and then deleted it to replace it with a vanilla nf-core create output, it consistently wants to delete most of your pipeline in each sync PR - not great 😓

The solution is hopefully not too bad - assuming that your TEMPLATE branch commit history does go back far enough to have vanilla commits, we can just git reset --hard <sha> to that commit and then force push, wiping all of the commit history from the branch. Then we can re-do the nf-core sync to bring TEMPLATE up to date.

Once cleaned, this problem should hopefully not happen again. You can see in #138 that we changed the automated sync procedure to create a new branch each time and form a PR from that instead of directly from TEMPLATE. This means that you can fix the merge conflicts in that PR without pushing the dev history back into TEMPLATE.

I hope this makes sense; shout if you have any questions or would like some help 👍🏻

Separate outputs by program

Right now, all the sourmash compare outputs are flat in results but should be in results/sourmash/compare

KMERMAID: Convert all parameter docs to JSON schema

Hi!

this is not necessarily an issue with the pipeline, but in order to streamline the documentation group next week for the hackathon, I'm opening issues in all repositories / pipeline repos that might need this update to switch from parameter docs to auto-generated documentation based on the JSON schema.

This will then supersede any further parameter documentation, thus making things a bit easier :-)

If this doesn't apply (anymore), please close the issue. Otherwise, I'm hoping to have some helping hands on this next week in the documentation team on Slack https://nfcore.slack.com/archives/C01QPMKBYNR

Upgrade tqdm to 4.29.1 for tqdm.auto

As mentioned here: nteract/papermill#287

Otherwise get this error (from using #48 )

Caused by:
  Process `count_umis_per_cell (ANTOINE_LIVER)` terminated with an error exit status (1)

Command executed:

  count_umis_per_cell.py \
      --reads ANTOINE_LIVER__aligned.fastq.gz \
      --min-umi-per-cell 5001 \
      --cell-barcode-pattern '(CB|CB):Z:([ACGT]+)(\-1)?' \
      --molecular-barcode-pattern '(UB|XB):Z:([ACGT]+)' \
      --umis-per-cell ANTOINE_LIVER__n_umi_per_cell.csv \
      --good-barcodes ANTOINE_LIVER__barcodes.tsv

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/home/olga/.nextflow/assets/nf-core/kmermaid/bin/count_umis_per_cell.py", line 9, in <module>
      from tqdm.auto import tqdm
  ModuleNotFoundError: No module named 'tqdm.auto'

Add --scaled option for sketches

Right now there's only a "flat rate" option to scale the sequences by a flat number of hashes, but there should be an option to use --scaled which gets e.g. every 1/1000 hashes instead of the same number for all. This accounts for sequencing depth.

Personally, I like using log2 scaling because I think life is log2 scaled and it's much easier to increase to a "natural" amount that's larger. But I know others (@bluegenes included) prefer a little more control, e.g. --scaled 500 or --scaled 1. So we'll probably need to support both.

Possible parameter names:

  • --scaled_sketch
  • --scaled_sketch_log2

.. there's probably better options

While we're at it, should there be a non-log2 option for --log2sketchsize? And maybe one separated by underscores?
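For contrast, the two schemes differ in what stays constant: a fixed/log2 sketch size fixes the number of hashes regardless of sequencing depth, while --scaled retains roughly 1/scaled of all distinct hashes, so the sketch grows with sequence complexity. A toy sketch of the expected sketch sizes (function names are illustrative):

```python
# Expected number of hashes under the two sketch-size schemes discussed
# above. These helpers are illustrative, not kmermaid/sourmash API.

def expected_hashes_fixed(log2_sketch_size):
    """Fixed-size sketch: the same hash count at any depth."""
    return 2 ** log2_sketch_size

def expected_hashes_scaled(n_distinct_kmers, scaled):
    """Scaled sketch: roughly 1/scaled of all distinct k-mer hashes kept."""
    return n_distinct_kmers // scaled

print(expected_hashes_fixed(10))                # 1024 hashes, any depth
print(expected_hashes_scaled(5_000_000, 1000))  # ~5000 hashes
```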

No space left on device for bam

@pranathivemuri Let me know if this is the right place to file this, or whether this should be in https://github.com/czbiohub/bam2fasta/issues/

Using bam2fasta in this pipeline, the docker image runs out of space:

[0;35m[nf-core/kmermaid] Pipeline completed with errors
ERROR ~ Error executing process > 'sourmash_compute_sketch_bam (possorted_genome_bam_molecule-dna_ksize-21_log2sketchsize-10)'

Caused by:
  Process `sourmash_compute_sketch_bam (possorted_genome_bam_molecule-dna_ksize-21_log2sketchsize-10)` terminated with an error exit status (1)

Command executed:

  sourmash compute \
    --ksize 21 \
    --dna \
     \
    --num-hashes $((2**10)) \
    --processes 32 \
    --count-valid-reads 0 \
    --line-count 350 \
     \
     \
    --save-fastas . \
     \
    --output possorted_genome_bam_molecule-dna_ksize-21_log2sketchsize-10.sig \
    --input-is-10x possorted_genome_bam.bam
  find . -type f -name "*.fasta" | while read src; do if [[ $src == *"|"* ]]; then mv "$src" $(echo "$src" | tr "|" "_"); fi done

Command exit status:
  1

Command output:
  (empty)

Command error:
  385018218it [5:24:57, 19305.21it/s]
  385020154it [5:24:57, 18632.53it/s]
  385022079it [5:24:57, 18707.53it/s]
  385023956it [5:24:57, 18336.06it/s]
  385025940it [5:24:57, 18724.69it/s]
  385027819it [5:24:57, 18663.82it/s]
  385029802it [5:24:57, 18631.38it/s]
  385031877it [5:24:57, 19141.33it/s]
  385033797it [5:24:57, 18604.88it/s]
  385035760it [5:24:58, 18900.91it/s]
  385037657it [5:24:58, 18336.40it/s]
  385039630it [5:24:58, 18558.35it/s]
  385041705it [5:24:58, 19165.53it/s]
  385043631it [5:24:58, 18932.62it/s]
  385045532it [5:24:58, 18821.15it/s]
  385047420it [5:24:58, 18601.51it/s]
  385049457it [5:24:58, 19039.14it/s]
  385051193it [5:24:58, 19747.42it/s]
  multiprocess.pool.RemoteTraceback:
  """
  Traceback (most recent call last):
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
      result = (True, func(*args, **kwds))
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/multiprocess/pool.py", line 44, in mapstar
      return list(map(*args))
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/bam2fasta/cli.py", line 289, in <lambda>
      lambda x: func(x.encode('utf-8')),
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/bam2fasta/tenx_utils.py", line 213, in bam_to_temp_fasta
      filenames = list(set(write_cell_sequences(cell_sequences, delimiter)))
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/bam2fasta/tenx_utils.py", line 249, in write_cell_sequences
      with open(filename, "a") as f:
  OSError: [Errno 28] No space left on device: '/tmp/tmpoxumgy9g/CGCCAAGAGAGTAATC-1.fasta'
  """

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/bin/sourmash", line 8, in <module>
      sys.exit(main())
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/sourmash/__main__.py", line 83, in main
      cmd(sys.argv[2:])
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/sourmash/command_compute.py", line 309, in compute
      fastas = bam2fasta_cli.convert(bam_to_fasta_args)
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/bam2fasta/cli.py", line 290, in convert
      filenames, chunksize=chunksize))))
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/multiprocess/pool.py", line 320, in <genexpr>
      return (item for chunk in result for item in chunk)
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/multiprocess/pool.py", line 735, in next
      raise value
  OSError: [Errno 28] No space left on device: '/tmp/tmpoxumgy9g/CGCCAAGAGAGTAATC-1.fasta'

Work dir:
  /home/olga/pureScratch/nextflow-intermediates/1c/b4f769b576791ee0f5a78e44ad72f5

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`

 -- Check '.nextflow.log' file for details

Looking at this error (specifically the solution on GitHub here), it can be resolved by setting docker.temp = "auto", which creates a fresh temporary directory for each container, as described in the Nextflow Docker scope documentation.

Add option to make per-cell bam files

Currently, if one wants to count reads with differential hashes in genes, one needs to grep/search the ENTIRE 22-gigabyte channel bam file for one single cell (out of ~700,000), which is extremely inefficient. So let's do this work up front: after filtering for the good barcodes, add the option to create per-cell bam files, which are useful for nf-predictorthologs.

script:
barcode_pattern = "CB:Z:${cell_barcode}-1|XC:Z:${cell_barcode}" 
"""
samtools view ${channel_bam} \\
  | rg --threads ${task.cpus} '${barcode_pattern}' - \\
  | cat ${header_sam} - \\
  | samtools view -Sb > ${cell_barcode_bam}
"""

@lekhakaranam may be a good feature to add after the template merge (#93 )

Pipeline has 2 releases, but default branch is dev


Add sortMeRNA for removing ribosomal reads

Doing differential hash expression on the k-mer sketches generated from kmermaid results in a bunch of ribosomal genes and other sequencing-depth-dependent things. So let's add the option to remove ribosomal reads.
