curatedmetagenomicsnextflow's Introduction

README

The pipeline, in brief

  • fasterq-dump
  • setup databases
    • metaphlan
    • chocophlan
    • uniref
  • metaphlan bug list
  • metaphlan markers
  • humann (including various aggregations)

Inputs and Outputs

The metadata_tsv file:

  • must be tab-separated
  • must contain the columns
    • sample_id
    • study_name
    • NCBI_accession, a semicolon-separated list of SRRs
  • can be a local file or a web URL
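
The requirements above can be checked before launching a run. The following is a minimal sketch, assuming a local copy of the file; `check_metadata_tsv` and `list_accessions` are hypothetical helper names and are not part of the pipeline.

```shell
# Hypothetical helper: verify that the header row of a metadata TSV
# contains the three required columns.
check_metadata_tsv() {
  file="$1"
  # Read the header row, dropping any Windows carriage return
  header=$(head -n 1 "$file" | tr -d '\r')
  for col in sample_id study_name NCBI_accession; do
    # Split the header on tabs and look for an exact column-name match
    if ! printf '%s\n' "$header" | tr '\t' '\n' | grep -qxF "$col"; then
      echo "missing column: $col" >&2
      return 1
    fi
  done
}

# Hypothetical helper: print every SRR accession, one per line, by
# splitting the NCBI_accession column on semicolons.
list_accessions() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "NCBI_accession") c = i; next }
    c { n = split($c, a, ";"); for (j = 1; j <= n; j++) print a[j] }
  ' "$1"
}
```

For example, `check_metadata_tsv samplesheet.tsv && list_accessions samplesheet.tsv` would print the accessions only if the header is complete.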

If using a Google Bucket, the bucket name must not contain underscores.

On sample ids

We use a simple approach to create sample ids: the study_name and sample_id are joined with ::, and the result is base64-encoded. For example:

echo 'study_name1::sample_name1' | base64

This yields:

c3R1ZHlfbmFtZTE6OnNhbXBsZV9uYW1lMQo=

To decode a sample id:

echo 'c3R1ZHlfbmFtZTE6OnNhbXBsZV9uYW1lMQo=' | base64 -d

which gives back the original string:

study_name1::sample_name1
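
Note that `echo` appends a newline, so the example above encodes that newline as well (the trailing `Qo=`). The sketch below wraps encoding and decoding in two small functions and uses `printf` to leave the newline out, which appears consistent with the ids shown in the pipeline's log output (for example `UFJKTkE1MDQ4NDY6OlNBTU4xMDQwNjczNQ==`). The function names are hypothetical, and `base64 -d` assumes GNU coreutils or a compatible implementation.

```shell
# Hypothetical helper: build a sample id from a study name and a sample id.
encode_sample_id() {
  # printf, unlike echo, does not append a newline before encoding;
  # tr removes the line wrapping that base64 adds to long output
  printf '%s::%s' "$1" "$2" | base64 | tr -d '\n'
}

# Hypothetical helper: recover "study_name::sample_id" from a sample id.
decode_sample_id() {
  printf '%s' "$1" | base64 -d
}

encode_sample_id study_name1 sample_name1
# prints c3R1ZHlfbmFtZTE6OnNhbXBsZV9uYW1lMQ==
```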

Install

export NXF_MODE=google
curl https://get.nextflow.io | bash

Google setup

You will need access to Google Cloud Storage and permission to run the Google Cloud Pipeline API, both of which require credentials. You can use either Google Application Default Credentials or a key file; the key file is the recommended approach. If you need a key file, contact the owner of the Google Project you'll be using.

Keyfile setup

Once you have a key file (a JSON file), run the following:

# just examples:
export SVC_ACCOUNT='nextflow-service-account@curatedmetagenomicdata.iam.gserviceaccount.com' #example name
export GOOGLE_APPLICATION_CREDENTIALS=/data/curatedmetagenomicdata-f7fc1489b036.json
export GCP_PROJECT=curatedmetagenomicdata
gcloud auth activate-service-account \
   $SVC_ACCOUNT \
   --key-file=$GOOGLE_APPLICATION_CREDENTIALS \
   --project=$GCP_PROJECT

Execution

This assumes that you are running on Google, that credentials are set up correctly, and that you have already created a Google Storage Bucket. Note that bucket names must NOT contain underscores (_) or other special characters.

# No '_' or other non-url-safe characters in bucket names
export GOOGLE_BUCKET_NAME='your-bucket-name'

# if bucket does not exist:
gsutil mb gs://$GOOGLE_BUCKET_NAME
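
A bucket name can be checked against this rule before the bucket is created. This is a minimal sketch: `check_bucket_name` is a hypothetical helper, and the allowed character set (lowercase letters, digits, and hyphens) is an assumption that is deliberately stricter than the rule stated above.

```shell
# Hypothetical helper: reject bucket names that are empty or contain
# anything other than lowercase letters, digits, and hyphens.
# The allowed set is an assumption, not a rule taken from the pipeline.
check_bucket_name() {
  case "$1" in
    ''|*[!a-z0-9-]*)
      echo "invalid bucket name: $1" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

For example, `check_bucket_name "$GOOGLE_BUCKET_NAME" && gsutil mb gs://$GOOGLE_BUCKET_NAME` would only attempt to create the bucket when the name passes the check.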

You can now run the test data. This will take a few hours the first time, so run it on a system that will stay on for that long (a laptop is not a good choice if you are going to close it and go home, for example).

./nextflow run seandavi/curatedMetagenomicsNextflow \
  -profile google \
  -work-dir gs://$GOOGLE_BUCKET_NAME/work \
  --publish_dir=gs://$GOOGLE_BUCKET_NAME/results \
  --store_dir=gs://$GOOGLE_BUCKET_NAME/store \
  -resume \
  -r main \
  --metadata_tsv https://raw.githubusercontent.com/seandavi/curatedMetagenomicsNextflow/main/samplesheet.test.tsv

To view results:

gsutil ls -larh gs://$GOOGLE_BUCKET_NAME/results

To view an individual file:

gsutil cat PATH_TO_GOOGLE_OBJECT # from above list

To clean up (this deletes the bucket and everything in it):

gsutil -m rm -r gs://$GOOGLE_BUCKET_NAME

nf-core tools integration

pip install nf-core
nf-core launch seandavi/curatedMetagenomicsNextflow

curatedmetagenomicsnextflow's People

Contributors

jwokaty, seandavi, shbrief

curatedmetagenomicsnextflow's Issues

Additional Documentation

Hi Sean,

I was able to run the pipeline on the sample data. You mentioned you might like help with documentation. If you could tell me what you need help with, I'd be happy to write and submit a PR.

possible to change repo name

Hi Sean, would it be ok to change the repo name to curatedMetagenomicDataNextflow? I've been trying to move everything towards consistent naming and it would help. Hope it's not too much to ask!

add weblog capability

  • -with-weblog

https://www.nextflow.io/docs/latest/tracing.html?highlight=graphviz#weblog-via-http

This sends JSON with these keys:

runName	The workflow execution run name.
runId	The workflow execution unique ID.
event	The workflow execution event. One of started, process_submitted, process_started, process_completed, error, completed.
utcTime	The UTC timestamp in ISO 8601 format.
trace	A process runtime information as described in the trace fields section. This attribute is only provided for the following events: process_submitted, process_started, process_completed, error.
metadata	The workflow metadata including the config manifest. For a list of all fields, have a look at the bottom message examples. This attribute is only provided for the following events: started, completed.

Gene families mapping

@paolinomanghi and I were thinking about not directly including the mapped gene families as a resource for cMD. HUMAnN provides mappings for

  • EC
  • KO
  • GO
  • eggNOG
  • PFAM

It also provides all the descriptions for UniRef90/50 and the mapped categories.

Instead of providing these files, which would have to be computed for both the stratified and unstratified gene families, we thought about having one function that downloads the mapping files and another that, given the gene families (stratified or unstratified), generates a table with the mapped gene families and/or the descriptions.

What do you all think? @seandavi @lwaldron @nsegata @vjcitn

storage exhausted while writing file within file system module

Hi @seandavi,

We're trying to run the pipeline on a mouse dataset, but we're running into the following error (I've only copied the tail end of it, since there's a lot of output, but I can share more if needed):

[08/beb018] process > fasterq_dump (UFJKTkE1MDQ4NDY6OlNBTU4xMDQwNjczNQ==) [100%] 80 of 80, failed: 80, retries: 60 ✘
[skipped  ] process > install_metaphlan_db                                [100%] 1 of 1, stored: 1 ✔
[skipped  ] process > uniref_db                                           [100%] 1 of 1, stored: 1 ✔
[d5/812f32] process > chocophlan_db                                       [100%] 1 of 1 ✔
[-        ] process > metaphlan_bugs_list                                 -
[-        ] process > metaphlan_markers                                   -
[-        ] process > humann                                              -
Error executing process > 'fasterq_dump (UFJKTkE1MDQ4NDY6OlNBTU4xMDQwNjcyNg==)'

Caused by:
  Process `fasterq_dump (UFJKTkE1MDQ4NDY6OlNBTU4xMDQwNjcyNg==)` terminated with an error exit status (3)

Command executed:

  echo "accessions: [SRR8291347]" > sampleinfo.txt
  fasterq-dump           --skip-technical           --force           --threads 4           --split-files SRR8291347
  cat *.fastq | gzip > out.fastq.gz
  gunzip -c out.fastq.gz | wc -l > fastq_line_count.txt
  rm *.fastq
  seqtk sample -s100 out.fastq.gz 50000 > out_sample.fastq
  fastqc --extract out_sample.fastq
  rm out_sample.fastq

Command exit status:
  3

Command output:
  ++ stat PEAK=0 21 1003020 353560 1068572 353704 154 3
  
  ++ stat mem=37 0 5492 3224 5508 3224 1 0
  ++ stat mem=42 0 10356 6904 10356 6904 56 0
  ++ stat mem=101 21 985116 343376 1052708 343576 97 3
  ++ stat SUM=0 21 1000964 353504 1068572 353704 154 3
  ++ stat PEAK=0 21 1003020 353560 1068572 353704 154 3
  
  ++ stat mem=37 0 5492 3224 5508 3224 1 0
  ++ stat mem=42 0 10356 6904 10356 6904 56 0
  ++ stat mem=101 21 985116 343904 1052708 343904 97 3
  ++ stat SUM=0 21 1000964 354032 1068572 354032 154 3
  ++ stat PEAK=0 21 1003020 354032 1068572 354032 154 3
  
  ++ stat mem=37 0 5492 3224 5508 3224 1 0
  ++ stat mem=42 0 10356 6904 10356 6904 56 0
  ++ stat mem=101 21 985116 344432 1052708 344432 97 3
  ++ stat SUM=0 21 1000964 354560 1068572 354560 154 3
  ++ stat PEAK=0 21 1003020 354560 1068572 354560 154 3
  
  ++ stat mem=37 0 5492 3224 5508 3224 1 0
  ++ stat mem=42 0 10356 6904 10356 6904 56 0
  ++ stat mem=101 21 985116 344960 1052708 344960 97 3
  ++ stat SUM=0 21 1000964 355088 1068572 355088 154 3
  ++ stat PEAK=0 21 1003020 355088 1068572 355088 154 3
  
  ++ stat mem=37 0 5492 3224 5508 3224 1 0
  ++ stat mem=42 0 10356 6904 10356 6904 56 0
  ++ stat mem=101 21 985116 345224 1052708 345224 97 3
  ++ stat SUM=0 21 1000964 355352 1068572 355352 154 3
  ++ stat PEAK=0 21 1003020 355352 1068572 355352 154 3
  
  ++ stat mem=37 0 5492 3224 5508 3224 1 0
  ++ stat mem=42 0 10356 6904 10356 6904 56 0
  ++ stat mem=101 21 983060 336960 1052708 345624 97 3
  ++ stat SUM=0 21 998908 347088 1068572 355752 154 3
  ++ stat PEAK=0 21 1003020 355352 1068572 355752 154 3
  
  ++ stat mem=37 0 5492 3224 5508 3224 1 0
  ++ stat mem=42 0 10356 6904 10356 6904 56 0
  ++ stat mem=101 21 983060 337224 1052708 345624 97 3
  ++ stat SUM=0 21 998908 347352 1068572 355752 154 3
  ++ stat PEAK=0 21 1003020 355352 1068572 355752 154 3
  
  ++ stat mem=37 0 5492 3224 5508 3224 1 0
  ++ stat mem=42 0 10356 6904 10356 6904 56 0
  ++ stat mem=101 20 954396 327072 1052708 345624 98 3
  ++ stat SUM=0 20 970244 337200 1068572 355752 155 3
  ++ stat PEAK=0 21 1003020 355352 1068572 355752 155 3

Command error:
  2022-01-05T18:10:33 fasterq-dump.2.10.8 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='7'
  2022-01-05T18:10:33 fasterq-dump.2.10.8 err: cmn_iter.c cmn_read_String( #23357410 ).VCursorCellDataDirect() -> RC(rcXF,rcFunction,rcExecuting,rcParam,rcBadVersion) 
  fasterq-dump quit with error code 3

Work dir:
  gs://shtest2022/work/ee/ea381704665e16bb3083e67553c237

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Is there a setting that we need to adjust in the config to allow it to use more space, or is there an issue with fasterq-dump that we're not understanding? We'd appreciate any suggestions.

Can I run the pipeline locally?

Hi :)
I'm new to Nextflow. I would like to know whether it is possible to run your software locally.

I tried

nf-core launch seandavi/curatedMetagenomicsNextflow

but it fails with the following error:
╭─────────────────────────────────────────────────────── Traceback (most recent call last) ────────────────────────────────────────────────────────╮
│ /home/lzhang/miniconda3/envs/nextflow_cM/bin/nf-core:10 in │
│ │
│ 9 │ sys.argv[0] = re.sub(r'(-script.pyw|.exe)?$', '', sys.argv[0]) │
│ ❱ 10 │ sys.exit(run_nf_core()) │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/nf_core/main.py:62 in run_nf_core │
│ │
│ 61 │ # Lanch the click cli │
│ ❱ 62 │ nf_core_cli() │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/click/core.py:829 in call
│ │
│ 828 │ │ """Alias for :meth:main.""" │
│ ❱ 829 │ │ return self.main(*args, **kwargs) │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/click/core.py:782 in main │
│ │
│ 781 │ │ │ │ with self.make_context(prog_name, args, **extra) as ctx: │
│ ❱ 782 │ │ │ │ │ rv = self.invoke(ctx) │
│ 783 │ │ │ │ │ if not standalone_mode: │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/click/core.py:1259 in invoke │
│ │
│ 1258 │ │ │ │ with sub_ctx: │
│ ❱ 1259 │ │ │ │ │ return _process_result(sub_ctx.command.invoke(sub_ctx)) │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/click/core.py:1066 in invoke │
│ │
│ 1065 │ │ if self.callback is not None: │
│ ❱ 1066 │ │ │ return ctx.invoke(self.callback, **ctx.params) │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/click/core.py:610 in invoke │
│ │
│ 609 │ │ │ with ctx: │
│ ❱ 610 │ │ │ │ return callback(*args, **kwargs) │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/nf_core/main.py:203 in launch │
│ │
│ 202 │ ) │
│ ❱ 203 │ if launcher.launch_pipeline() == False: │
│ 204 │ │ sys.exit(1) │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/nf_core/launch.py:170 in launch_pipeline │
│ │
│ 169 │ │ │ │ # Kick off the interactive wizard to collect user inputs │
│ ❱ 170 │ │ │ │ self.prompt_schema() │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/nf_core/launch.py:396 in prompt_schema │
│ │
│ 395 │ │ │ d_key = allOf["$ref"][14:] │
│ ❱ 396 │ │ │ answers.update(self.prompt_group(d_key, │
│ self.schema_obj.schema["definitions"][d_key])) │
│ │
│ /home/lzhang/miniconda3/envs/nextflow_cM/lib/python3.6/site-packages/nf_core/launch.py:476 in prompt_group │
│ │
│ 475 │ │ │ │
│ ❱ 476 │ │ │ for param_id, param in group_obj["properties"].items(): │
│ 477 │ │ │ │ if not param.get("hidden", False) or self.show_hidden: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'properties'

Thanks!
