
ameta's People

Contributors

clami66, emrahkirdok, leandroritter, percyfal, zoepochon


ameta's Issues

Add support for sample include/exclude

The configuration schema defines a samples: section with include/exclude keys, but no action is currently taken on these settings. The keys list sample ids and would be used to subset samples from the samplesheet, restricting analyses to selected samples.
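A minimal sketch of what acting on these keys could look like, assuming the samplesheet has been read into a dict keyed by sample id (all names are illustrative):

    def subset_samples(samples, config):
        # Apply config["samples"]["include"]/["exclude"] to a
        # {sample_id: metadata} dict; include=None means "keep all".
        cfg = config.get("samples", {})
        include = cfg.get("include")
        exclude = set(cfg.get("exclude", []) or [])
        if include is not None:
            samples = {k: v for k, v in samples.items() if k in set(include)}
        return {k: v for k, v in samples.items() if k not in exclude}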

Add flag --number 100000 to the PMDtools rule

If there are too many reads, the PMDtools rule gives this error:
Traceback (most recent call last):
  File "/proj/nobackup/metagenomics/pochonz/scripts/pmdtools.0.60.py", line 944, in <module>
    L_D = L_D * L_match(z,ancient_model_deam,qualsrev,options.polymorphism_ancient)
  File "/proj/nobackup/metagenomics/pochonz/scripts/pmdtools.0.60.py", line 148, in L_match
    P_error= phred2prob( (ord(fquals[fposition])-33))/3.0
KeyboardInterrupt

But this can be fixed by limiting the number of reads used to e.g. 100000 using the option --number for PMDtools.
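A sketch of the amended rule, assuming the pmdtools invocation seen in the PMD_scores log later on this page; the actual rule body may differ:

    rule PMDtools:
        params:
            number = 100000  # cap on the number of reads PMDtools processes
        shell:
            "samtools view -h {input.bam} | "
            "pmdtools --number {params.number} --printDS > {output}"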

sample names and fastqc

FastQC uses everything up to the fastq suffix as the sample name. Unless the prefix matches the sample name, the fastqc rules will fail with a missing output file exception. Unfortunately, this behaviour cannot be changed, according to a fastqc thread I read somewhere (can't find the link now). One solution could be to rename the directory, but I think one would also have to modify the contents of the fastqc output (not sure about this).

One sample can also be linked to multiple fastq files, which means 1) a runinfo file is needed (linking a sample to multiple fastq files) and 2) qc output must be treated at the run level, with bam files merged post-alignment, after which sample-level processing continues.

The Malt_QuantifyAbundance rule takes a long time to run

As @ZoePochon mentioned on slack, this rule sometimes takes days to run. It is a simple bash loop where tax IDs are grepped out of gzipped sam files. It may become impractical to run as the number of tax IDs grows.

Another issue we noted while looking at this: the grepping is done with zgrep \"|tax|$i\" ..., so that e.g. tax ID 123 will match |tax|123 but also |tax|12345 and so on.

I propose we redo this counting step in python, so that all tax IDs are counted at once while doing a single pass of the sam file (instead of, say, 100 grep operations when looking for 100 tax IDs)
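A minimal sketch of the single-pass counter, assuming reference names of the form ...|tax|<taxid>| in column 3 of the SAM records; note that parsing the field exactly also removes the |tax|123 vs |tax|12345 ambiguity for free:

    import gzip
    from collections import Counter

    def count_taxids(sam_gz):
        # One pass over a gzipped SAM file: count alignments per tax ID.
        counts = Counter()
        with gzip.open(sam_gz, "rt") as fh:
            for line in fh:
                if line.startswith("@"):      # skip header lines
                    continue
                ref = line.split("\t", 3)[2]  # reference name field
                taxid = ref.rstrip("|").split("|")[-1]
                counts[taxid] += 1
        return counts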

skip adapter removal step

I have had a few requests to add an option to skip the adapter removal step, because users often have fastq files with adapters already removed. I tried to persuade them to still run adapter removal and not skip it, since in my experience there are always adapter leftovers present, and removing adapters one more time does not hurt and can only help. I added this to the FAQ in the README, but we should consider adding a skip option anyway.

Erroneous combinations of taxid and refids

taxids and refids show up in wrong combinations. The culprit is the following code snippet:

for tid in taxid:
    wc = Wildcards(fromdict={"sample": wildcards.sample, "taxid": tid})
    _refid = get_ref_id(wc)
    if _refid is not None:
        refid.append(_refid)
if len(refid) > 0:
    res = expand(fmt, zip, sample=sample, taxid=taxid, refid=refid)
return res

In case _refid is None, the taxid and refid lists will be of unequal length, causing erroneous combinations.
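One possible fix, keeping the names from the snippet above, is to append the taxid only when a refid was found, so the two lists stay aligned:

    paired_taxid = []
    refid = []
    for tid in taxid:
        wc = Wildcards(fromdict={"sample": wildcards.sample, "taxid": tid})
        _refid = get_ref_id(wc)
        if _refid is not None:
            paired_taxid.append(tid)  # keep taxid and refid in lockstep
            refid.append(_refid)
    if len(refid) > 0:
        res = expand(fmt, zip, sample=sample, taxid=paired_taxid, refid=refid)
    return res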

add nextera adapter removal

I realized that by default we are only removing Illumina adapters. It would be good to add a rule that figures out from the FastQC report whether the adapters are Illumina or Nextera and, if Nextera, removes them with:

cutadapt -a CTGTCTCTTATA --minimum-length 30 -o sample.trimmed.fastq.gz sample.merged.fastq.gz -j 10
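A crude sketch of the detection part, assuming all we need to know is whether FastQC attributed any overrepresented sequence to Nextera (FastQC names the adapter family in the "Possible Source" column of fastqc_data.txt):

    def uses_nextera(fastqc_data_txt):
        # True if FastQC flagged any overrepresented sequence as Nextera.
        with open(fastqc_data_txt) as fh:
            return any("Nextera" in line for line in fh)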

Pathogen-only projects

Sometimes we just want to look for potential pathogens in side-projects where there is no specific interest in the microbiome, or where demographics is more the focus than microbes. In these cases it would be more efficient to create a list of roughly the 100 most interesting pathogens and only run MALT on individuals of interest, reducing core-hour and storage usage and making the pipeline faster.

Authentication might fail depending on MaltExtract output

I am looking into the authentication scripts and there is one step I am wondering about:

head -2 $OUT_DIR/${RMA6}_MaltExtract_output/default/readDist/*.rma6_additionalNodeEntries.txt | tail -1 | cut -d ';' -f2 | sed 's/'_'/''/1' > $OUT_DIR/name.list

If I understand correctly this should concatenate all additionalNodeEntries.txt files and pick from the last one the name of the reference sequence. This however fails at times, for example on Rackham there is this test:

ancient_microbiome_workflow/test_results/results_heavy_Pict_data/AUTHENTICATION/bal003

where for taxID 374840 the additionalNodeEntries file looks like this:

TargetNode	01	02	03	04	05	06	07	08	09	10
Enterobacteria_phage_phiX174_sensu_lato	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Here the reference ID is unavailable, so the head -2 ... operation gives unpredictable results and some authentication outputs are missing.

To address this issue, I wonder whether it is really necessary to extract the reference sequence ID in this case. Does the additionalNodeEntries file always contain only two lines? If so, it might be enough to use the taxID alone.
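If extracting the ID is kept, a more defensive version of the one-liner could look like the helper below. This is a sketch only: the file format is assumed from the phiX example above, and the underscore substitution mirrors the sed call:

    def get_ref_id(node_entries_path):
        # Read the single data line, take the TargetNode column, and return
        # the second ';'-separated field with its first '_' removed, or None
        # when the field is absent (as in the phiX example above).
        with open(node_entries_path) as fh:
            next(fh)                           # TargetNode header line
            target = next(fh).split("\t")[0]
        parts = target.split(";")
        return parts[1].replace("_", "", 1) if len(parts) > 1 else None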

A follow-up question concerns file naming, which I think is a bit convoluted. For example, in the previous case we have a path like:

AUTHENTICATION/bal003/374840/bal003.trimmed.rma6_MaltExtract_output/default/readDist/bal003.trimmed.rma6_additionalNodeEntries.txt

where the sample name bal003 is repeated three times in the file path. Would it not be enough to have the sample ID only in the first subfolder, like so:

AUTHENTICATION/bal003/374840/MaltExtract_output/default/readDist/additionalNodeEntries.txt

Some interesting microbes are ignored because they are labelled as "no rank" in the ncbi database

Human parvovirus B19, Influenza A virus, and many other viruses are classified as "no rank" in the NCBI database. This is problematic because we filter the KrakenUniq output to keep only the "species" level, de facto excluding the "no rank" viruses. I would suggest keeping "no rank" throughout the pipeline, because customizing such a large database is probably trickier (see the sketch after the links below). What do you think?

Human parvovirus B19:
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10798&lvl=3&lin=f&keep=1&srchmode=1&unlock
Influenza A virus:
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=382835&lvl=3&lin=f&keep=1&srchmode=1&unlock
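A sketch of the relaxed filter, assuming a tab-separated KrakenUniq report with the rank in the eighth column (as in the example reports elsewhere in this tracker):

    KEEP_RANKS = {"species", "no rank"}

    def keep_row(line):
        # Keep species-level rows and "no rank" rows (viruses such as
        # Human parvovirus B19 and Influenza A virus fall in the latter).
        fields = line.rstrip("\n").split("\t")
        return len(fields) >= 8 and fields[7] in KEEP_RANKS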

Make sure pathogens are detected for mapdamage to run

Currently no pathogens are detected, which has to do with the test data setup. Make sure there is output data for mapdamage to work on. One could possibly also rewrite the rule so that mapdamage writes empty output files and warns that no pathogens were found.

cannot extract refid

When executing runtest.sh of the workflow from the .test directory, I got two warnings:

No such file results/AUTHENTICATION/bar/632/bar.trimmed.rma6_MaltExtract_output/default/readDist/bar.trimmed.rma6_additionalNodeEntries.txt; cannot extract refid

No such file results/AUTHENTICATION/foo/632/foo.trimmed.rma6_MaltExtract_output/default/readDist/foo.trimmed.rma6_additionalNodeEntries.txt; cannot extract refid

They are not critical, but it would still be good to understand why they appear. The thing is that the *additionalNodeEntries.txt files actually exist, so I am not sure why this warning is issued. Otherwise everything works without problems.

envmodules configuration sections should follow conda environment files

Currently, there is a separate envmodules key for every rule, but a given conda environment file is shared between rules. Since both mechanisms solve the same problem, envmodules should follow the conda environment sharing. For instance, envs/malt.yaml is shared by four rules, each of which has a separate envmodules configuration.

Options or instructions for users with PE sequencing data

Right now the workflow (following aDNA tradition) is SE-oriented, but users typically have PE data. Merging of overlapping PE reads is not currently implemented (in Anders' project all fastq files had already been merged by the aDNA SciLifeLab facility) but perhaps should be, e.g. with fastp. Perhaps the whole Cutadapt step could be replaced with fastp?

checkpoint rules produce different amounts of output depending on target type

checkpoint rules behave somewhat erratically and seem to rerun targets when there should be no need. For instance, there are log files where rule Make_Node_List is rerun even though the output file already exists. It is triggered by the checkpoint action itself, presumably because the output is a directory. Changing the output to a file (e.g. dir/.done) eliminates this behaviour (see the sketch at the end of this issue).

As an example, here is a summary of rules and number of times they should be run from a recent test run on the authentication step where the input to Make_Node_List (output from checkpoint Extract_TaxIDs) is a directory:

Job stats:
job                         count    min threads    max threads
------------------------  -------  -------------  -------------
Authentication_Plots           17              1              1
Breadth_Of_Coverage             3              1              1
Deamination                    17              1              1
Make_Node_List                  5              1              1
Malt_Extract                    5              4              4
PMD_scores                     15              1              1
Post_Processing                33              4              4
Read_Length_Distribution       10              1              1
aggregate                       2              1              1
all                             1              1              1
total                         108              1              4

compared to where the input is a hidden file in that directory:

Job stats:
job                         count    min threads    max threads
------------------------  -------  -------------  -------------
Authentication_Plots           16              1              1
Deamination                    16              1              1
PMD_scores                     14              1              1
Post_Processing                32              4              4
Read_Length_Distribution        9              1              1
aggregate                       2              1              1
all                             1              1              1
total                          90              1              4

In the latter case, Make_Node_List is not triggered to rerun, which is the expected behaviour.
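A sketch of the flag-file workaround described above (the path is hypothetical):

    checkpoint Extract_TaxIDs:
        output:
            # A file inside the directory instead of directory(...):
            # with the directory output, the checkpoint retriggers
            # Make_Node_List even when its output already exists.
            touch("results/AUTHENTICATION/{sample}/.done")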

Malt_quantify_abundance rule broken

With the new version of the pipeline, the rule is broken and doesn't produce any output.
I tried running it in the shell, but it wouldn't work because the output folder didn't exist.
TJ noticed another problem, with the command:
taxid = split_read[2].split("|")[-1]
This caused all read counts to be 0, due to an extra "|" at the end of the split_read[2] string.
His fix was to change it to taxid = split_read[2][:-1].split("|")[-1]

No validator found for JSON Schema version identifier

When I try to run the command:

./runtest.sh -j 20

I obtain:

No validator found for JSON Schema version identifier 'http://json-schema.org/draft/2020-12/schema#'
Defaulting to validator for JSON Schema version 'https://json-schema.org/draft/2020-12/schema'
Note that schema file may not be validated correctly.
Excluding samples 'foobar' from analysis
Restricting analysis to samples 'foo','bar'
No validator found for JSON Schema version identifier 'http://json-schema.org/draft/2020-12/schema#'
Defaulting to validator for JSON Schema version 'https://json-schema.org/draft/2020-12/schema'
Note that schema file may not be validated correctly.
Changing directory from /net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/.test to /net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/workflow
Changing directory back to /net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/.test
Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
/net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/.test/results/KRAKENUNIQ_ABUNDANCE_MATRIX
/net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/.test/results/KRAKENUNIQ_ABUNDANCE_MATRIX/krakenuniq_absolute_abundance_heatmap.pdf

Do you have any idea how to resolve it?

Option to choose between filtering by TaxReads or Reads

We are currently filtering by TaxReads, and most of the time Reads and TaxReads have quite similar values. Sometimes, however, a lot of reads are assigned more specifically to a lower level called "sequence" in KrakenUniq. This is often the case with viruses, where the genotype can differ considerably from the complete reference genomes used at the "species" level.
We would need an option to choose to filter by Reads instead of TaxReads. Ultimately the filtering is there to ensure enough aligned reads for the deamination pattern, but that can still be achieved if we align to the right reference genome by hand later on (see the sketch after the example below).

Here is an example (without the real virus name, sorry):
Percentage  Reads  TaxReads  Kmers  Dup   Coverage  TaxID       Rank      Taxon Name
9.799e-05   522    64        1339   5.68  0.00129   10376       species   Mysterious virus
7.79e-05    415    415       30     151   0.04601   1004844516  sequence  Mysterious virus isolate 2, partial genome
4.693e-06   25     25        29     13.8  0.03448   1004844718  sequence  Mysterious virus isolate 3, partial genome
1.502e-06   8      8         32     6.69  0.08312   1004842593  sequence  Mysterious virus, cell line 5
1.126e-06   6      6         16     5.25  0.1103    1004844670  sequence  Mysterious virus isolate 4, partial genome
3.754e-07   2      2         31     1.55  0.08052   1004844691  sequence  Mysterious virus isolate 5, partial genome
3.754e-07   2      2         21     2     0.01246   1004844628  sequence  Mysterious virus isolate 6, partial genome
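A sketch of such a switch, assuming the column layout above and a hypothetical config key selecting the column:

    READS_COL, TAXREADS_COL = 1, 2   # 0-based indices in the report above

    def count_for_filtering(fields, filter_on="TaxReads"):
        # filter_on would come from the config, e.g. config["krakenuniq_filter"]
        col = READS_COL if filter_on == "Reads" else TAXREADS_COL
        return int(fields[col])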

Remove trailing sam files

Rule Breadth_Of_Coverage generates both bam and sam output with redundant information, AFAICT. Since the sam format is not very space efficient, the sam output should be removed at some stage.
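A sketch using Snakemake's temp() marker, which deletes the file once all consuming jobs have finished (the paths are hypothetical):

    rule Breadth_Of_Coverage:
        output:
            bam = "results/AUTHENTICATION/{sample}/{taxid}/{refid}.sorted.bam",
            # temp() removes the redundant sam once downstream jobs are done
            sam = temp("results/AUTHENTICATION/{sample}/{taxid}/{refid}.sam"),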

Move module configuration to a separate config file

As I adapt the workflow to Kebnekaise, I can see that the config file is becoming quite long because of the envmodules section:

envmodules:
  FastQC_BeforeTrimming:
    - FastQC/0.11.9-Java-11
  FastQC_AfterTrimming:
    - FastQC/0.11.9-Java-11
  MultiQC_BeforeTrimming:
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - MultiQC/1.10.1
  MultiQC_AfterTrimming:
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - MultiQC/1.10.1
  MultiQC:
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - MultiQC/1.10.1      
  Cutadapt_Adapter_Trimming:
    - GCCcore/10.2.0
    - cutadapt/3.4
  Bowtie2_Index:
    - GCC/10.2.0
    - Bowtie2/2.4.2
  Bowtie2_Pathogenome_Alignment:
    - GCC/10.2.0
    - Bowtie2/2.4.2
    - SAMtools/1.12
  MapDamage:
    - GCCcore/10.2.0    
    - parallel/20210322
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - R/4.0.4
    - mapDamage/2.2.1-R-4.0.4
    - SAMtools/1.12
  KrakenUniq:
    - GCC/10.2.0
    - KrakenUniq/0.6
  Filter_KrakenUniq_Output:
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - R/4.0.4
  KrakenUniq2Krona:
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - R/4.0.4
    - GCCcore/10.2.0
    - KronaTools/2.8
  KrakenUniq_AbundanceMatrix:
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - R/4.0.4
  Build_Malt_DB:
    - GCC/10.2.0
    - seqtk/1.3
  Malt_AbundanceMatrix:
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - R/4.0.4
  Authentication:
    - GCC/10.2.0
    - OpenMPI/4.0.5
    - R/4.0.4
    - GCC/10.2.0
    - seqtk/1.3
    - SAMtools/1.12

I believe this part of the configuration should be moved to another YAML file, so that the same project-specific config file can be used in either environment (e.g. UPPMAX or Kebnekaise) without changes, while the new system-specific YAML can be written once and then "forgotten".
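A minimal sketch of the split, assuming a system-specific file config/envmodules.yaml (a hypothetical path) that is merged into the main config when present:

    import os
    import yaml
    from snakemake.utils import update_config

    configfile: "config/config.yaml"

    _envmodules = "config/envmodules.yaml"   # hypothetical system-specific file
    if os.path.exists(_envmodules):
        with open(_envmodules) as fh:
            update_config(config, yaml.safe_load(fh))

Recent Snakemake versions also accept multiple --configfile arguments on the command line, with later files taking precedence, which might achieve the same thing without any code change.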

Java heap space error on MaltExtract

I've gotten a java heap space error with MaltExtract. We need a way to easily configure memory usage. Maybe add an -Xmx flag to the MaltExtract rule?
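One hedged option: expose the memory through resources and hand it to the JVM via the standard JAVA_TOOL_OPTIONS environment variable, which any launched JVM picks up (the MaltExtract flags shown are illustrative):

    rule Malt_Extract:
        resources:
            mem_mb = 16000                      # site-configurable heap size
        shell:
            # JAVA_TOOL_OPTIONS is read by the JVM that MaltExtract starts
            "JAVA_TOOL_OPTIONS=-Xmx{resources.mem_mb}m "
            "MaltExtract -i {input.rma6} -o {output.dir}"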

A rule is using a pathogen-specific file for filtering, which should not be the case

Create_Sample_TaxID_Directories creates directories only for potential pathogens, using the taxID.pathogens file instead of a file containing all taxIDs from the krakenuniq.output.filtered file.

Explanation of the problem:
However, since aMeta performs all the heavy steps with all organisms, I think it is suboptimal to generate the final graphs only for potential pathogens instead of all organisms. It is the MALT step that takes a lot of memory and time, not the subsequent steps. We decided to keep all organisms for the MALT alignment to reduce misalignments, which makes sense, but then we should continue to use all organisms for the final authentication steps. That would allow the pipeline to be used for all metagenomic analyses, not just pathogen analyses. There is a lot of demand for general metagenomics, and I have told people that aMeta could be used for all organisms because that is how I had used it with bash scripts before. But when I looked closely at the rules, I realized this problem. It needs to be fixed quickly because people are already testing the pipeline for general metagenomics.

Suggestion for fixing it:

  • It's okay to continue creating the file krakenuniq.output.pathogens.
  • Nevertheless, the Filter_KrakenUniq_Output step should also create a file called taxID.organisms containing all the taxIDs of organisms present in krakenuniq.output.filtered.
  • Then the checkpoint Create_Sample_TaxID_Directories should create directories for each organism for each sample, using this taxID.organisms file instead of the taxID.pathogens file (see the sketch after this list).
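A sketch of the extra output, assuming taxID is the seventh tab-separated column of krakenuniq.output.filtered:

    # Write taxID.organisms with every taxID present in the filtered report.
    with open("krakenuniq.output.filtered") as fh, \
            open("taxID.organisms", "w") as out:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 7 and fields[6].isdigit():
                out.write(fields[6] + "\n")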

seqtk subseq is slow on large databases

In the authentication pipeline we have:
seqtk subseq {params.malt_fasta} {output.name_list} > {output.fasta}

This will take a while (I stopped the test after a few minutes) when running on a full-blown DB, and would be repeated for all tax IDs found for each sample.

The help for seqtk subseq mentions:

Note: Use 'samtools faidx' if only a few regions are intended.

So maybe we could do just that? In which case we would need to add the constraint that the malt_fasta DB is indexed.
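A sketch of that route via pysam (which wraps the same faidx machinery), assuming the MALT FASTA has been indexed with samtools faidx; names are illustrative:

    import pysam

    def extract_sequences(fasta, names, out_path):
        # Fetch only the named sequences from an indexed FASTA instead of
        # streaming the whole database through seqtk subseq.
        with pysam.FastaFile(fasta) as fa, open(out_path, "w") as out:
            for name in names:
                out.write(f">{name}\n{fa.fetch(name)}\n")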

Deamination takes too much time for large samples

If I am not mistaken, the Deamination rule solely creates a deamination plot. However, it can take more than 24h to run for large samples.
I think we need to reintroduce the "number" option in the pmdtools command of this rule.

Suggestion: --number 2000000 should be sufficient to create a smooth plot, according to colleagues.

Error: duplicates in sam-header when running Bowtie2 on full NT

When following up KrakenUniq hits with Bowtie2 on full NT rather than the PathoGenome (the default), an error arises from duplicate records in the sam header. One needs to grep out the duplicates, clean the header, and replace it in the sam file. I have the code that does this and will push a commit that fixes the issue soon.
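In the meantime, a sketch of one way to clean the header: stream the SAM through a filter that drops duplicate @SQ lines, e.g. samtools view -h in.bam | python dedup_header.py | samtools view -b -o out.bam - (script name is hypothetical):

    import sys

    seen = set()
    for line in sys.stdin:
        if line.startswith("@SQ"):
            if line in seen:
                continue          # drop the duplicate reference record
            seen.add(line)
        sys.stdout.write(line)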

Update workflow report

Currently the workflow report consists of the workflow rulegraph and the default content (statistics, configuration). Things to add could be:

  • krakenuniq summary
  • authentication results
  • ...

Usage of default MALT parameters

Currently the workflow is set up to run MALT with default parameters, whereas during testing the parameters were changed to match those used in HOPS, etc. Since this may cause results to differ, it should be examined more closely. It could be that the HOPS parameters are the better choice.

It may be better for the MALT rule to state its parameters explicitly. Or is there a way to change or submit new parameters through the config file? That would make testing different parameters easier.
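A sketch of the config route, with a hypothetical key and an empty default that preserves current behaviour (the actual MALT invocation in the workflow will differ):

    rule Malt:
        params:
            extra = config.get("malt_options", "")   # e.g. HOPS-style flags
        shell:
            "malt-run {params.extra} --inFile {input.fastq} "
            "--index {input.db} --output {output}"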

Use rule names verbatim in messages

Resources are assigned to rules by using the rule name verbatim in combination with a specific resource. To avoid confusion, I propose we use verbatim rule names in the messages. For example, change

"QUANTIFYING MICROBIAL ABUNDANCE USING MALT SAM-ALIGNMENTS FOR SAMPLE {input.sam}"

to something like

"Malt_QuantifyAbundance: QUANTIFYING MICROBIAL ABUNDANCE USING MALT SAM-ALIGNMENTS FOR SAMPLE {input.sam}"

Move github workflow test tweaks to separate installation script

There are currently three post-installation configuration steps that are required to make github actions work:

  1. building of krakendb database
  2. building of krona taxonomy
  3. reducing malt runtime memory

These steps should be placed in a separate post-(test-)installation script, as they are also required for local tests but are currently not implemented there.

Impossible to give more threads to the rule "KrakenUniq2Krona"

I have noticed that whatever I change in the config file, I cannot allocate more than one cpu to the rule KrakenUniq2Krona. I suspect this is because there is another rule called KrakenUniq. I am not sure why otherwise, because it is the only rule for which I have this problem.

Collect more metrics in multiqc

In addition to fastqc, multiqc also supports cutadapt and bowtie2, so add these to the multiqc inputs. Also, should before/after trimming results be collected in one output?

Add pysam to the workflow/envs/mapdamage.yaml

As MinLuke reported in issue #131, newly created conda environments don't work on the mapDamage step because they don't find pysam.

A quick fix is to add pysam in the workflow/envs/mapdamage.yaml like this:

channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - htslib<1.15.1
  - mapdamage2
  - parallel
  - samtools
  - pysam

This worked for him and for me.

mapdamage rule fails when no pathogen present

If no pathogenic microbes are listed in results/KRAKENUNIQ/{sample}/taxID.pathogens, the MapDamage rule fails. Perhaps a simple if-statement would help, skipping the rule entirely when taxID.pathogens is empty? Like this:

shell:
    "mkdir {output.dir}; "
    "if [ -s {input.pathogen_tax_id} ]; then "
    'cat {input.pathogen_tax_id} | parallel -j {threads} "grep -w {{}} {params.pathogenome_path}/seqid2taxid.pathogen.map | cut -f1 > {output.dir}/{{}}.seq_ids" ; '
    "for i in $(cat {input.pathogen_tax_id}); do xargs --arg-file={output.dir}/${{i}}.seq_ids samtools view -bh {input.bam} --write-index -@ {threads} -o {output.dir}/${{i}}.tax.bam; done >> {log} 2>&1; "
    "find {output.dir} -name '*.tax.bam' | parallel -j {threads} \"mapDamage {params.options} -i {{}} -r {params.PATHO_DB} --merge-reference-sequences -d {output.dir}/mapDamage_{{}}\" >> {log} 2>&1 || true; "
    "for filename in {output.dir}/*.tax.bam; do newname=`echo $filename | sed 's/tax\.//g'`; mv $filename $newname; done >> {log} 2>&1; "
    "mv {output.dir}/mapDamage_{output.dir}/* {output.dir} >> {log} 2>&1; "
    "rm -r {output.dir}/mapDamage_results >> {log} 2>&1; "
    "else echo NO MICROBES TO AUTHENTICATE > {output.dir}/README.txt; fi"

problem with MapDamage

Hello everyone,
I am currently running aMeta on some samples, but I get an error message when it starts computing with MapDamage.
The problem seems to be that it cannot import the pysam module, yet if I open python and run import pysam, it works correctly.
I attach the log file here, in case someone has the solution:

[main_samview] region "KY933470.1" specifies an invalid region or unknown reference. Continue anyway.
[main_samview] region "KC118516.1" specifies an invalid region or unknown reference. Continue anyway.
Error: Could not import required module 'pysam':
	- libhts.so.2: cannot open shared object file: No such file or directory
       If module is not installed, please download from 'http://code.google.com/p/pysam/'.
       A local install may be performed using the following command:
       $ python setup.py install --user

Thank you

Rule PMD_score failing with no error message

The rule PMD_scores sometimes fails with no error message.

Log file :

[Wed Jul 20 18:08:30 2022]
Job 0: PMD_scores: COMPUTING PMD SCORES

Activating environment modules: GCC/10.2.0, OpenMPI/4.0.5, R/4.0.4, seqtk/1.3, SAMtools/1.12, MALT/0.5.3, HOPS/0.33, MEGAN/6.21.18, PMDtools/0.60
The following modules were not unloaded:
(Use "module --force purge" to unload all):

1) snicenvironment  2) systemdefault

[Wed Jul 20 18:09:02 2022]
Error in rule PMD_scores:
    jobid: 0
    output: results/AUTHENTICATION/Gok2cop_CGATGT_L007_merged_001.130313_SN344_0244_AC1U02ACXX/305/LN899819.1/305.PMDscores.txt
    log: logs/PMD_SCORES/Gok2cop_CGATGT_L007_merged_001.130313_SN344_0244_AC1U02ACXX_305_LN899819.1.log (check log file(s) for error message)
    shell:
        samtools view -h results/AUTHENTICATION/Gok2cop_CGATGT_L007_merged_001.130313_SN344_0244_AC1U02ACXX/305/LN899819.1/305.sorted.bam | pmdtools --number 100000 --printDS > results/AUTHENTICATION/Gok2cop_CGATGT_L007_merged_001.130313_SN344_0244_AC1U02ACXX/305/LN899819.1/305.PMDscores.txt
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job PMD_scores since they might be corrupted:
results/AUTHENTICATION/Gok2cop_CGATGT_L007_merged_001.130313_SN344_0244_AC1U02ACXX/305/LN899819.1/305.PMDscores.txt
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Post_Processing rule fails due to missing input data?

Sometimes rule Post_Processing fails with the following error message:

Read 1 item
Error in image.default(x = 1:ncol(red.res), y = 1:nrow(red.res), z = t(red.res),  : 
  increasing 'x' and 'y' values expected
image.default
Execution halted

There are no downstream rules that depend on Post_Processing output so better error handling would let the workflow run to completion, which is preferred.

krakenuniq-build v.0.7.3 cannot find jellyfish

When we switched from KrakenUniq v0.5.8 to v0.7.3, the new version could no longer automatically detect jellyfish:

(ancient_microbiome_workflow) nikolay@dell:~/WABI/A_Gotherstrom/ancient-microbiome-smk/.test$ krakenuniq-build --db resources/KrakenUniq_DB --kmer-len 21 --minimizer-len 11
Kraken build set to minimize disk writes.
Finding all library files
Found 1 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using /home/nikolay/miniconda3/envs/ancient_microbiome_workflow/libexec/jellyfish-install/bin/jellyfish
Hash size not specified, using '9951131'
/home/nikolay/miniconda3/envs/ancient_microbiome_workflow/libexec/build_db.sh: line 46: /home/nikolay/miniconda3/envs/ancient_microbiome_workflow/libexec/jellyfish-install/bin/jellyfish: No such file or directory
xargs: cat: terminated by signal 13

Authentication plots pdf files are too heavy

The authentication plot pdf files are so heavy that they make XQuartz or Slack fail when trying to open them, which makes them hard to share. Maybe we could also produce a jpg version that requires less space? I suppose they take up quite some disk space as well.

MapDamage is only applied to pathogens

As in the title, the rule MapDamage is only applied to pathogens. This is less problematic because it is a side branch of the pipeline for fast analysis, but if we want the pipeline to be used for metagenomics of all samples we need to be more inclusive here.

RScript fails with one sample

Hi all,
I've since fixed this by including a second sample, but as an FYI the Rscript "malt_abundance_matrix.R" fails when running the aMeta pipeline with only one sample:

Error in rule Malt_AbundanceMatrix_Sam:
    jobid: 11
    input: results/MALT_QUANTIFY_ABUNDANCE/S30455/sam_counts.txt
    output: results/MALT_ABUNDANCE_MATRIX_SAM, results/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt
    log: logs/MALT_ABUNDANCE_MATRIX_SAM/MALT_ABUNDANCE_MATRIX_SAM.log (check log file(s) for error message)
    conda-env: /scratch/06909/cbscott/aMeta/.snakemake/conda/7ce3141b3314e1a9f5b4e35c7326cac1_
    shell:
        Rscript /scratch/06909/cbscott/aMeta/workflow/scripts/malt_abundance_matrix.R results/MALT_QUANTIFY_ABUNDANCE results/MALT_ABUNDANCE_MATRIX_SAM &> logs/MALT_ABUNDANCE_MATRIX_SAM/MALT_ABUNDANCE_MATRIX_SAM.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Log error message:

Error in `colnames<-`(`*tmp*`, value = "S30455") :
  attempt to set 'colnames' on an object with less than two dimensions
Execution halted

Specifically, in the Rscript:

    merged <- Reduce(cbind, df)
    colnames(merged) <- gsub(".sam_counts", "", files)

With a single sample, Reduce(cbind, df) returns a vector rather than a data frame, which produces the error.

Warning regarding path for log file

It seems that this command in the rule Bowtie2_Index in align.smk
log:
f"logs/BOWTIE2_BUILD/{config['bowtie2_patho_db']}.log",

always generates the same warning when running the pipeline:
File path logs/BOWTIE2_BUILD//proj/nobackup/metagenomics/databases/PathoGenome/library.pathogen.fna.log contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
File path logs/BOWTIE2_BUILD//proj/nobackup/metagenomics/databases/PathoGenome/library.pathogen.fna.log contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.

This is what I have in my config file:
bowtie2_patho_db: /proj/nobackup/metagenomics/databases/PathoGenome/library.pathogen.fna

This would be easily fixable by getting rid of the "/" in the command itself, but I also think the generated path is just too long for a log file, no?
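A sketch of the fix, logging under the database's basename rather than its full path (keeping the f-string style of the original):

    import os

    rule Bowtie2_Index:
        log:
            f"logs/BOWTIE2_BUILD/{os.path.basename(config['bowtie2_patho_db'])}.log",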

Need to add threads: 1 to all rules as a default

Sometimes a one-threaded rule fails because it only uses one core and needs more memory than that core provides (KrakenUniq2Krona, aggregate, Deamination, etc.). To be able to give more memory to such a rule, we need to be able to give it more cores, even if it will not use several threads. For that, the rule needs at least this definition in its Snakemake file:

    threads: 1

This trick makes it possible; otherwise the rule persists in using one core, whatever we write in the slurm profile. This could be an easy fix by adding the specification to all rules. (This was suggested by TJ and is very practical.)

The authentication plots are generated from different MALT outputs, so the read counts do not correspond

In the authentication plot pdf file, some plots are generated from the rma6 files and some from the sam file. This is not ideal, because the sam file contains more reads, which affects the Histogram of PMD scores and the Read length distribution plots. I understand that it could make sense for the Breadth of coverage plot, though, because you want not only regions unique to the species to be covered but also regions conserved across species, I suppose.

Maybe it doesn't need to be changed, but there should be some way to see on the plot which panels are made from which input, like a colour scheme or blocks. There should also be an explanation of the difference between the inputs and why it makes sense that they contain more or fewer reads. Just an idea on how to help people understand it.
