nbisweden / ameta Goto Github PK
View Code? Open in Web Editor NEWAncient microbiome snakemake workflow
License: MIT License
Ancient microbiome snakemake workflow
License: MIT License
Configuration schema defines section samples: include/exclude, but as of yet no action is taken for these settings. The configuration keys consist of sample ids and would be used to subset samples from samplesheet to restrict analyses to selected samples.
In scripts/authentic.sh
the number of threads is hardcoded as 4, guess this should be passed as parameter from the -j
option of the Snakemake invocation?
We need to include the full pathogensFound.very_inclusive.tab file to aMeta/resources. The file is not heavy and can be downloaded from https://doi.org/10.17044/scilifelab.21185887. Currently, there is a short version of that file in ./test/resources, but we need to provide the whole list of pathogens for the users for them to run it on real projects
If there are too many reads, the PMDtools rule gives this error:
Traceback (most recent call last):
File "/proj/nobackup/metagenomics/pochonz/scripts/pmdtools.0.60.py", line 944, in
L_D = L_D * L_match(z,ancient_model_deam,qualsrev,options.polymorphism_ancient)
File "/proj/nobackup/metagenomics/pochonz/scripts/pmdtools.0.60.py", line 148, in L_
match
P_error= phred2prob( (ord(fquals[fposition])-33))/3.0
KeyboardInterrupt
But this can be fixed by limiting the number of reads used to e.g. 100000 using the option --number for PMDtools.
fastqc uses everything up to the fastq suffix as sample name. Unless the prefix matches the sample name, the fastqc rules will fail with a missing output file exception. Unfortunately, this behaviour cannot be changed according to a fastqc thread I read somewhere (can't find the link now). One solution could be to rename the directory, but I think one also would have to modify the contents of the fastqc output (not sure about this).
One sample can also be linked to multiple fastq files, which means 1) there is a need for a runinfo file (linking a sample to multiple fastq files) and 2) a treatment of qc output at the run level, merging bam files post-alignment after which sample-level processing continues.
As @ZoePochon mentioned on slack, this rule takes sometimes day to run. It is a simple bash loop where tax IDs are grepped out of gzipped sam files. It might become unpractical to run as the number of tax IDs grows.
Another issue we noted while looking at this, the grepping is done with zgrep \"|tax|$i\" ...
which will make it so that, e.g. tax ID 123 will match |tax|123
but also |tax|12345
and so on.
I propose we redo this counting step in python, so that all tax IDs are counted at once while doing a single pass of the sam file (instead of, say, 100 grep operations when looking for 100 tax IDs)
I got a few requests on adding an option to skip adapter removal step because users typically have fastq-files with adapters already removed. I tried to persuade them to still run the adapter removal and not skip it because from my experience there are always adapter left overs present and removing adapters one more time does not hurt but is only beneficial. I added this to FAQ in the README, but we should think about adding the skipping adapter removal option anyway.
taxid and refids show up in wrong combinations. The culprit is the following code snippet:
for tid in taxid:
wc = Wildcards(fromdict={"sample": wildcards.sample, "taxid": tid})
_refid = get_ref_id(wc)
if _refid is not None:
refid.append(_refid)
if len(refid) > 0:
res = expand(fmt, zip, sample=sample, taxid=taxid, refid=refid)
return res
In case _refid is None, the taxid and refid lists will be of unequal length, causing erroneous combinations.
All these scripts perform a filtering step on krakenuniq.output
where hits are filtered by breadth/depth of coverage, the parameters are the same. It would be good to isolate this part of the processing to a separate, common script for better maintainability
I realized that we are by default removing only illumina adapters. Would be good to add a rule that can figure out from the FastQC report whether it is illumina or nextera adapters and if it is nextera then it removes the adapters with:
cutadapt -a CTGTCTCTTATA --minimum-length 30 -o sample.trimmed.fastq.gz sample.merged.fastq.gz -j 10
Sometimes we just want to look for potential pathogens for side-projects where there is no specific interest in the microbiome or when demographics is more the focus than microbes. In this cases it would be more efficient if we could create a list of about 100 most interesting pathogens and only run MALT on individuals of interest to reduce the core-hours use and the storage usage and make the pipeline faster.
I am looking into the authentication scripts and there is one step I am wondering about:
head -2 $OUT_DIR/${RMA6}_MaltExtract_output/default/readDist/*.rma6_additionalNodeEntries.txt | tail -1 | cut -d ';' -f2 | sed 's/'_'/''/1' > $OUT_DIR/name.list
If I understand correctly this should concatenate all additionalNodeEntries.txt
files and pick from the last one the name of the reference sequence. This however fails at times, for example on Rackham there is this test:
ancient_microbiome_workflow/test_results/results_heavy_Pict_data/AUTHENTICATION/bal003
where for taxID 374840 the additionalNodeEntries
file looks like this:
TargetNode 01 02 03 04 05 06 07 08 09 10
Enterobacteria_phage_phiX174_sensu_lato NA NA NA NA NA NA NA NA NA NA
where the reference ID is unavailable, then the head -2 ...
operation gives unpredictable results and some authentication outputs are missing.
In order to address this issue, I am wondering whether it is really necessary to extract the reference sequence ID in this case? Is the additionalNodeEntries
file always containing only two lines? In that case it might be enough to use the taxID alone.
A follow-up question on this is actually about file naming, which is a bit convoluted I think. For example in the previous case we have a folder called:
AUTHENTICATION/bal003/374840/bal003.trimmed.rma6_MaltExtract_output/default/readDist/bal003.trimmed.rma6_additionalNodeEntries.txt
Where the name of the sample bal003
is repeated three times in the filepath. Would it not be enough to have the sampleID name only in the first subfolder, like so:
AUTHENTICATION/bal003/374840/MaltExtract_output/default/readDist/additionalNodeEntries.txt
Human parvovirus B19, Influenza A virus, and many viruses are classified as "no rank" in the NCBI database. This is problematic because we filter the KrakenUniq output to only keep the "species" level, excluding de facto the viruses from "no rank". I would suggest to keep "no rank" in the pipeline all along because I imagine it to be trickier to custom such a large database. What do you think ?
Human parvovirus B19:
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10798&lvl=3&lin=f&keep=1&srchmode=1&unlock
Influenza A virus:
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=382835&lvl=3&lin=f&keep=1&srchmode=1&unlock
Currently no pathogens are detected, which has to do with test data setup. Make sure there is output data for mapdamage to work on. One could possibly also rewrite the rule such that mapdamage will write empty output files and warn that no pathogens were found.
When executing runtest.sh of the workflow from the .test directory, I got two warnings:
No such file results/AUTHENTICATION/bar/632/bar.trimmed.rma6_MaltExtract_output/default/readDist/bar.trimmed.rma6_additionalNodeEntries.txt; cannot extract refid
No such file results/AUTHENTICATION/foo/632/foo.trimmed.rma6_MaltExtract_output/default/readDist/foo.trimmed.rma6_additionalNodeEntries.txt; cannot extract refid
they are not critical but still would be good to understand why. The thing is that the "*additionalNodeEntries.txt" files actually exist. So I am not sure why this warning. Otherwise all the rest works without problems.
Currently, there is a separate envmodules key for every rule, but a given conda environment file is shared between rules. Since both keywords solve the same problem, envmodules should be shared between rules, following the conda environment sharing. For instance, envs/malt.yaml
is shared between four rules, each of which has a separate envmodules config.
MaltExtract fails when setting the option -r
to a custom ncbi.map. If missing, MaltExtract will download the required data automatically.
Right now the workflow (due to aDNA traditions) is SE oriented. Typically users have PE data. The step of merging overlapping PE reads is not currently implemented (because in Anders' project all fastq-files were already merged by the aDNA SciLifeLab facility) but perhaps should be with e.g. fastp. Perhaps the whole Cutadapt step can be replaced with fastp?
checkpoint rules behave somewhat erratically and seem to rerun targets when there should be no need. For instance, there are log files where rule Make_Node_List is being rerun, even though the output file already exists. It is triggered by the checkpoint action itself, presumably because the output is a directory. Changing the output to a file (e.g. dir/.done
) eliminates this behaviour.
As an example, here is a summary of rules and number of times they should be run from a recent test run on the authentication step where the input to Make_Node_List (output from checkpoint Extract_TaxIDs) is a directory:
Job stats:
job count min threads max threads
------------------------ ------- ------------- -------------
Authentication_Plots 17 1 1
Breadth_Of_Coverage 3 1 1
Deamination 17 1 1
Make_Node_List 5 1 1
Malt_Extract 5 4 4
PMD_scores 15 1 1
Post_Processing 33 4 4
Read_Length_Distribution 10 1 1
aggregate 2 1 1
all 1 1 1
total 108 1 4
compared to where the input is a hidden file in that directory:
Job stats:
job count min threads max threads
------------------------ ------- ------------- -------------
Authentication_Plots 16 1 1
Deamination 16 1 1
PMD_scores 14 1 1
Post_Processing 32 4 4
Read_Length_Distribution 9 1 1
aggregate 2 1 1
all 1 1 1
total 90 1 4
In the latter case, Make_Node_List is not triggered to rerun, which is the expected behaviour.
With the new version of the pipeline, the rule is broken and doesn't give any output.
Tried running it on the shell but it wouldn't work because the output folder didn't exist.
TJ noticed another problem with the command:
taxid = split_read[2].split("|")[-1]
This caused all read counts to be 0, due to an extra "|" at the end of the split_read[2] string.
His fix was to change it to taxid = split_read[2][:-1].split("|")[-1]
When I try to run command:
./runtest.sh -j 20
I obtain:
No validator found for JSON Schema version identifier 'http://json-schema.org/draft/2020-12/schema#'
Defaulting to validator for JSON Schema version 'https://json-schema.org/draft/2020-12/schema'
Note that schema file may not be validated correctly.
Excluding samples 'foobar' from analysis
Restricting analysis to samples 'foo','bar'
No validator found for JSON Schema version identifier 'http://json-schema.org/draft/2020-12/schema#'
Defaulting to validator for JSON Schema version 'https://json-schema.org/draft/2020-12/schema'
Note that schema file may not be validated correctly.
Changing directory from /net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/.test to /net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/workflow
Changing directory back to /net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/.test
Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
/net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/.test/results/KRAKENUNIQ_ABUNDANCE_MATRIX
/net/pr2/projects/plgrid/plggadna/plgpioroz/aMeta/.test/results/KRAKENUNIQ_ABUNDANCE_MATRIX/krakenuniq_absolute_abundance_heatmap.pdf
Do you have any idea how to resolve it?
We are currently filtering by TaxReads, and most of the time Reads and TaxReads can have quite similar values. But sometimes, a lot of reads are assigned more specifically to a lower level called "sequence" in KrakenUniq. It is a lot the case with viruses where the genotype can be quite different than the complete reference genomes used as a reference at the "species" level.
We would need to create an option where we can choose to filter by Reads instead of TaxReads. In the end that filtering is necessary to have enough reads aligned for the deamination pattern but this can be the case if we align to the right reference genome by hand later on.
Here is an example (without the real virus name, sorry):
Percentage Reads TaxReads Kmers Dup Coverage Taxid Taxon Name
9.799e-05 522 64 1339 5.68 0.00129 10376 species Mysterious virus
7.79e-05 415 415 30 151 0.04601 1004844516 sequence Mysterious virus isolate 2, partial genome
4.693e-06 25 25 29 13.8 0.03448 1004844718 sequence Mysterious virus isolate 3, partial genome
1.502e-06 8 8 32 6.69 0.08312 1004842593 sequence Mysterious virus, cell line 5
1.126e-06 6 6 16 5.25 0.1103 1004844670 sequence Mysterious virus isolate 4, partial genome
3.754e-07 2 2 31 1.55 0.08052 1004844691 sequence Mysterious virus isolate 5, partial genome
3.754e-07 2 2 21 2 0.01246 1004844628 sequence Mysterious virus isolate 6, partial genome
Rule Breadth_Of_Coverage generates both bam and sam output with redundant information AFAICT. Since sam format is not very space efficient, this output should be removed at some stage.
As I adapt the workflow to Kebnekaise, I can see that the config file is becoming quite long because of the envmodules
section:
envmodules:
FastQC_BeforeTrimming:
- FastQC/0.11.9-Java-11
FastQC_AfterTrimming:
- FastQC/0.11.9-Java-11
MultiQC_BeforeTrimming:
- GCC/10.2.0
- OpenMPI/4.0.5
- MultiQC/1.10.1
MultiQC_AfterTrimming:
- GCC/10.2.0
- OpenMPI/4.0.5
- MultiQC/1.10.1
MultiQC:
- GCC/10.2.0
- OpenMPI/4.0.5
- MultiQC/1.10.1
Cutadapt_Adapter_Trimming:
- GCCcore/10.2.0
- cutadapt/3.4
Bowtie2_Index:
- GCC/10.2.0
- Bowtie2/2.4.2
Bowtie2_Pathogenome_Alignment:
- GCC/10.2.0
- Bowtie2/2.4.2
- SAMtools/1.12
MapDamage:
- GCCcore/10.2.0
- parallel/20210322
- GCC/10.2.0
- OpenMPI/4.0.5
- R/4.0.4
- mapDamage/2.2.1-R-4.0.4
- SAMtools/1.12
KrakenUniq:
- GCC/10.2.0
- KrakenUniq/0.6
Filter_KrakenUniq_Output:
- GCC/10.2.0
- OpenMPI/4.0.5
- R/4.0.4
KrakenUniq2Krona:
- GCC/10.2.0
- OpenMPI/4.0.5
- R/4.0.4
- GCCcore/10.2.0
- KronaTools/2.8
KrakenUniq_AbundanceMatrix:
- GCC/10.2.0
- OpenMPI/4.0.5
- R/4.0.4
Build_Malt_DB:
- GCC/10.2.0
- seqtk/1.3
Malt_AbundanceMatrix:
- GCC/10.2.0
- OpenMPI/4.0.5
- R/4.0.4
Authentication:
- GCC/10.2.0
- OpenMPI/4.0.5
- R/4.0.4
- GCC/10.2.0
- seqtk/1.3
- SAMtools/1.12
I believe that this part of the configuration should be moved to an other YAML file so that the same project-specific config file can be used in either environment (e.g. UPPMAX or Kebnekaise) without changes, while the new system-specific YAML can be written once and then "forgotten"
I've gotten a java heap space error with MaltExtract. We need a way to easily configure memory usage. Maybe add an -Xmx flag to the MaltExtract rule?
Create_Sample_TaxID_Directories creates only directories for potential pathogens using the taxID.pathogens file instead of a file containing all taxID from the krakenuniq.output.filtered file.
Explanation of the problem:
However, since aMeta performs all the heavy steps with all organisms, I think it is suboptimal to generate the final graphs only for potential pathogens instead of all organisms. It is the Malt step that takes a lot of memory and time, but not the subsequent steps. We decided to keep all organisms for the Malt alignment to reduce misalignments, which makes sense, but then we should continue to use all organisms for the final authentication steps, which would allow this pipeline to be used for all metagenomic analyses, not just pathogen analyses. There is a lot of demand for general metagenomics, and I told people that aMeta could be used for all organisms because that is how I had used it with bash scripts before. But when I looked closely at the rules, I realized this problem. It needs to be fixed quickly because people are already testing the pipeline for general metagenomics.
Suggestion for fixing it:
In the authentication pipeline we have:
seqtk subseq {params.malt_fasta} {output.name_list} > {output.fasta}
This will take a while (I stopped the test after a few minutes) when running on a full-blown DB, and would be repeated for all tax IDs found for each sample.
The help for seqtk subseq
mentions:
Note: Use 'samtools faidx' if only a few regions are intended.
So maybe we could do just that? In which case we would need to add the constraint that the malt_fasta
DB is indexed
Should be removed and the rules simplified
If I am not mistaken, the Deamination rule solely creates a deamination plot. However, it can take more than 24h to run for large samples.
I think we need to reintroduce the option "number" here in the pmdtools command of this rule.
Suggestion:
--number 2000000
Should be sufficient to create a smooth plot according to colleagues.
When following up KrakenUniq hits with Bowtie2 on full NT and not on the PathoGenome (default), there is an error originating from duplicate records in the sam-header. One needs to grep the duplicates, clean the header, and replace it in the sam-file. I have the codes that do it, and will push a commit that fixes this issue soon.
in workflow/rules/malt.smk
Currently the workflow report consists of the workflow rulegraph and the default content (statistics, configuration). Things to add could be:
Currently the workflow is set up to run MALT with default parameters; whereas during testing parameters were changed to match those used in HOPS, etc. Since this may cause results to differ, this should be examined more closely. It could be that the parameters used in HOPS are better to use.
It may be better for the MALT rule have parameters explicitly stated, or is there a way to change or submit new parameters through the config file? This would allow for easier testing of different parameters.
Resources are assigned to rules by using the rule name verbatim in combination with a specific resource. To avoid confusion, I propopse we use verbatim rule names in the messages. For example, change
"QUANTIFYING MICROBIAL ABUNDANCE USING MALT SAM-ALIGNMENTS FOR SAMPLE {input.sam}"
to something like
"Malt_QuantifyAbundance: QUANTIFYING MICROBIAL ABUNDANCE USING MALT SAM-ALIGNMENTS FOR SAMPLE {input.sam}"
There are currently three post-installation configuration steps that are required to make github actions work:
These steps should be placed in a separate post-(test?)-installation script as these steps are also required for local tests but currently not implemented.
So I have noticed that whatever I change in the config file, I cannot allocate more cpu than 1 to the rule "KrakenUniq2Krona". I suspect that this is because their is another rule called KrakenUniq that I encounter this problem. Not very sure why otherwise cause it's the only rule for which I have this problem.
In addition to fastqc, multiqc also has support for cutadapt, bowtie2, so add these to multiqc dependencies. Also, collect before/after trimming results in one output?
As MinLuke reported in issue #131, newly created conda environments don't work on the mapDamage step because they don't find pysam.
A quick fix is to add pysam in the workflow/envs/mapdamage.yaml like this:
channels:
worked for him and for me
Currently authenticate.sh and authenticate.py contain hard-coded paths that should be removed.
If no pathogenic microbes are listed in results/KRAKENUNIQ/{sample}/taxID.pathogens
the MapDamage rule fails. Perhaps just a simple if-statement would help to skip this rule entirely in case the results/KRAKENUNIQ/{sample}/taxID.pathogens
is empty? Like this:
shell:
"mkdir {output.dir}; "
"if [ -s {input.pathogen_tax_id} ]; then "
'cat {input.pathogen_tax_id} | parallel -j {threads} "grep -w {{}} {params.pathogenome_path}/seqid2taxid.pathogen.map | cut -f1 > {output.dir}/{{}}.seq_ids" ; '
"for i in $(cat {input.pathogen_tax_id}); do xargs --arg-file={output.dir}/${{i}}.seq_ids samtools view -bh {input.bam} --write-index -@ {threads} -o {output.dir}/${{i}}.tax.bam; done >> {log} 2>&1; "
"find {output.dir} -name '*.tax.bam' | parallel -j {threads} \"mapDamage {params.options} -i {{}} -r {params.PATHO_DB} --merge-reference-sequences -d {output.dir}/mapDamage_{{}}\" >> {log} 2>&1 || true; "
"for filename in {output.dir}/*.tax.bam; do newname=`echo $filename | sed 's/tax\.//g'`; mv $filename $newname; done >> {log} 2>&1; "
"mv {output.dir}/mapDamage_{output.dir}/* {output.dir} >> {log} 2>&1; "
"rm -r {output.dir}/mapDamage_results >> {log} 2>&1; "
"else echo NO MICROBES TO AUTHENTICATE > {output.dir}/README.txt; fi"
Hello everyone,
I am currently running aMeta on some sample but I get an error message when it starts to compute with MapDamage.
It seems that the problem is that it is not able to import 'pysam' module but if I enter in python and writing 'import pysam' it works correctly.
I attach here the log file, in case someone have the solution:
[main_samview] region "KY933470.1" specifies an invalid region or unknown reference. Continue anyway.
[main_samview] region "KC118516.1" specifies an invalid region or unknown reference. Continue anyway.
Error: Could not import required module 'pysam':
- libhts.so.2: cannot open shared object file: No such file or directory
If module is not installed, please download from 'http://code.google.com/p/pysam/'.
A local install may be performed using the following command:
$ python setup.py install --user
Thank you
The Rule PMD_score is failing sometimes with no error message.
Log file :
[Wed Jul 20 18:08:30 2022]
Job 0: PMD_scores: COMPUTING PMD SCORES
Activating environment modules: GCC/10.2.0, OpenMPI/4.0.5, R/4.0.4, seqtk/1.3, SAMtools/1.12, MALT/0.5.3, HOPS/0.33, MEGAN/6.21.18, PMDtools/0.60
The following modules were not unloaded:
(Use "module --force purge" to unload all):
Removing output files of failed job PMD_scores since they might be corrupted:
results/AUTHENTICATION/Gok2cop_CGATGT_L007_merged_001.130313_SN344_0244_AC1U02ACXX/305/LN899819.1/305.PMDscores.txt
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Sometimes rule Post_Processing fails with the following error message:
Read 1 item
Error in image.default(x = 1:ncol(red.res), y = 1:nrow(red.res), z = t(red.res), :
increasing 'x' and 'y' values expected
image.default
Execution halted
There are no downstream rules that depend on Post_Processing output so better error handling would let the workflow run to completion, which is preferred.
when we switched from KrakenUniq v.0.5.8 to v.0.7.3, the new version of KrakenUniq cannot automatically detect jellyfish:
(ancient_microbiome_workflow) nikolay@dell:~/WABI/A_Gotherstrom/ancient-microbiome-smk/.test$ krakenuniq-build --db resources/KrakenUniq_DB --kmer-len 21 --minimizer-len 11
Kraken build set to minimize disk writes.
Finding all library files
Found 1 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using /home/nikolay/miniconda3/envs/ancient_microbiome_workflow/libexec/jellyfish-install/bin/jellyfish
Hash size not specified, using '9951131'
/home/nikolay/miniconda3/envs/ancient_microbiome_workflow/libexec/build_db.sh: line 46: /home/nikolay/miniconda3/envs/ancient_microbiome_workflow/libexec/jellyfish-install/bin/jellyfish: No such file or directory
xargs: cat: terminated by signal 13
The authentication plots pdf files are just so heavy that they make xquartz or slack fail when trying to open them.
It makes them not easy to share. Maybe we could have a jpg format as well that would require less space ? I suppose they take up quite some space as well.
As in the title, the rule MapDamage is only applied to pathogens. This is less problematic because it is a side branch of the pipeline for fast analysis, but if we want the pipeline to be used for metagenomics of all samples we need to be more inclusive here.
Hi all,
I've since fixed this by including a second sample, but as an FYI the Rscript "malt_abundance_matrix.R" fails when running the aMeta pipeline with only one sample:
Error in rule Malt_AbundanceMatrix_Sam:
jobid: 11
input: results/MALT_QUANTIFY_ABUNDANCE/S30455/sam_counts.txt
output: results/MALT_ABUNDANCE_MATRIX_SAM, results/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt
log: logs/MALT_ABUNDANCE_MATRIX_SAM/MALT_ABUNDANCE_MATRIX_SAM.log (check log file(s) for error message)
conda-env: /scratch/06909/cbscott/aMeta/.snakemake/conda/7ce3141b3314e1a9f5b4e35c7326cac1_
shell:
Rscript /scratch/06909/cbscott/aMeta/workflow/scripts/malt_abundance_matrix.R results/MALT_QUANTIFY_ABUNDANCE results/MALT_ABUNDANCE_MATRIX_SAM &> logs/MALT_ABUNDANCE_MATRIX_SAM/MALT_ABUNDANCE_MATRIX_SAM.log
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Log error message:
Error in colnames<-
(*tmp*
, value = "S30455") :
attempt to set 'colnames' on an object with less than two dimensions
Execution halted
Specifically, in the Rscript:
`merged<-Reduce(cbind,df)
colnames(merged)<-gsub(".sam_counts","",files)`
will return a vector rather than a data frame and produce the error.
It seems that this command in the rule Bowtie2_Index in align.smk
log:
f"logs/BOWTIE2_BUILD/{config['bowtie2_patho_db']}.log",
generates always the same warning when running the pipeline:
File path logs/BOWTIE2_BUILD//proj/nobackup/metagenomics/databases/PathoGenome/library.pathogen.fna.log contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
File path logs/BOWTIE2_BUILD//proj/nobackup/metagenomics/databases/PathoGenome/library.pathogen.fna.log contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
This is what I have in my config file:
bowtie2_patho_db: /proj/nobackup/metagenomics/databases/PathoGenome/library.pathogen.fna
This would be easily fixable by getting rid of the "/" in the command itself, but I also think that the generated path is just too long for a log file no ?
Sometimes when we run a one-threaded rule, it fails because it only uses one core and needs more memory than the one that core contains (KrakenUniq2krona, aggregate, Deamination, etc). In order to be able to give more memory to that rule we need to be able to give it more cores even if it won't use several threads. To do so, the rule need to have at least this definition in its snakemake:
In the authentication plot pdf file, some plots are generated from the rma6 files and some from the sam file. This is not ideal because the sam file contains more reads when it comes to the Histogram of PMD scores
plot and the Read length distribution plot. I understand that it could make sense for the Breadth of coverage plot though because you don't want to have only unique region specific to the species covered but also conserved regions covered accross species I suppose.
Maybe it doesn't need to be changed, but there should be some way to see on the plot which ones are made from which input, like a color scheme or blocs or something. And an explanation of the difference between the inputs and why it make sense that they have more or less reads. Just an idea on how to make people understand it.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.