The nanoasv from imagoxv

Checkpoint system

I need to add a checkpoint system so that data analysis can resume after an error without re-computing the steps that already completed.

Similarly, I should add an R_ONLY option that only rebuilds the phyloseq data, in case the metadata.csv file did not work.
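A minimal sketch of how such a checkpoint system could look, assuming one marker file per completed step. The `run_step` helper, the `CHECKPOINT_DIR` variable and the `.done` naming are hypothetical and not part of the current script:

```shell
# Hypothetical checkpoint helper (names and paths are illustrative only).
CHECKPOINT_DIR="${CHECKPOINT_DIR:-./.checkpoints}"

# run_step NAME COMMAND [ARGS...]
# Runs COMMAND only if no marker file exists for NAME. The marker is written
# only after COMMAND succeeds, so a failed step is retried on the next run.
run_step() {
    step_name="$1"
    shift
    mkdir -p "$CHECKPOINT_DIR"
    if [ -e "$CHECKPOINT_DIR/$step_name.done" ]; then
        echo "Skipping $step_name (already completed)"
        return 0
    fi
    "$@" || return 1
    touch "$CHECKPOINT_DIR/$step_name.done"
}
```

Each pipeline step would then be wrapped, for example `run_step trimming trim_reads`, where `trim_reads` stands for whatever function or command performs the step. An R_ONLY mode could be built the same way by skipping every step except the last one.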

script should test for the presence of required binaries

for example:

which mafft > /dev/null || \
    { echo "error message" ; exit 1 ; }

Tests should occur early in the script, before any time-consuming computation.
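The same idea can be extended to several tools at once. This is only a sketch: the `check_binaries` function name is an assumption, and it uses `command -v` instead of `which` because `command -v` is specified by POSIX:

```shell
# check_binaries TOOL...
# Prints every missing tool to stderr and returns non-zero if any is absent,
# so that all missing dependencies are reported in a single run.
check_binaries() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "Error: required binary '$tool' not found in PATH" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Near the top of the script, one line such as `check_binaries mafft vsearch samtools || exit 1` would then stop the run before any computation starts. The tool list here is only an example and should match what the script really calls.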

Produce phylo tree for phyloseq object

Need to singleline SILVA directly in the Dockerfile, because that is the right moment to do it.

Need to extract the reference sequences to build the tree with FastTree.

Need to feed the Rscript with the tree and then inject it into the phyloseq object.

Almost there

The concatenation step is long; it needs to be executed in parallel.
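One possible way to do this, sketched under the assumption that the input directory contains one `barcodeXX` folder per barcode, each holding several `.fastq.gz` files. The `concat_barcodes` function and its arguments are hypothetical; `-print0`, `-0` and `-P` are GNU/BSD extensions of `find` and `xargs`:

```shell
# concat_barcodes INPUT_DIR OUTPUT_DIR JOBS
# Concatenates INPUT_DIR/barcodeXX/*.fastq.gz into OUTPUT_DIR/barcodeXX.fastq.gz,
# processing up to JOBS barcode directories at the same time.
concat_barcodes() {
    in_dir="$1"
    out_dir="$2"
    jobs="$3"
    mkdir -p "$out_dir"
    find "$in_dir" -mindepth 1 -maxdepth 1 -type d -name 'barcode*' -print0 |
        xargs -0 -P "$jobs" -I {} sh -c '
            name=$(basename "$1")
            cat "$1"/*.fastq.gz > "$2/$name.fastq.gz"
        ' _ {} "$out_dir"
}
```

A call such as `concat_barcodes "$DIR" "$TMP" 4` would handle four barcodes at a time. Concatenating gzip files with `cat` is safe because a sequence of gzip members is itself a valid gzip stream.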

title

No control over Porechop CPU usage.

Porechop uses as many CPUs as possible, no matter the parallelization I wrote.

It might detect and use whatever is asked. It might even do that in parallel somehow.

Still, it seems to be a pretty lightweight computation.
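Porechop does accept a `--threads` option, so one way to regain control could be to split the available cores between the barcodes processed in parallel. A sketch, where the `threads_per_job` helper is hypothetical:

```shell
# threads_per_job TOTAL_CORES PARALLEL_JOBS
# Prints how many threads each job may use so that all jobs together
# stay within TOTAL_CORES (never less than 1).
threads_per_job() {
    per_job=$(( $1 / $2 ))
    if [ "$per_job" -lt 1 ]; then
        per_job=1
    fi
    echo "$per_job"
}

# Possible use inside the trimming loop (variables are placeholders):
# THREADS=$(threads_per_job "$(nproc)" 4)
# porechop -i "$FILE" -o "trimmed_$FILE" --threads "$THREADS"
```

With 16 cores and 4 barcodes trimmed at once, each Porechop process would be limited to 4 threads instead of competing for all of them.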

Subsampling before chimera detection

Chimera detection seems long for some highly sequenced barcodes.
I need to add a subsampling step before chimera detection, maybe something like --subsampling XX * 2 to allow for some buffer sequences.
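A sketch of what this pre-subsampling could look like with vsearch's `--fastx_subsample` command. The two function names are hypothetical, and the factor of 2 is simply the buffer suggested above:

```shell
# buffered_sample_size N
# Prints twice the requested subsampling depth, leaving buffer sequences
# in case some reads are later discarded as chimeras.
buffered_sample_size() {
    echo $(( $1 * 2 ))
}

# presample_for_chimera INPUT_FASTQ OUTPUT_FASTQ N
# Downsamples INPUT_FASTQ to 2*N reads before chimera detection.
presample_for_chimera() {
    vsearch --fastx_subsample "$1" \
            --sample_size "$(buffered_sample_size "$3")" \
            --fastqout "$2"
}
```

With `--subsampling 50000`, chimera detection would then see at most 100000 reads per barcode. Barcodes holding fewer reads than the target would probably need to skip this step, since there is nothing to subsample.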

add a citation file

For a given repository, GitHub can parse and expose a recommended way to cite the repository. It only requires a CITATION.cff file. Check for instance:

https://github.com/frederic-mahe/mumu/blob/main/CITATION.cff

which is exposed as:

Mahé, F. (2023). mumu: post-clustering curation tool for metabarcoding data (Version 1.0.2) [Computer software]. https://github.com/frederic-mahe/mumu

First Test on real dataset

The first test on a real dataset had memory handling issues for some barcodes.
The dataset has to be subsampled.

Numerical taxonomic richness seems to have increased compared to my previous treatment. I have to investigate.

Not running on aarch64

I cannot make it run on the MK1C because chopper was compiled for amd64.

Need to find a way

Binary Release?

Hi @frederic-mahe, I tried to make a release to see how it works. However, my binary is too big (~5 GB). The maximum allowed is 2 GB.

Any idea how to overcome this?

Arthur

Need to specify software versions for reproducibility

Need to specify the versions in the Dockerfile installation so that the same tool versions are always used.

bwa Version: 0.7.17-r1188 (might consider upgrading to a more recent one)
Chopper v0.7.0
fasttree: only one version? Might consider using FastTree2
MAFFT v7.490 (2021/Oct/30)
Porechop-0.2.4
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
samtools 1.13
vsearch v2.21.1_linux_x86_64

I should probably pin the library versions as well.

Need to update bwa to bwa-mem2

I used bwa for simplicity's sake, but now I need to replace it with bwa-mem2, which is supposed to be faster and more memory efficient.

Phyloseq installation is as long as SILVA indexing

Phyloseq installation is way too long.

Maybe because of dependencies = TRUE

Indexing parallelisation ?

I wonder if it is possible to parallelize the indexing step (which is clearly the most compute-intensive part of the build process).

Should remove singletons before multiple alignments

Singletons are discarded anyway, so removing them earlier will reduce computation time.

Singularity keeps looking for bin outside the container

This drives me crazy.

On the IFB cluster, with NanoASV running under Singularity:

/shared/software/modules/4.6.1/init/bash: line 37: /usr/bin/tclsh: No such file or directory

The whole purpose of a container is to NOT LOOK OUTSIDE OF IT, isn't it?

Chimera detection #2

Vsearch never seems to detect chimeras with default parameters.

I think this is because the sequences are not dereplicated and therefore have no "count" (abundance) annotation in the fasta header.
However, I think dereplication might not work because vsearch expects 100% similarity, which is rarely (if ever) achieved with nanopore amplicon sequencing.
Efficient dereplication would require accepting a certain variability threshold, which would end up being clustering. Such clustering with vsearch performs well with --id 0.7, which is significantly lower than what we would want to accept for dereplication. And if we cluster, it is not ASV treatment anymore.

I need to discuss it with you @frederic-mahe

To do : Reimplement unknown sequences cluster discard

For unknown reasons, unknown clusters with abundance < 5 are no longer discarded, and all these singletons make my alignment a nightmare. I have to change that.

Singularity version known Issue : WARNING: could not mount /etc/localtime: not a directory

WARNING: could not mount /etc/localtime: not a directory

This warning message does not appear when running with docker but does with singularity.

It seemingly has no consequences on downstream analyses.

Docker image run with singularity. Error phyloseq R package : "Cannot open shared object"

Error: package or namespace load failed for ‘phyloseq’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/imago/R/x86_64-pc-linux-gnu-library/4.3/stringi/libs/stringi.so':
  libicui18n.so.70: cannot open shared object file: No such file or directory
Execution halted

Something might have changed.

I'm pretty sure that's because of ubuntu:latest

Gotta change for ubuntu 22.04

I'm sure at some point it was specified. IDK what happened

Replace NanoFilt and Porechop with Chopper

conda install -c bioconda chopper

The problem is that I still need to install conda in my dockerfile

use vsearch option to eliminate a call to awk

NanoASV/script.sh

Line 428 in 79f3132

    
           awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' Consensus_seq_OTU.fasta > singleline_Consensus_seq_OTU.fasta

with the option --fasta_width 0, vsearch produces unwrapped/unfolded fasta files.

Running nanoasv smoothly

It seems that singularity is called automatically when running nanoasv, which makes the following
singularity run nanoasv --options unnecessary. If the nanoasv singularity file is executable, then just run ./nanoasv, or nanoasv if you put it in /opt/ and add /opt/ to the $PATH.

A nice way to do it

echo 'export PATH=$PATH:/opt/' >> ~/.bashrc && source ~/.bashrc

which makes

~$ nanoasv 
WARNING: could not mount /etc/localtime: not a directory
 ______________________________________
/ Error: -d needs an argument, I don't \
\ know where your sequences are.       /
 --------------------------------------
        \   ^__^
         \  (xx)\_______
            (__)\       )\/\
             U  ||----w |
                ||     ||

Lovely

aarch64 - MK1C fail on minimal dataset

Step 4/9 : Adapter trimming with Porechop
Step 5/9 : Subsampling
Step 6/9 : Reads alignements with bwa against SILVA_138.1
environment: line 1:   218 Segmentation fault      (core dumped) bwa mem ${DB}/SILVA_IDX "${FILE}" 2> /dev/null > "${FILE}.sam"
environment: line 1:   221 Segmentation fault      (core dumped) bwa mem ${DB}/SILVA_IDX "${FILE}" 2> /dev/null > "${FILE}.sam"
Step 7/9 : Skipped - no unknown sequence
Step 8/9 : Phylogeny with MAFFT and FastTree
Step 9/9 : Phylosequization with R and phyloseq
Data treatment is over.
NanoASV took 144 seconds to perform.

This indicates a memory related error.

If only the MK1C were not running dozens of useless jobs in the background.

I'll find a way
