
Comments (7)

nextgenusfs commented on August 20, 2024

Thanks, I will take a look. The new edlib library was failing for this search for reasons I don't understand yet, so I reverted the amptk database command to the old method; perhaps I just didn't fix something correctly.

from amptk.

nextgenusfs commented on August 20, 2024

Seems like it was a typo; it should be fixed in 56e9de9.


vmikk commented on August 20, 2024

Thank you for the quick fix!
Dereplication is working now; however, there is another issue with sequence headers: they are replaced with a single character (see the *.extracted.fa file):

>;
AACGCACATTG...
>:
AACGCACATTG...
>=
AACGCACATTG...
>A
AACGCACATTG...
>C
AACGCACATTG...

As a result, only 69 sequences remain in the dereplicated file.
This is the command I used:

amptk database \
  -i sh_general_release_dynamic_s_28.06.2017_dev.fasta \
  -o ITS2_UTAX_new \
  --create_db utax \
  -f fITS7 -r ITS4 \
  --derep_fulllength

Here are the log messages:

[09:07:31 AM]: Searching for primers, this may take awhile: Fwd: fITS7  Rev: ITS4
[09:07:32 AM]: 58,153 records loaded
[09:07:32 AM]: Using 8 cpus to process data
[09:09:54 AM]: 56,811 records passed (97.69%)
[09:09:54 AM]: Now dereplicating sequences (remove if sequence and header identical)
[09:09:55 AM]: 69 records passed (0.12%)
[09:09:55 AM]: USEARCH version: usearch v9.2.64_i86linux32
[09:09:55 AM]: Creating UTAX Database, this may take awhile
[09:09:55 AM]: There was a problem creating the DB, check the UTAX log file /home/mik/amptk/DB/ITS2_UTAX_new.utax.log
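
The symptom above (headers collapsed to one character, so almost every record looks like a duplicate) can be checked independently of AMPtk by counting distinct headers and sequences in the extracted FASTA file. The following is only a minimal sketch: `read_fasta` and `summarize` are hypothetical helper names, not AMPtk functions, and the demo records are made up for illustration.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Count records, distinct headers, distinct sequences, and 0-1 character headers."""
    records = list(read_fasta(handle))
    headers = Counter(h for h, _ in records)
    return {
        "records": len(records),
        "unique_headers": len(headers),
        "unique_sequences": len({s for _, s in records}),
        "short_headers": sum(1 for h, _ in records if len(h) <= 1),
    }


if __name__ == "__main__":
    # Made-up records that mimic the single-character headers reported above.
    demo = io.StringIO(">;\nAACG\n>:\nAACG\n>A\nAACT\n")
    print(summarize(demo))
```

To run it on a real file, pass an open handle, for example `summarize(open("ITS2_UTAX_new.extracted.fa"))` (the exact file name is an assumption based on the `-o` prefix used in the command). A large `short_headers` count would confirm that the headers were lost before dereplication.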


nextgenusfs commented on August 20, 2024

Sorry about that, it seems the dereplication function was broken; this should fix it: 4a2cc3f. It now looks for identical sequences, then compares their taxonomy levels and keeps the header with more taxonomy information. If the headers have the same number of taxonomy levels but the taxonomy differs, it moves up one taxonomy level and moves on. For example, if two ITS2 sequences from two different species are identical, the sequence is renamed to the genus level only, since identification to species isn't possible and would only confuse the UTAX classifier.
By the way, the ITS databases packaged with AMPtk are up to date with the newest version of UNITE, so you can just use the pre-formatted versions (unless you are adding custom sequences, etc.). I will re-make the pre-formatted databases with the new method and swap them in, hopefully later today. I will then release a new version with the updates. Let me know if this is still not working for you.
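
The dereplication rule described above could be sketched roughly as follows. This is not the code from the commit: the function names are hypothetical, the `ID;tax=...` header layout is an assumption based on the UTAX-style headers in this thread, and conflicting taxonomy is resolved by keeping the shared leading levels, which may go up more than one level at a time.

```python
def parse_header(header):
    """Split 'ID;tax=k:Fungi,p:...' into (ID, [taxonomy levels])."""
    seq_id, _, tax = header.partition(";tax=")
    levels = [t for t in tax.rstrip(";").split(",") if t]
    return seq_id, levels


def merge_taxonomy(a, b):
    """Combine the taxonomy lists of two identical sequences."""
    if a == b:
        return a
    short, long_ = sorted((a, b), key=len)
    if len(short) < len(long_) and long_[:len(short)] == short:
        return long_  # one header simply carries more taxonomy information
    common = []
    for x, y in zip(a, b):  # conflict: keep only the shared leading levels
        if x != y:
            break
        common.append(x)
    return common


def dereplicate(records):
    """Collapse identical sequences (case-insensitive) into one record each."""
    merged = {}
    for header, seq in records:
        seq_id, levels = parse_header(header)
        key = seq.upper()
        if key not in merged:
            merged[key] = (seq_id, levels)
            continue
        kept_id, kept = merged[key]
        new_tax = merge_taxonomy(kept, levels)
        if new_tax == levels and new_tax != kept:
            kept_id = seq_id  # take the ID of the more informative header
        merged[key] = (kept_id, new_tax)
    return [(f"{sid};tax={','.join(tax)}", seq) for seq, (sid, tax) in merged.items()]


if __name__ == "__main__":
    # Hypothetical IDs and shortened taxonomy strings, for illustration only.
    demo = [
        ("A1;tax=k:Fungi,g:Entoloma,s:Entoloma_one", "ACGT"),
        ("A2;tax=k:Fungi,g:Entoloma,s:Entoloma_two", "ACGT"),
        ("A3;tax=k:Fungi,g:Entoloma", "ACGA"),
        ("A4;tax=k:Fungi,g:Entoloma,s:Entoloma_one", "ACGA"),
    ]
    for header, seq in dereplicate(demo):
        print(">" + header)
        print(seq)
```

In the demo, the first pair disagrees at species level and is reduced to genus, while the second pair keeps the more detailed header.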


vmikk commented on August 20, 2024

Thank you for the bug fix!
The approach of using the taxonomy of the last common ancestor (LCA) for identical sequences looks very reasonable. However, I have two notes:

  • How should sequence IDs be handled? As I understand it, the accession number of the first sequence is used now. Maybe it’s better to mark the sequence header to indicate that it carries LCA taxonomy, or just to concatenate the sequence IDs? This would help to track down all the initial sequences if they match a query.
  • Does the consensus taxonomy follow the majority rule now? I experimented with different cases: I duplicated several sequences and renamed their taxonomy, so that there are three Entoloma species and one Entolomella:
>…tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Entolomataceae,g:Entoloma,s:Entoloma_one
>…tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Entolomataceae,g:Entoloma,s:Entoloma_two
>…tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Entolomataceae,g:Entoloma,s:Entoloma_three
>…tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Entolomataceae,g:Entolomella,s:Entolomella_four

Taxonomy of the deduplicated sequence was:

>…tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Entolomataceae,g:Entoloma

Shouldn’t it be:

>…tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Entolomataceae
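
The difference between the two expectations can be made concrete with a small sketch. It compares a strict LCA with a simple majority-rule consensus on the four test headers above, and also shows the concatenated-ID idea from the first note. The function names, the 50% threshold, and the `seq1`...`seq4` IDs are assumptions for illustration (the real IDs are elided above); this is not how AMPtk is known to implement it.

```python
from collections import Counter


def strict_lca(taxonomies):
    """Keep only the leading levels shared by every taxonomy."""
    common = []
    for ranks in zip(*taxonomies):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return common


def majority_consensus(taxonomies, threshold=0.5):
    """Keep each level supported by more than `threshold` of all inputs."""
    consensus = []
    current = list(taxonomies)
    for depth in range(max(len(t) for t in current)):
        votes = Counter(t[depth] for t in current if len(t) > depth)
        if not votes:
            break
        name, count = votes.most_common(1)[0]
        if count / len(taxonomies) <= threshold:
            break
        consensus.append(name)
        # Only lineages that agree so far may vote on the next level.
        current = [t for t in current if len(t) > depth and t[depth] == name]
    return consensus


if __name__ == "__main__":
    shared = "k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Entolomataceae"
    records = {  # hypothetical IDs; taxonomy copied from the four test headers
        "seq1": shared + ",g:Entoloma,s:Entoloma_one",
        "seq2": shared + ",g:Entoloma,s:Entoloma_two",
        "seq3": shared + ",g:Entoloma,s:Entoloma_three",
        "seq4": shared + ",g:Entolomella,s:Entolomella_four",
    }
    taxa = [t.split(",") for t in records.values()]
    merged_id = "|".join(records)  # concatenated IDs keep all sources traceable
    print(">" + merged_id + ";tax=" + ",".join(strict_lca(taxa)))
    print(">" + merged_id + ";tax=" + ",".join(majority_consensus(taxa)))
```

Under these assumptions the strict LCA stops at `f:Entolomataceae`, whereas the majority-rule consensus stops at `g:Entoloma`, since three of the four sequences agree on the genus.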


nextgenusfs commented on August 20, 2024

This should be more dynamic: 18b1458. I'm thinking about adding an --lca option, as there may be times when you want to dereplicate but not reduce taxonomy. I typically only use the dereplicate function when creating the UTAX databases; otherwise I just keep all the data for a global alignment database via USEARCH. In the case of the UTAX-trained databases, the sequence accession is lost anyway during taxonomy assignment.


vmikk commented on August 20, 2024

Thank you for the fix!

Exactly, my point was about taxonomy reduction in the case where only global alignments are used (via USEARCH).

