Comments (26)

nextgenusfs avatar nextgenusfs commented on August 20, 2024 1

You have a slash in your output command which tells it to make a directory....

JonathanVanHamme avatar JonathanVanHamme commented on August 20, 2024

Thanks, Jon - looks like it is a python 2/3 problem as you thought. I created a virtual environment with Python 3.7.0 and reran the command. I got a warning, but it looks like it finished the job:

(myvenv) [email protected] [~/Downloads/2018Aug28_COI]$ /usr/local/Cellar/amptk/1.2.4/libexec/bin/bold2amptk.py -i arthropoda.bold.bins.fa -o arthropods
Loading 410,649 sequence records
Searching for forward primer: GGTCAACAAATCATAAAGATATTGG, and reverse primer: GGSACSGGSTGAACSGTSTAYCCYCC
Requiring reverse primer match with at least 4 mismatches
121,174 seqs passed -> now dereplicating
/usr/local/Cellar/amptk/1.2.4/libexec/bin/bold2amptk.py:37: DeprecationWarning: 'U' mode is deprecated
with open(input, 'rU') as in_file:
98,493 unique records written to arthropods.all4usearch.fa
90,000 records for UTAX training written to arthropods.genus4utax.fa

After that, I tried running 'amptk database' but got some more errors:

(myvenv) [email protected] [~/Downloads/2018Aug28_COI]$ /usr/local/Cellar/amptk/1.2.4/libexec/bin/amptk database -i arthropods.genus4utax.fa -o COI_UTAX --format off --create_db utax --skip_trimming

[Aug 28 03:03 PM]: OS: MacOSX 10.11.6, 8 cores, ~ 34 GB RAM. Python: 3.7.0
Traceback (most recent call last):
  File "/usr/local/Cellar/amptk/1.2.4/libexec/bin/amptk-extract_region.py", line 416, in <module>
    amptklib.SystemInfo()
  File "/usr/local/Cellar/amptk/1.2.4/libexec/lib/amptklib.py", line 210, in SystemInfo
    mod_versions()
  File "/usr/local/Cellar/amptk/1.2.4/libexec/lib/amptklib.py", line 225, in mod_versions
    if not gvc(vers, '1.2.1'):
  File "/usr/local/Cellar/amptk/1.2.4/libexec/lib/amptklib.py", line 134, in gvc
    if versiontuple(input) >= versiontuple(check):
  File "/usr/local/Cellar/amptk/1.2.4/libexec/lib/amptklib.py", line 131, in versiontuple
    return tuple(map(int, (v.split("."))))
ValueError: invalid literal for int() with base 10: 'post1'

So... I reverted back to Python 2.7.14 and it appears to be working:

$ amptk database -i arthropods.genus4utax.fa -o COI_UTAX --format off --create_db utax --skip_trimming

[Aug 28 03:05 PM]: OS: MacOSX 10.11.6, 8 cores, ~ 34 GB RAM. Python: 2.7.14
[Aug 28 03:05 PM]: AMPtk v1.2.4-af8f8f11e3, USEARCH v9.2.64, VSEARCH v2.8.1
[Aug 28 03:05 PM]: Working on file: arthropods.genus4utax.fa
[Aug 28 03:05 PM]: 90,000 records loaded
[Aug 28 03:05 PM]: Using 8 cpus to process data
[Aug 28 03:05 PM]: 90,000 records passed (100.00%)
[Aug 28 03:05 PM]: Errors: 0 no taxonomy info, 0 length out of range, 0 too many ambiguous bases, 0 no primers found
[Aug 28 03:05 PM]: Creating UTAX Database, this may take awhile

Some of the other commands, such as "amptk SRA-submit" don't work for me unless I use AMPtk 1.0.3 with Python 2.7.14. Must be something I have messed up with my configuration - that's what happens when you set amateurs loose on the command line.

Thanks again for the help, hope all is well!
Jon
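
A note on the DeprecationWarning in the bold2amptk.py output above: the 'U' flag in open() was deprecated in Python 3 and removed in Python 3.11, because text mode already translates line endings by default. Below is a minimal sketch of a record counter that opens the file without 'U' and should behave the same on Python 2.7 and 3. The function name count_fasta_records is made up for illustration and is not AMPtk's own code.

```python
import io


def count_fasta_records(path):
    """Count FASTA records by counting header lines (hypothetical helper)."""
    count = 0
    # io.open in text mode uses universal newlines by default on both
    # Python 2.7 and Python 3, so the deprecated 'U' flag is not needed.
    with io.open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count
```

Calling count_fasta_records("arthropods.genus4utax.fa") would then return the number of header lines in that file.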

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Sorry you are having to deal with this. If you switch over to a conda-based installation, things should probably go more smoothly in the future. I'll see if I can track down this py2/3 problem and get it fixed in the next release.
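
For anyone reading the traceback above: the ValueError is raised because every dot-separated piece of a module's version string is passed to int(), and a piece such as 'post1' is not a number. Here is a minimal sketch of a more tolerant comparison. It reuses the function names versiontuple and gvc from the traceback, but the bodies are an assumption for illustration, not the fix that was actually applied in AMPtk.

```python
import re


def versiontuple(v):
    """Return the leading numeric components of a version string.

    Parsing stops at the first component without leading digits, so
    '1.2.1.post1' becomes (1, 2, 1) instead of raising ValueError.
    """
    parts = []
    for piece in v.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def gvc(installed, required):
    """True if the installed version is at least the required one."""
    return versiontuple(installed) >= versiontuple(required)
```

This deliberately ignores pre- and post-release suffixes. A full implementation of Python version ordering is more involved.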

JonathanVanHamme avatar JonathanVanHamme commented on August 20, 2024

No worries, Jon - I'll get the conda install on my desktop. I got through everything for making the new database on my server, except the final step of the UTAX database. Do I need to purchase the 64-bit version of usearch to get this done?

usearch v9.2.64_i86linux32, 4.0Gb RAM (115Gb total), 24 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

00:00 78Mb 100.0% Reading /miniconda2/envs/amptk/opt/amptk-1.2.4/DB/COI_UTAX.extracted.fa
00:00 44Mb 100.0% Converting to upper case
00:01 45Mb 100.0% Word stats
00:01 45Mb 100.0% Alloc rows
00:02 155Mb 100.0% Build index
00:02 162Mb 100.0% Initialize taxonomy data
00:02 163Mb 100.0% Building name table
00:02 163Mb 31348 names, tax levels min 2, avg 5.3, max 7
00:03 199Mb 100.0% Word stats
00:03 199Mb 100.0% Alloc rows
00:05 308Mb 100.0% Build index
38:36 4.3Gb 23.1% Distance matrix/usort

usearch9 -makeudb_utax /miniconda2/envs/amptk/opt/amptk-1.2.4/DB/COI_UTAX.extracted.fa -output /miniconda2/envs/amptk/opt/amptk-1.2.4/DB/COI_UTAX.udb -report /miniconda2/envs/amptk/opt/amptk-1.2.4/DB/COI_UTAX.report.txt -utax_trainlevels kpcofgs -utax_splitlevels NVkpcofgs -notrunclabels

---Fatal error---
Memory limit of 32-bit process exceeded, 64-bit build required

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Yes, that is what the error is from. You can try to limit the number of sequences going into that step; I think I have an --lca option that might be useful.

JonathanVanHamme avatar JonathanVanHamme commented on August 20, 2024

Hey Jon,
I tried with the 64-bit version of usearch, but it is v11.0.667 - that failed. If I get a hold of the 64-bit version of v9, think that would be compatible? I tried limiting the sequences with --lca and --derep_fulllength but that didn't help.
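
To illustrate what full-length dereplication means as a way of shrinking the input: records whose sequences are identical over their whole length are collapsed, and only the first one is kept. The sketch below shows the idea on (header, sequence) pairs. It is an illustration only, not the code behind --derep_fulllength, and the function name is made up.

```python
def derep_fulllength(records):
    """Keep the first record for each distinct full-length sequence.

    `records` is an iterable of (header, sequence) pairs; the comparison
    is case-insensitive and the input order is preserved.
    """
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        unique.append((header, seq))
    return unique
```

Note that this only removes exact duplicates, so a reference set with many near-identical sequences can still be large afterwards.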

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Okay. The UTAX command was phased out in usearch10, I believe. That is one of the many problems with relying on closed-source software. I haven’t tried v11 yet.

nextgenusfs avatar nextgenusfs commented on August 20, 2024

You also don’t have to use the hybrid taxonomy method. You could use usearch or sintax separately, neither of which requires a UTAX-trained database. But the FASTA headers still need to be properly formatted. So if you just leave out the --create_db option, it will generate the FASTA file, which you can pass to --fasta_db in the taxonomy script, coupled with -m usearch, to just use global alignment.

JonathanVanHamme avatar JonathanVanHamme commented on August 20, 2024

That's brilliant, Jon - I'll do that! I definitely owe you a beer, or a keg of beer.

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Looks like I mistyped above (I was on my phone). Basically, you can just use global alignment if you pass --method usearch to amptk taxonomy. If the FASTA file is too large to create a usearch database with amptk database --create_db usearch, then the alternative is to use the --fasta_db option, which will invoke vsearch (no memory issues, as it is 64-bit) and run global alignment that way. I'm hoping you don't run into errors (you might, as I can't recall off the top of my head what happens if you don't pass a --db option). The key is just to make sure the FASTA headers are formatted correctly; see https://amptk.readthedocs.io/en/latest/taxonomy.html#taxonomy-databases for the format. You can also take a look at the pre-packaged FASTA files to see the format. It's a lot of work to get the scripts to reformat data from the different sources, so it isn't perfect and you might still run into some issues (basically, you might find something in those headers that I hadn't seen before and that therefore isn't getting processed correctly).
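
A quick way to spot headers that were not processed correctly is to scan the FASTA file for header lines that do not match the expected pattern. The sketch below assumes a UTAX-style layout of the form >ID;tax=k:...,p:...,c:... (one lowercase level letter, a colon, then the name). That layout is an assumption here and should be checked against the documentation linked above and the pre-packaged files. The names HEADER_RE and malformed_headers are made up for illustration.

```python
import re

# Assumed layout: >ID;tax=k:Kingdom,p:Phylum,...  (verify against the docs)
HEADER_RE = re.compile(r"^>[^;\s]+;tax=[a-z]:[^,;]+(?:,[a-z]:[^,;]+)*;?$")


def malformed_headers(lines):
    """Return the header lines that do not match the assumed layout."""
    bad = []
    for line in lines:
        line = line.strip()
        if line.startswith(">") and not HEADER_RE.match(line):
            bad.append(line)
    return bad
```

Passing an open file handle, for example malformed_headers(open("arthropods.genus4utax.fa")), would list the headers worth inspecting by hand.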

JonathanVanHamme avatar JonathanVanHamme commented on August 20, 2024

Cool, I'm having fun learning regex these days, and torturing my students with parsing challenges, so I'll keep a close eye on the headers. If I come across anything interesting that may be useful, I'll let you know.

JonathanVanHamme avatar JonathanVanHamme commented on August 20, 2024

Hello again, Jon. Robert Edgar was kind enough to send me the 64-bit version of USEARCH v9, so I was able to generate the UTAX database.

Running AMPtk v 1.2.4

------------------------------
Taxonomy Databases Installed:
------------------------------
DB_name       DB_type  FASTA originated from     Fwd Primer  Rev Primer  Records  Date
COI.udb       usearch  arthropoda.bold.bins.fa   None        None        410478   2018-08-28
COI_UTAX.udb  utax     arthropods.genus4utax.fa  None        None        90000    2018-09-17

But, when I try to assign taxonomy I get an error indicating that the 'otuDict' is not defined (see below). Is that another file I need to generate? Everything runs properly when I use your COI database, and I have tried Python2 and Python3, as well as AMPtk 1.0.3.

Thanks!
Jon

jvanhamme@ca [$ amptk taxonomy -f clusterINSECTS.filtered.otus.fa -i clusterINSECTS.otu_table.txt -m Jordann_AMPtk_Mapping_ALLsamples_ONCE.txt -d COI
-------------------------------------------------------
[Sep 24 05:16 PM]: OS: MacOSX 10.11.6, 8 cores, ~ 34 GB RAM. Python: 2.7.14
[Sep 24 05:16 PM]: AMPtk v1.2.4-af8f8f11e3, USEARCH v9.2.64, VSEARCH v2.8.1
[Sep 24 05:16 PM]: Loading FASTA Records
[Sep 24 05:16 PM]: 422 OTUs
[Sep 24 05:16 PM]: Global alignment OTUs with usearch_global (USEARCH)
[Sep 24 05:17 PM]: Classifying OTUs with UTAX (USEARCH)
[Sep 24 05:17 PM]: Classifying OTUs with SINTAX (USEARCH)
[Sep 24 05:18 PM]: Appending taxonomy to OTU table and OTUs
Traceback (most recent call last):
  File "/usr/local/Cellar/amptk/1.2.4/libexec/bin/amptk-assign_taxonomy.py", line 397, in <module>
    tax = otuDict.get(rec.id) or "No hit"
NameError: name 'otuDict' is not defined

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Sounds like a bug. Will try to look at it as soon as I have some time.

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Not sure why this is happening; I don't see anything obvious in the code. Can you confirm that the output files have been created successfully? That is, you should have $base.usearch.txt, $base.utax.txt, and $base.sintax.txt files in the directory. The script parses those results, so it would be helpful if you ran something like head on each of those files so I can see the output. Alternatively, would you be able to provide a smaller test set with which I can reproduce the error locally?
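
The check described above can be scripted in a few lines. This is a minimal sketch that looks for the three result files named in the comment and returns the first few lines of each, much like head would. The function name check_outputs and its parameters are made up for illustration.

```python
import os


def check_outputs(base, directory=".", n=5):
    """Report the first n lines of each expected result file.

    Returns a dict keyed by 'usearch', 'utax' and 'sintax'; the value is
    None when the corresponding $base.<key>.txt file does not exist.
    """
    report = {}
    for suffix in ("usearch", "utax", "sintax"):
        path = os.path.join(directory, "{}.{}.txt".format(base, suffix))
        if not os.path.isfile(path):
            report[suffix] = None
            continue
        with open(path) as handle:
            # zip stops after n lines without reading the rest of the file
            report[suffix] = [line.rstrip("\n") for _, line in zip(range(n), handle)]
    return report
```

For the run in this thread the call would be check_outputs("clusterINSECTS"), and any key mapped to None points at a step that produced no file.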

nextgenusfs avatar nextgenusfs commented on August 20, 2024

647054c will likely not error out anymore, but I don't think it will fix the underlying problem.

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Latest commit 17fc498 should be a bit better about checking if parsing taxonomy has worked -- it will write number of taxonomy assignments to log file for each step. Hopefully this tells you where the error is.
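
A NameError like the one for otuDict typically appears when a variable is only assigned inside a branch that did not run, for instance when a results file is missing. The sketch below shows the defensive pattern in general terms: start from an empty dict, fill it only if the file exists, and report how many assignments were parsed. It assumes a two-column, tab-separated results file, which is a guess for illustration. None of this is AMPtk's actual code, and the function names are made up.

```python
import os


def load_assignments(path):
    """Return {otu_id: taxonomy}; an empty dict if the file is missing."""
    assignments = {}  # defined up front so later lookups cannot raise NameError
    if os.path.isfile(path):
        with open(path) as handle:
            for line in handle:
                cols = line.rstrip("\n").split("\t")
                if len(cols) >= 2:
                    assignments[cols[0]] = cols[1]
    return assignments


def annotate(otu_ids, assignments):
    """Map every OTU to its taxonomy string, or 'No hit' when absent."""
    return {otu: assignments.get(otu) or "No hit" for otu in otu_ids}
```

Logging len(load_assignments(path)) for each step gives the kind of per-step count mentioned above, and a count of zero shows which step failed.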

JonathanVanHamme avatar JonathanVanHamme commented on August 20, 2024

Hey Jon, Sorry for the silence, I got busy with other tasks & now I'm working on recreating the issues I was having. It looks like the $base.utax.txt file is not being generated, while the $base.usearch.txt and $base.sintax.txt files are (I have attached them).

clusterINSECTS.sintax.txt
clusterINSECTS.usearch.txt

bsmoda avatar bsmoda commented on August 20, 2024

Hello there, I'm dealing with this same error, but reverting to Python 2.7 didn't work. Do you know what might be the problem?

Thanks

Error with python 3.6:
(amptk) [bruno.moda@node06 test_data]$ amptk illumina -i illumina_test_data/ -o miseq -f fITSt -r ITS4
/mnt/data2/lbcb/conda/envs/amptk/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/mnt/data2/lbcb/conda/envs/amptk/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

[03:59:24 PM]: OS: CentOS 6.5, 24 cores, ~ 66 GB RAM. Python: 3.6.5
Traceback (most recent call last):
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/bin/amptk-process_illumina_folder.py", line 203, in <module>
    amptklib.SystemInfo()
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 210, in SystemInfo
    mod_versions()
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 225, in mod_versions
    if not gvc(vers, '1.2.1'):
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 134, in gvc
    if versiontuple(input) >= versiontuple(check):
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 131, in versiontuple
    return tuple(map(int, (v.split("."))))
ValueError: invalid literal for int() with base 10: 'post1'

Error after reverting to python 2.7:
(amptk) [bruno.moda@node06 test_data]$ amptk illumina -i illumina_test_data -o miseq -f fITS7 -r ITS4

[04:28:02 PM]: OS: linux2, 24 cores, ~ 66 GB RAM. Python: 2.7.15
Traceback (most recent call last):
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/bin/amptk-process_illumina_folder.py", line 203, in <module>
    amptklib.SystemInfo()
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 210, in SystemInfo
    mod_versions()
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 225, in mod_versions
    if not gvc(vers, '1.2.1'):
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 134, in gvc
    if versiontuple(input) >= versiontuple(check):
  File "/mnt/data2/lbcb/conda/envs/amptk/opt/amptk-1.2.4/lib/amptklib.py", line 131, in versiontuple
    return tuple(map(int, (v.split("."))))
ValueError: invalid literal for int() with base 10: 'post1'

nextgenusfs avatar nextgenusfs commented on August 20, 2024

I think this was fixed in v1.2.5.

bsmoda avatar bsmoda commented on August 20, 2024

How can I update to this version (v1.2.5) in the conda env? The most recent version of the amptk software in conda is v.1.2.4

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Yes ... I’m aware. Basically I have to rewrite/package AMPtk to be compatible with the updates to bioconda. You can use homebrew as an alternative. You simply clone the repo and use the dependencies from conda.

bsmoda avatar bsmoda commented on August 20, 2024

I'm using Linux in the lab and I can't use homebrew. I tried cloning the repo from git and added its bin directory to my PATH. Then I ran the tutorial commands within the amptk conda env, but I got an error that usearch9 is missing, and indeed there is no usearch9 in the bin folder of the conda env. I tried changing --merge_method to vsearch, but I guess the problem is with the -u parameter, which has usearch9 as its default and offers no other option.

nextgenusfs avatar nextgenusfs commented on August 20, 2024

You have to download the usearch binary manually for any install. Make it executable and then softlink it into your path to usearch9. That part isn’t any different for conda than any other install.
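
The two steps described above (make the binary executable, then link it into a directory on your PATH under the name usearch9) can be sketched as follows. The function name link_usearch and both path arguments are placeholders for illustration; the downloaded file name depends on the build you obtained.

```python
import os
import stat


def link_usearch(binary_path, bin_dir, link_name="usearch9"):
    """Make the downloaded binary executable and symlink it into bin_dir."""
    mode = os.stat(binary_path).st_mode
    # Add execute permission for user, group and others
    os.chmod(binary_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    link_path = os.path.join(bin_dir, link_name)
    if os.path.lexists(link_path):
        os.remove(link_path)  # replace a stale link from an earlier attempt
    os.symlink(os.path.abspath(binary_path), link_path)
    return link_path
```

With a conda install, bin_dir would usually be the bin folder of the amptk environment, so that usearch9 is found whenever the environment is active.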

bsmoda avatar bsmoda commented on August 20, 2024

I've downloaded the usearch and was able to run the first step of the pipeline. Thank you!
I got a different result from the first step when comparing with the tutorial:

(amptk) [bruno.moda@node06 test_data]$ amptk illumina -i illumina_test_data -o miseq -f fITS7 -r ITS4

[11:57:01 AM]: OS: linux2, 24 cores, ~ 66 GB RAM. Python: 2.7.15
[11:57:02 AM]: AMPtk v1.2.5-730aa90, USEARCH v9.2.64, VSEARCH v2.10.4
[11:57:02 AM]: Gzipped files detected, uncompressing
[11:57:03 AM]: fITS7 fwd primer found in AMPtk primer db, setting to: GTGARTCATCGAATCTTTG
[11:57:03 AM]: ITS4 rev primer found in AMPtk primer db, setting to: TCCTCCGCTTATTGATATGC
[11:57:03 AM]: Demuxing data using 24 cpus
[11:57:03 AM]: Dropping reads less than 100 bp and setting lossless trimming to 300 bp.
[11:57:03 AM]: Strip Primers and Merge PE reads. FwdPrimer: GTGARTCATCGAATCTTTG RevPrimer: TCCTCCGCTTATTGATATGC

[11:57:04 AM]: Concatenating Demuxed Files
[11:57:04 AM]: 600 total reads
[11:57:04 AM]: 600 Fwd Primer found, 599 Rev Primer found
[11:57:04 AM]: 0 discarded Primer incompatibility
[11:57:04 AM]: 0 discarded too short (< 100 bp)
[11:57:04 AM]: 599 valid output reads
[11:57:04 AM]: Found 3 barcoded samples
Sample: Count
spike: 400
301-2: 100
301-1: 99
[11:57:04 AM]: Output file: miseq.demux.fq.gz (53.0 KB)
[11:57:04 AM]: Mapping file: miseq.mapping_file.txt

Example of next cmd: amptk cluster -i miseq.demux.fq.gz -o out

More reverse primers were found (599) than in the tutorial, and there were 599 valid output reads. Maybe that is because I'm using the test data from version 1.2.5.
But when I ran the clustering step, I got this error:

(amptk) [bruno.moda@node06 test_data]$ amptk cluster -i miseq.demux.fq.gz -o miseq/

[12:08:05 PM]: OS: linux2, 24 cores, ~ 66 GB RAM. Python: 2.7.15
[12:08:06 PM]: AMPtk v1.2.5-730aa90, USEARCH v9.2.64, VSEARCH v2.10.4
[12:08:06 PM]: Loading FASTQ Records
Traceback (most recent call last):
  File "/mnt/data2/lbcb/projects/Bruno.Moda/metagenomics/amptk_git/bin/amptk-OTU_cluster.py", line 93, in <module>
    orig_total = amptklib.countfasta(orig_fasta)
  File "/mnt/data2/lbcb/projects/Bruno.Moda/metagenomics/amptk_git/lib/amptklib.py", line 345, in countfasta
    with open(input, 'rU') as f:
IOError: [Errno 2] No such file or directory: u'miseq/_tmp/miseq/.orig.fa'

Could the fact that I'm using the conda env of v1.2.4 with the repo from git (v1.2.5) be related to this error?
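
One thing worth ruling out first: the path in the IOError, miseq/_tmp/miseq/.orig.fa, is what you would get if the output name miseq/ (with its trailing slash) were used as a plain string prefix. As noted at the top of this thread, a slash in the output name is treated as a directory. The sketch below reproduces that path under the assumption that paths are built by string concatenation. That is a guess about the cause, not something confirmed from the AMPtk source, and both function names are made up.

```python
import posixpath


def tmp_fasta_path(base):
    """Build a temp-file path by string concatenation (hypothetical)."""
    tmp_dir = base + "_tmp"
    return posixpath.join(tmp_dir, base + ".orig.fa")


def normalized_base(base):
    """Drop trailing slashes so the base is a file prefix, not a directory."""
    return base.rstrip("/")


print(tmp_fasta_path("miseq/"))                   # miseq/_tmp/miseq/.orig.fa
print(tmp_fasta_path(normalized_base("miseq/")))  # miseq_tmp/miseq.orig.fa
```

If that is what happens, rerunning the command with -o miseq (no trailing slash) would be a cheap test.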

nextgenusfs avatar nextgenusfs commented on August 20, 2024

Database creation has been updated in version 1.4.0; please see the updated docs. Open a new issue if this problem persists or you run into a different one.
