Coder Social home page Coder Social logo

cgMLST schemes? about sting HOT 21 OPEN

slvrshot avatar slvrshot commented on August 19, 2024
cgMLST schemes?

from sting.

Comments (21)

ar0ch avatar ar0ch commented on August 19, 2024 1

Got it. I'll try and take a look this evening

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

If you take a look at the indexer docs, they should look very familiar since you're a stringMLST user. If you have a scheme you've built with stringMLST, you can use the same input files for STing. The docs and examples focus around the 7 locus schemes since they tend to be easier to work with and faster to build. Did you have a particular scheme in mind?

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

If you take a look at the indexer docs, they should look very familiar since you're a stringMLST user. If you have a scheme you've built with stringMLST, you can use the same input files for STing. The docs and examples focus around the 7 locus schemes since they tend to be easier to work with and faster to build. Did you have a particular scheme in mind?

Thanks for the rapid response. The Neisseria cgMLST v1.0 cgc400 scheme on pubMLST (Harrison et al.). is what I am trying to work with. One problem I am having is downloading all of the allele sequences. pubMLST seems to limit the number I am able to download at once. The scheme has over 1.6K loci.

https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=batchSequenceQuery

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

It looks like the download links are in: https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=downloadAlleles&locus= format

So let's try something a little brutish to get everything.

First, head over to the export tab and just export the profile table for the STs (saved as profiles.txt): https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=job&id=BIGSdb_043821_1608874811_82059

Extract the header and clean it up:

head -n 1 profiles.txt | tr '\t' '\n' | tail -n +2 | head

Hopefully you see something like:
NEIS0001
NEIS0004
NEIS0005
NEIS0006
NEIS0007
NEIS0008
NEIS0009
NEIS0010
NEIS0011
NEIS0012

Download the alleles:

head -n 1 profiles.txt | tr '\t' '\n' | tail -n +2 | xargs -I nmb wget "https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=downloadAlleles&locus=nmb" -O nmb.fasta

That should get you all the files you need. I'm on my phone and haven't tested it though. Good luck!

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

Oh, and merry Christmas 🙂

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

Oh, and merry Christmas 🙂

I figured there was some bash script I could use to pull the allele sequences out! Thanks so much I will test this later today. And a very blessed and Merry Christmas to you too!!!

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

Oh, and merry Christmas 🙂

Hello again! That worked. However I'm having issues with creating the database. First I tried your demo but I get an error message:

db_util.py fetch --query "Neisseria spp." --out_dir my_dbs --build_index
Traceback (most recent call last):
  File "/apps/x86_64/miniconda3/lib/python3.8/urllib/request.py", line 1350, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/apps/x86_64/miniconda3/lib/python3.8/http/client.py", line 1240, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/apps/x86_64/miniconda3/lib/python3.8/http/client.py", line 1286, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/apps/x86_64/miniconda3/lib/python3.8/http/client.py", line 1235, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/apps/x86_64/miniconda3/lib/python3.8/http/client.py", line 1006, in _send_output
    self.send(msg)
  File "/apps/x86_64/miniconda3/lib/python3.8/http/client.py", line 946, in send
    self.connect()
  File "/apps/x86_64/miniconda3/lib/python3.8/http/client.py", line 1409, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/apps/x86_64/miniconda3/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/apps/x86_64/miniconda3/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/apps/x86_64/miniconda3/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1108)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apps/x86_64/STing/STing/db_util.py", line 272, in <module>
    main()
  File "/apps/x86_64/STing/STing/db_util.py", line 261, in main
    dbFileContent = get_db_file_content(dbFileUrl)
  File "/apps/x86_64/STing/STing/db_util.py", line 74, in get_db_file_content
    response = urllib.request.urlopen(fileUrl)
  File "/apps/x86_64/miniconda3/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/apps/x86_64/miniconda3/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/apps/x86_64/miniconda3/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/apps/x86_64/miniconda3/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/apps/x86_64/miniconda3/lib/python3.8/urllib/request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/apps/x86_64/miniconda3/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1108)>

I also tried building my own database file and I get a message stating that I'm missing the loci section. I'm not sure why that is as I followed your example database file format. I've uploaded my file.

Thanks!

test2.txt

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

The first set of errors looks like you're running python on a system with old/outdated (or completely missing) SSL certificates or that's sitting behind an SSL-stripping middleware. Running PYTHONHTTPSVERIFY=0 db_util.py fetch --query "Neisseria spp." --out_dir my_dbs --build_index should bypass that (though, obviously, not recommended as a general usage thing). You might need to install the certificates package (pip install --upgrade certifi and/or conda install -c anaconda certifi)

This config.txt works for me.

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

The first set of errors looks like you're running python on a system with old/outdated (or completely missing) SSL certificates or that's sitting behind an SSL-stripping middleware. Running PYTHONHTTPSVERIFY=0 db_util.py fetch --query "Neisseria spp." --out_dir my_dbs --build_index should bypass that (though, obviously, not recommended as a general usage thing). You might need to install the certificates package (pip install --upgrade certifi and/or conda install -c anaconda certifi)

This config.txt works for me.

Okay strangely...my tab delimited file does not seem to work...but it does work when I delete everything below [loci] in your file and then paste the relative paths to the fasta files in my directories. So I'm almost there. However now I get this error:

Total sequences loaded: 1287967
Loading the profiles file... Done!
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

If you'd prefer I open a new issue please let me know. Thanks!

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

That's just your machine running out of allocatable memory

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

That's just your machine running out of allocatable memory

Hmmm...okay let me try submitting this as a job.

Thanks!

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

That's just your machine running out of allocatable memory

After submitting this as a job on my HPC...its still creating a database for the 1605 loci in the cgMLST scheme I am working on. So far there are over 500K database files that have been created and it is still not done. I've just left it running, but I was curious if this is normal?

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

That sounds about right for a cg
scheme.

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

That sounds about right for a cg
scheme.

Happy New Year! Okay I'm almost there....Question I have my database in a folder called "Ng" containing 24K files for 4 alleles....the files look like this:

db.prof_idx.txt.3105   db.txt.923
db.prof_idx.txt.3106   db.txt.924
db.prof_idx.txt.3107   db.txt.925
db.prof_idx.txt.3108   db.txt.926
db.prof_idx.txt.3109   db.txt.927

I just want to verify I am running this correctly, particularly the database prefix portion:

typer -x Ng/db -1 test.sim_1.fastq -2 test.sim_2.fastq -s test -o test.tsv

I should mention that even with these 4 alleles typer seems to be taking a very long time even on the compute cluster. Thanks again!

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

That sounds about right for a cg
scheme.

Happy New Year! Okay I'm almost there....Question I have my database in a folder called "Ng" containing 24K files for 4 alleles....the files look like this:

db.prof_idx.txt.3105   db.txt.923
db.prof_idx.txt.3106   db.txt.924
db.prof_idx.txt.3107   db.txt.925
db.prof_idx.txt.3108   db.txt.926
db.prof_idx.txt.3109   db.txt.927

I just want to verify I am running this correctly, particularly the database prefix portion:

typer -x Ng/db -1 test.sim_1.fastq -2 test.sim_2.fastq -s test -o test.tsv

I should mention that even with these 4 alleles typer seems to be taking a very long time even on the compute cluster. Its been stuck on the below for over 2 hrs despite being "Done!" for allele calling after only 10 minutes. Thanks again!

Loading the input index...  Done!
Processing reads...    Done!
Getting the top 2 alleles of each locus...  Done!
Calling alleles based on k-mer frequencies...   Done!

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

Did you build the full profile and did it finish running? A ~1600 loci scheme shouldn't take more than 40-50mins on a 60-70x coverage sequencing run. Admittedly the machine we run things on has a fair amount of memory, but since you could build the database (which takes more memory) you shouldn't have any issues (it'll take ~14GB).

If your reads are a mixture of more than one species though, (1) you shouldn't run them in STing, since it is designed for single sample runs, and (2) the allele selection and refinement step will take a long time

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

I did build the full 1600 loci scheme. Admittedly, I do have 200+ WGS samples. If it takes an hour per sample that's essentially a week and a half of processing. I'll give it a try again.

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

@ar0ch Okay finally ran my samples and it was relatively fast to run all 200. I did notice the profiler didn't predict any STs despite identifying the majority of alleles for each of the 1600+ loci. Happy to share the data. Can I email you? Thanks!

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

Sure -- mail [@] aroonchande.com

from sting.

slvrshot avatar slvrshot commented on August 19, 2024

Sure -- mail [@] aroonchande.com

I messaged you. Did it go through? The first attempt bounced back. Thanks!

from sting.

ar0ch avatar ar0ch commented on August 19, 2024

(I haven't forgeten this issue, just busy with other things)

from sting.

Related Issues (15)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.