stringmlst's Introduction

stringMLST

Fast k-mer based tool for multi locus sequence typing (MLST)

stringMLST is a tool for detecting the MLST of an isolate directly from genome sequencing reads. It predicts the ST of an isolate in a completely assembly- and alignment-free manner. The tool is designed in a lightweight, platform-independent fashion with minimal dependencies.

Some portions of the allele selection algorithm in stringMLST are patent pending. Please refer to the PATENTS file for additional information regarding licensing and use.

Reference http://jordan.biology.gatech.edu/page/software/stringmlst/

Abstract http://bioinformatics.oxfordjournals.org/content/early/2016/09/06/bioinformatics.btw586.short?rss=1

Application Note http://bioinformatics.oxfordjournals.org/content/early/2016/09/06/bioinformatics.btw586.full.pdf+html


stringMLST is a tool, not a database; always use the most up-to-date database files available. To keep your databases current, stringMLST can download and build databases from pubMLST using the most recent allele and profile definitions. Please see the "Included databases and automated retrieval of databases from pubMLST" section below for instructions. The databases bundled here are for convenience only; do not rely on them being up-to-date.

stringMLST is licensed and distributed under CC Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). It is free for academic users; any commercial use of any version of this code/algorithm requires prior permission. If you are a commercial user, please contact [email protected] for permissions.

Recommended installation method

pip install stringMLST

Installation via git (Not recommended for most users)

git clone https://github.com/jordanlab/stringMLST
# Optional, download prebuilt databases
# We don't recommend this method; instead, build the databases locally
cd stringMLST
git submodule init
git submodule update

Quickstart guide

pip install stringMLST
mkdir -p stringMLST_analysis; cd stringMLST_analysis
stringMLST.py --getMLST -P neisseria/nmb --species neisseria
# Download all available databases with:
# stringMLST.py --getMLST -P mlst_dbs --species all
wget  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_2.fastq.gz
stringMLST.py --predict -P neisseria/nmb -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz
Sample  abcZ    adk     aroE    fumC    gdh     pdhC    pgm     ST
ERR026529       231     180     306     612     269     277     260     10174

Python dependencies and external programs

stringMLST does not require any python dependencies for basic usage (Building databases and predicting STs).

For advanced use (genome coverage), stringMLST depends on the pyfaidx python module and on bedtools, bwa, and samtools. See the coverage section for more information.

stringMLST has been tested with:

pyfaidx: 0.4.8.1
samtools: 1.3 (Using htslib 1.3.1)  [Requires the 1.x branch of samtools]
bedtools: v2.24.0
bwa: 0.7.13-r1126

To install the dependencies

# pyfaidx
pip install --user pyfaidx
# samtools
wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2 -O samtools-1.3.1.tar.bz2
tar xf samtools-1.3.1.tar.bz2
cd samtools-1.3.1
make
make prefix=$HOME install
# bedtools
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
cd bedtools2; make
cp ./bin/* ~/bin
# bwa
git clone https://github.com/lh3/bwa.git
cd bwa; make
cp bwa ~/bin/bwa
export PATH=$PATH:$HOME/bin
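
Before using coverage mode, it can help to confirm that these external programs are actually discoverable; a minimal check, assuming the tools were installed as above:

# Verify that the coverage-mode dependencies are on PATH
for tool in samtools bedtools bwa; do
    command -v "$tool" >/dev/null 2>&1 && echo "$tool: found" || echo "$tool: NOT FOUND"
done
# Verify that the pyfaidx module imports cleanly
python -c "import pyfaidx; print('pyfaidx OK')"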

Usage for Example Read Files (Neisseria meningitidis)

  • Download stringMLST.py, example read files (ERR026529, ERR027250, ERR036104) and the dataset for Neisseria meningitidis (Neisseria_spp.zip).

# Add dir to path
export PATH=$PATH:$PWD
# Will connect to EBI's SRA servers
download_example_reads.sh

Build database:

  • Extract the MLST loci dataset.
unzip datasets/Neisseria_spp.zip -d datasets
  • Create or use a config file specifying the location of all the locus and profile files. Example config file (Neisseria_spp/config.txt):
[loci]
abcZ  datasets/Neisseria_spp/abcZ.fa
adk datasets/Neisseria_spp/adk.fa
aroE  datasets/Neisseria_spp/aroE.fa
fumC  datasets/Neisseria_spp/fumC.fa
gdh datasets/Neisseria_spp/gdh.fa
pdhC  datasets/Neisseria_spp/pdhC.fa
pgm datasets/Neisseria_spp/pgm.fa
[profile]
profile datasets/Neisseria_spp/neisseria.txt
  • Run stringMLST.py --buildDB to create DB. Choose a k value and prefix (optional).
stringMLST.py --buildDB -c datasets/Neisseria_spp/config.txt -k 35 -P NM

Predict:

Single sample :

stringMLST.py --predict -1 tests/fastqs/ERR026529_1.fastq -2 tests/fastqs/ERR026529_2.fastq -k 35 -P NM

Batch mode (all the samples together):

stringMLST.py --predict -d ./tests/fastqs/ -k 35 -P NM

List mode:

Create a list file (list_paired.txt) as :

tests/fastqs/ERR026529_1.fastq  tests/fastqs/ERR026529_2.fastq
tests/fastqs/ERR027250_1.fastq  tests/fastqs/ERR027250_2.fastq
tests/fastqs/ERR036104_1.fastq  tests/fastqs/ERR036104_2.fastq

Run the tool as:

stringMLST.py --predict -l list_paired.txt -k 35 -P NM

Working with gzipped files

stringMLST.py --predict -1 tests/fastqs/ERR026529_1.fq.gz -2 tests/fastqs/ERR026529_2.fq.gz -p -P NM -k 35 -o ST_NM.txt

Usage Documentation

stringMLST's workflow is divided into two routines:

  • Database building and
  • ST discovery

Database building: Builds the stringMLST database which is used for assigning STs to input sample files. This step is required once for each organism. Please note that stringMLST is capable of working with a custom user-defined typing scheme, but its efficiency has not been tested on other typing schemes.

ST discovery: This routine takes the database created in the last step and predicts the ST of the input sample(s). Please note that the database building is required prior to this routine. stringMLST is capable of processing single-end and paired-end files. It can run in three modes:

  • Single sample mode - for running stringMLST on a single sample
  • Batch mode - for running stringMLST on all the FASTQ files present in a directory
  • List mode - for running stringMLST on all the FASTQ files provided in a list file
Readme for stringMLST
=============================================================================================
Usage
./stringMLST.py
[--buildDB]
[--predict]
[-1 filename_fastq1][--fastq1 filename_fastq1]
[-2 filename_fastq2][--fastq2 filename_fastq2]
[-d directory][--dir directory][--directory directory]
[-l list_file][--list list_file]
[-p][--paired]
[-s][--single]
[-c][--config]
[-P][--prefix]
[-z][--fuzzy]
[-a]
[-C][--coverage]
[-k]
[-o output_filename][--output output_filename]
[-x][--overwrite]
[-t]
[-r]
[-v]
[-h][--help]
==============================================================================================

There are two steps to predicting ST using stringMLST.
1. Create DB : stringMLST.py --buildDB
2. Predict : stringMLST --predict

1. stringMLST.py --buildDB

Synopsis:
stringMLST.py --buildDB -c <config file> -k <kmer length(optional)> -P <DB prefix(optional)>
  config file : is a tab delimited file which has the information for the typing scheme, i.e. the loci, their multifasta files and the profile definition file.
    Format :
      [loci]
      locus1    locusFile1
      locus2    locusFile2
      [profile]
      profile   profileFile
  kmer length : is the kmer length for the db. Note, while processing this should be smaller than the read length.
    We suggest kmer lengths between 35 and 66, depending on the read length.
  DB prefix(optional) : holds the information for DB files to be created and their location. This module creates 3 files with this prefix.
    You can use a folder structure with prefix to store your db at particular location.

Required arguments
--buildDB
  Identifier for build db module
-c,--config = <configuration file>
  Config file in the format described above.
  All the files follow the structure followed by pubmlst. Refer extended document for details.

Optional arguments
-k = <kmer length>
  Kmer size for which the db has to be formed(Default k = 35). Note the tool works best with kmer length in between 35 and 66
  for read lengths of 55 to 150 bp. Kmer size can be increased accordingly. It is advised to keep lower kmer sizes
  if the quality of reads is not very good.
-P,--prefix = <prefix>
  Prefix for db and log files to be created (Default = kmer). You can also specify the folder where you want the db to be created.
-a
        File location to write build log
-h,--help
  Prints the help manual for this application

 --------------------------------------------------------------------------------------------

2. stringMLST.py --predict

stringMLST.py --predict : can run in three modes
  1) single sample (default mode)
  2) batch mode : run stringMLST for all the samples in a folder (for a particular species)
  3) list mode : run stringMLST on samples specified in a file
stringMLST can process both single and paired end files. By default the program expects paired end files.

Synopsis
stringMLST.py --predict -1 <fastq file> -2 <fastq file> -d <directory location> -l <list file> -p -s -P <DB prefix(optional)> -k <kmer length(optional)> -o <output file> -x

Required arguments
--predict
  Identifier for predict module

Optional arguments
-1,--fastq1 = <fastq1_filename>
  Path to first fastq file for paired end sample and path to the fastq file for single end file.
  Should have extension fastq or fq.
-2,--fastq2 = <fastq2_filename>
  Path to second fastq file for paired end sample.
  Should have extension fastq or fq.
-d,--dir,--directory = <directory>
  BATCH MODE : Location of all the samples for batch mode.
-C,--coverage
  Calculate sequence coverage for each allele. Turns on read generation (-r) and turns off fuzzy (-z 1)
  Requires bwa, bedtools and samtools to be in your path
-k = <kmer_length>
  Kmer length for which the db was created(Default k = 35). Could be verified by looking at the name of the db file.
  Could be used if the reads are of very bad quality or have a lot of N's.
-l,--list = <list_file>
  LIST MODE : Location of list file and flag for list mode.
  list file should have full file paths for all the samples/files.
  Each sample takes one line. For paired end samples the 2 files should be tab separated on single line.
-o,--output = <output_filename>
  Prints the output to a file instead of stdio.
-p,--paired
  Flag for specifying paired end files. This is the default, so behaviour is the same whether or not it is specified, in all modes.
  For batch mode the paired end samples should be differentiated by _1/_2.fastq or _1/_2.fq
-P,--prefix = <prefix>
  Prefix using which the db was created (Default = kmer). The location of the db can also be provided.
-r
  A separate reads file is created which has all the reads covering all the loci.
-s,--single
  Flag for specifying single end files.
-t
  Time for each analysis will also be reported.
-v
  Prints the version of the software.
-x,--overwrite
  By default stringMLST appends the results to the output_filename if same name is used.
  This argument overwrites the previously specified output file.
-z,--fuzzy = <fuzzy threshold int>
  Threshold for reporting a fuzzy match (Default=300). For higher coverage reads this threshold should be set higher to avoid
  indicating a fuzzy match when an exact match was more likely. For lower coverage reads, a threshold of <100 is recommended
-h,--help
  Prints the help manual for this application

 --------------------------------------------------------------------------------------------

3. stringMLST.py --getMLST

Synopsis:
stringMLST.py --getMLST --species= <species> [-k kmer length] [-P DB prefix]

Required arguments
--getMLST
    Identifier for getMLST module
--species= <species name>
    Species name from the pubMLST schemes (use "--species show" to get list of available schemes)
    "all" will download and build all

Optional arguments
-k = <kmer length>
    Kmer size for which the db has to be formed(Default k = 35). Note the tool works best with kmer length in between 35 and 66
    for read lengths of 55 to 150 bp. Kmer size can be increased accordingly. It is advised to keep lower kmer sizes
    if the quality of reads is not very good.
-P,--prefix = <prefix>
    Prefix for db and log files to be created (Default = kmer). You can also specify the folder where you want the db to be created.
    We recommend that the prefix and config point to the same folder for cleanliness, but this is not required
--schemes
    Display the list of available schemes
-h,--help
  Prints the help manual for this application

stringMLST expects paired end reads to be in Illumina naming convention, minimally ending with _1.fq and _2.fq to delineate read1 and read2:

Periods (.) are disallowed delimiters except for file extensions

Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

For example, the following is a valid FASTQ file name:

NA10831_ATCACG_L002_R1_001.fastq.gz
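
Because periods are only allowed in the extension, file names that use dots as delimiters should be renamed before running stringMLST; a minimal sketch with hypothetical file names:

# Replace period delimiters with underscores so only the extension contains dots
mv sample.A.run1_1.fastq.gz sample_A_run1_1.fastq.gz
mv sample.A.run1_2.fastq.gz sample_A_run1_2.fastq.gz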

Running stringMLST

Included databases and automated retrieval of databases from pubMLST

stringMLST includes all the pubMLST databases as of February 15, 2017, built with the default kmer (35). They can be found in the datasets/ folder. Simply unzip the databases you need and begin using stringMLST as described below.

All the databases from pubMLST can be downloaded and prepared with your choice of kmer:

Getting all pubMLST schemes

stringMLST.py --getMLST -P datasets/ --species all

Individual databases from pubMLST can also be downloaded as needed, using the scheme identifiers

Downloading a scheme

# List available schemes
stringMLST.py --getMLST --schemes

# Download the Neisseria spp. scheme

stringMLST.py --getMLST -P datasets/nmb --species Neisseria

Database Preparation

In order to create the database, files can be downloaded from the database page.

If the organism of interest is not present in the provided link, the required files can be downloaded from PubMLST as follows:

  • On your browser, navigate to http://pubmlst.org/
  • Navigate to "Download MLST definitions" link or go to http://pubmlst.org/data/
  • Scroll to the species of interest. For each species, the user will find a typing definition file and multi-FASTA files for each locus. Download these files.

E.g.:

Species of interest: Neisseria spp.
Corresponding definition file: http://pubmlst.org/data/profiles/neisseria.txt
Corresponding multi-FASTA locus files:
http://pubmlst.org/data/alleles/neisseria/abcZ.tfa
http://pubmlst.org/data/alleles/neisseria/adk.tfa
http://pubmlst.org/data/alleles/neisseria/aroE.tfa
http://pubmlst.org/data/alleles/neisseria/fumC.tfa
http://pubmlst.org/data/alleles/neisseria/gdh.tfa
http://pubmlst.org/data/alleles/neisseria/pdhC.tfa
http://pubmlst.org/data/alleles/neisseria/pgm.tfa

Download these files at a desired location.

Custom user files can also be used for building database. The database building routine requires the profile definition file and allele sequence file. The profile definition file is a tab separated file that contains the ST and the allele profile corresponding to the ST. An example of the profile definition file is shown below:

ST  abcZ  adk aroE  fumC  gdh pdhC  pgm clonal_complex
1 1 3 1 1 1 1 3 ST-1 complex/subgroup I/II
2 1 3 4 7 1 1 3 ST-1 complex/subgroup I/II
3 1 3 1 1 1 23  13  ST-1 complex/subgroup I/II
4 1 3 3 1 4 2 3 ST-4 complex/subgroup IV

The allele sequence file is a standard multi-FASTA with the description being the loci name with the allele number. An example abcZ allele sequence is shown below:

>abcZ_1
TTTGATACTGTTGCCGA...
>abcZ_2
TTTGATACCGTTGCCGA...
>abcZ_3
TTTGATACCGTTGCGAA...
>abcZ_4
TTTGATACCGTTGCCAA...

These files can be obtained from PubMLST/BIGSdb or can be created by the user.

In either case, an accompanying configuration file is also required to describe the profile definition and allele sequence files. An example configuration file is shown below:

[loci]
abcZ  /data/home/stringMLST/pubmlst/Neisseria_sp/abcZ.fa
adk /data/home/stringMLST/pubmlst/Neisseria_sp/adk.fa
aroE  /data/home/stringMLST/pubmlst/Neisseria_sp/aroE.fa
fumC  /data/home/stringMLST/pubmlst/Neisseria_sp/fumC.fa
gdh /data/home/stringMLST/pubmlst/Neisseria_sp/gdh.fa
pdhC  /data/home/stringMLST/pubmlst/Neisseria_sp/pdhC.fa
pgm /data/home/stringMLST/pubmlst/Neisseria_sp/pgm.fa

[profile]
profile /data/home/stringMLST/pubmlst/Neisseria_sp/neisseria.txt

This file is pre-packaged on stringMLST's website and can easily be created by the user for a custom database.
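
A config file like the one above can also be generated from a directory of allele FASTA files rather than typed by hand; a minimal sketch, assuming the allele files end in .fa and reusing the profile path from the example above (adjust the paths to your own scheme):

# Generate a stringMLST config file from a directory of allele FASTAs
DB_DIR=/data/home/stringMLST/pubmlst/Neisseria_sp
{
  echo "[loci]"
  for fa in "$DB_DIR"/*.fa; do
      printf '%s\t%s\n' "$(basename "$fa" .fa)" "$fa"
  done
  echo "[profile]"
  printf 'profile\t%s\n' "$DB_DIR/neisseria.txt"
} > config.txt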

Database Building

The next step in database building is running the buildDB module to create the database files. The buildDB module requires the user to specify the config file. The default k-mer size is 35 but can be changed using the -k option. Specifying a prefix for the created database files is optional but recommended.

The choice of k-mer depends on the length of the sequencing reads. In general, the value of k can never be greater than the read length. The application has been tested on a number of read lengths ranging from 55 to 150 bp using k-mer sizes of 21 to 66. In our testing, the k-mer size did not affect prediction accuracy. A smaller k-mer size will increase the runtime and a larger k-mer size will increase the file size. The user should ideally pick a k-mer with a length around half of the average read length. For lower quality data, it is also advised to choose smaller k-mer values to reduce false hits.
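
Since the suggested k is roughly half the average read length, a quick way to estimate both from an uncompressed FASTQ file (a sketch, reusing one of the test files from this repository):

# Estimate the mean read length from the first 1000 reads and suggest k as roughly half of it
awk 'NR % 4 == 2 { total += length($0); n++ } n == 1000 { exit } END { if (n) printf "mean read length: %d, suggested k: %d\n", total/n, int(total/(2*n)) }' tests/fastqs/ERR026529_1.fastq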

stringMLST.py --buildDB --config <config file> -k  <k-mer length> -P <prefix>

Example:

stringMLST.py --buildDB --config config.txt -k 35 -P NM

This command will produce 3 database files and a log file. The log file is used for debugging purposes in the event an error is encountered. The 3 database files created are:

  • <prefix>_<k>.txt : The main database file for the application. This is a tab delimited file describing the k-mer to locus relationships.
  • <prefix>_weight.txt : Contains the weight factors for alleles which differ in length by more than 5%. Will be empty otherwise.
  • <prefix>_profile.txt : Profile definition file used for finding the ST from the predicted allelic profile.

For the example above, the following files will be created: NM_35.txt, NM_weight.txt and NM_profile.txt

Please note that in the prediction routine the database is identified with the prefix.

ST discovery routine

As discussed earlier, stringMLST has 3 running modes:

  • Single sample mode - for running stringMLST on a single sample
  • Batch mode - for running stringMLST on all the FASTQ files present in a directory
  • List mode - for running stringMLST on all the FASTQ files provided in a list file

Single sample mode:

This is the default mode for stringMLST and takes in one sample at a time. The sample can be single-end or paired-end. The sample has to be in FASTQ format. In order to run, the user should know the prefix of the database created and the k-mer size.

By default, the tool expects paired-end samples.

stringMLST.py --predict -1 <paired-end file 1> -2 <paired-end file 2> -p --prefix <prefix for the database> -k <k-mer size> -o <output file name>

For single-end samples:

stringMLST.py --predict -1 <single-end file> -s --prefix <prefix for the database> -k <k-mer size> -o <output file name>

Batch Mode:

This mode can be used for processing multiple files with one command. All the samples will be queried against the same database, and all samples should be in the same directory. All the samples will be treated either as single-end or paired-end. Paired-end samples should be differentiated with the suffixes _1 and _2 at the end of the file name (e.g.: sampleX_1.fastq and sampleX_2.fastq).
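
For example, a directory ready for paired-end batch mode might look like this (hypothetical sample names):

# Every sample contributes a _1 and a _2 FASTQ file in the same directory
$ ls fastq_dir/
sampleX_1.fastq  sampleX_2.fastq  sampleY_1.fastq  sampleY_2.fastq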

Paired-end samples:

stringMLST.py --predict -d <directory for samples> -p --prefix <prefix for the database> -k <k-mer size> -o <output file name>

Single-end samples:

stringMLST.py --predict -d <directory for samples> -s --prefix <prefix for the database> -k <k-mer size> -o <output file name>

List Mode:

This mode can be used if the user has samples at different locations or if the paired-end samples are not stored in the traditional way. All the samples will be queried against the same database. All the samples will be treated either as single-end or paired-end. This mode requires the user to provide a list file giving the locations of all the samples. Each line in the list file represents a new sample. A sample list file for single-end samples looks like the following.

<full path of sample 1 fastq file>
<full path of sample 2 fastq file>
<full path of sample 3 fastq file>
.
.
<full path of sample n fastq file>

A sample list file for paired-end samples looks like the following.

<full path of sample 1 fastq file 1>  <full path of sample 1 fastq file 2>
<full path of sample 2 fastq file 1>  <full path of sample 2 fastq file 2>
<full path of sample 3 fastq file 1>  <full path of sample 3 fastq file 2>
.
.
<full path of sample n fastq file 1>  <full path of sample n fastq file 2>
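
A paired-end list file can also be generated with standard shell tools; a minimal sketch, assuming the files end in _1.fastq and _2.fastq and every read 1 file has a matching read 2 file (the directory path is hypothetical):

# Build a tab-separated paired-end list file from a directory of FASTQs
for r1 in /full/path/to/fastqs/*_1.fastq; do
    printf '%s\t%s\n' "$r1" "${r1%_1.fastq}_2.fastq"
done > list_paired.txt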

Once the user has the list file, they can run the tool directly.

Paired-end samples:

stringMLST.py --predict -l <full path to list file> -p --prefix <prefix for the database> -k <k-mer size> -o <output file name>

Single-end samples:

stringMLST.py --predict -l <full path to list file > -s --prefix <prefix for the database> -k <k-mer size> -o <output file name>

Gene coverage and match confidence

stringMLST provides two complementary methods for determining confidence in an inferred ST: the -C|--coverage flag and the -z|--fuzzy threshold option.

stringMLST determines an allele based on its k-mer support; the more k-mers seen for allele 1, the more likely that allele 1 is the allele present in the genome. Unlike SRST2 and other mapping/BLAST based tools, stringMLST always infers an ST, using the maximally supported allele (the allele with the most k-mer hits). The difference between the maximum support (the reported allele) and the second-highest support (the next closest allele) can be informative for low coverage reads. The -z|--fuzzy threshold (Default = 300) assigns significance to the difference between supports. Much like SRST2 and Torsten Seemann's popular pubMLST script, stringMLST reports potentially new or closely supported alleles in allele* syntax. For high coverage reads, we suggest a fuzzy threshold >500. For low coverage reads, a fuzzy threshold of <50 is suggested.
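
One practical way to see how the threshold behaves on your data is to run the same sample at several -z values and compare which alleles are flagged with *; a minimal sketch reusing the quickstart files and database:

# Compare fuzzy-match reporting across several thresholds
for z in 50 300 1000; do
    echo "fuzzy threshold: $z"
    stringMLST.py --predict -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz -P neisseria/nmb -z "$z"
done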

Coverage mode requires bedtools, bwa, and samtools in your PATH and an additional python module, pyfaidx (see the dependencies section for installation information). Coverage mode by default disables the display of fuzzy alleles in favor of sequence coverage information generated by mapping the relevant reads to the putative allele sequence. In our testing, coverage mode slightly increases prediction time (<1 sec increase per sample).

Please note: stringMLST always infers the ST from the reads, fuzzy matches and/or <100% coverage do not necessarily mean a new allele has been found.

Getting gene coverage from reads

stringMLST.py --predict -1 <paired-end file 1> -2 <paired-end file 2> -p --prefix <prefix for the database> -k <k-mer size> -r -o <output file name> -c <path to config> -C

Changing the fuzziness of the search for low coverage reads

stringMLST.py --predict -1 <paired-end file 1> -2 <paired-end file 2> -p --prefix <prefix for the database> -k <k-mer size> -r -o <output file name> -z 50

Other Examples :

Reporting time along with the output.

stringMLST.py --predict -1 <paired-end file 1> -2 <paired-end file 2> -p --prefix <prefix for the database> -k <k-mer size> -t -o <output file name>

Getting reads file relevant to typing scheme.

stringMLST.py --predict -1 <paired-end file 1> -2 <paired-end file 2> -p --prefix <prefix for the database> -k <k-mer size> -r -o <output file name>


stringmlst's Issues

some samples came back with traceback error or empty result

Hi,

I just installed stringMLST through bioconda on my VM (Ubuntu 18.04 LTS on a Windows 7 host) and ran it on a bunch of samples using the E. coli MLST scheme 1. About 80% of the samples gave a ST and 3.5% returned 0, which was expected.

However, for the remaining 16.5%, some returned an empty list and some gave a traceback error message. All of these problem samples were downloaded from NCBI. I had no problem running them through QC inspection or reference sequence alignment (bowtie2/samtools, etc.), so I know the files were not corrupted. Nevertheless, I did notice that most of the samples that returned an empty list were submitted from one source and contained contaminating reads from another serotype. I didn't observe any pattern for the ones that gave me a traceback error, other than the fastq.gz files being on the small side (<50 MB each). I was wondering if there's an explanation, and better, a fix, for these samples.

Please see examples below (Ec is the prefix I gave E. coli MLST scheme 1 when I created DB):

ST = stringMLST.py --predict -1 DRR015930_1.fastq.gz -2 DRR015930_2.fastq.gz -P Ec
print(ST)

[]

ST = stringMLST.py --predict -1 ERR1777574_1.fastq.gz -2 ERR1777574_2.fastq.gz -P Ec
print(ST)

['Traceback (most recent call last):', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 1605, in ', ' results = singleSampleTool(fastq1, fastq2, paired, k, results)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 399, in singleSampleTool', ' singleFileTool(fastq1, k)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 452, in singleFileTool', ' fileExplorer(fastq, k, non_overlapping_window)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 468, in fileExplorer', ' lines = f.readlines()', ' File "/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 289, in read1', ' return self._buffer.read1(size)', ' File "/home/florathecat/anaconda3/lib/python3.6/_compression.py", line 68, in readinto', ' data = self.read(len(byte_view))', ' File "/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 482, in read', ' raise EOFError("Compressed file ended before the "', 'EOFError: Compressed file ended before the end-of-stream marker was reached']

Thanks for your time.

--getMLST fails due to SSL self-signed cert error

When I tried updating my mlst database using --getMLST, it failed with ssl errors. I was able to correct the problem by importing ssl and adding in a line to ignore the certs. The attached version of stringMLST.py contains the patched code. I've confirmed it works.

patch.zip

--getMLST command error: IndexError: child index out of range

Hi all,

We have a user running stringMLST v0.6.2 --getMLST command for neisseria spp. However, the --getMLST step seems to be pulling down an empty directory called neisseria_db:

[screenshot]

As you can see from the image: the --getMLST step produces an IndexError (IndexError: child index out of range) and also creates an empty directory called neisseria_db which doesn't have the nmb subdirectory (which is mentioned in the quick start guide). Any advice on how we can proceed with this or if we are doing something wrong?

Best,
Nishant Gerald

No k-mer matches were found for the sample ERR026529....

Hi,
I am trying to get stringMLST to work, and I am following the Quickstart guide.
Using this command line (after pip installing, downloading database and data):
stringMLST.py --predict -P neisseria/nmb -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz
I get this result:

No k-mer matches were found for the sample ERR026529_1.fastq.gz and ERR026529_2.fastq.gz. Probable cause of the error: low quality data/too many N's in the data
Sample ST
ERR026529 ST

I checked both data and database, that seems to be alright.
OS: Ubuntu 14.04
Python: Python 3.6.0 |Anaconda 4.3.1 (64-bit)

Cut a 1.0.0 release

It's probably time to release 1.0. stringMLST has been feature complete and stable for quite some time now

BuildDB will incorrectly parse ST profile data

Created a database for Listeria monocytogenes from http://bigsdb.pasteur.fr/listeria/ and it finished correctly. However, when running the prediction, one allele was missing from the results and no ST was found.

Looks like the issue is with this section of code: https://github.com/anujg1991/stringMLST/blob/master/stringMLST.py#L438

The script always assumes there is only ONE extra column at the end of the ST profile. In the Listeria case, there are two extra columns, "CC" and "lineage".
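
A possible user-side workaround until this is fixed is to strip the extra columns from the profile before building the database; a sketch, assuming a tab-delimited profile whose first 8 columns are the ST plus the 7 loci (the file name is hypothetical):

# Keep only the ST column and the 7 locus columns; drop trailing columns such as CC and lineage
cut -f1-8 listeria_profiles.txt > listeria_profiles.trimmed.txt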

Not compatible with python3

Currently some function calls are not compatible with python 3.5+

  • Automated porting via modernize
  • Check for regressions
  • Write test cases for porting script

'{speciesName}' does not exist on PubMLST

$ stringMLST.py --getMLST -P neisseria/nmb --species neisseria  
Traceback (most recent call last):
  File "/home/hadoop/anaconda2/bin/stringMLST.py", line 4, in <module>
    __import__('pkg_resources').run_script('stringMLST==0.6.1', 'stringMLST.py')
  File "/home/hadoop/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 664, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/hadoop/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1450, in run_script
    script_code = compile(script_text, script_filename, 'exec')
  File "/home/hadoop/anaconda2/lib/python2.7/site-packages/stringMLST-0.6.1-py2.7.egg/EGG-INFO/scripts/stringMLST.py", line 275
    print(f"This usually means the provided species, '{speciesName}', does not exist on PubMLST")

Formatting of the locus files ("Allele name in locus file should be seperated by '_' or '-'")

Sorry, need to ask again something.
When I try to build the DB, I get the following:

Info: Making DB for k =  35
Info: Making DB with prefix = /exports/mm-hpc/bacteriologie/bastian/data/Cdif/cgmlst/stringMLST_db/cdif_cgmlst
Info: Log file written to  /exports/mm-hpc/bacteriologie/bastian/data/Cdif/cgmlst/stringMLST_db/cdif_cgmlst.log
Error : Allele name in locus file should be seperated by '_' or '-'

I'm unsure to what it refers to.
My fasta files with the different loci are called e.g. CD630_00010.fasta , but the names of the fasta entries are only integers (>1, >2, >3). Do I need to change these fasta headers?

In the profile file, does it then still need only the integers, even if I rename all the headers to e.g. >CD630_00010_1, >CD630_00010_2 etc?

Am I guessing correct here?

Thanks,
Bastian

0 versus empty

Hi!
I ran stringMLST on the sample from EBI SRA database ERR024377 (S. enterica):

stringMLST.py
--predict
-1 ERR024377_1.fastq.gz
-2 ERR024377_2.fastq.gz
-k 35
--prefix ERR024377
--output res.txt

and the res.txt file contains the following:

Sample aroC dnaN purE ST
ERR024377 345 342 0 0

Since there are 7 loci in total (aroC, dnaN, hemD, hisD, purE, sucA, thrA), what is the difference between purE, which got a 0, and hemD, hisD, sucA, thrA, which got an empty value?

Bests,
-A

Memory usage balloons when building rMLST kmer-database

Hi,

Thanks for your work on this tool. I'm interested in pairing it with rMLST like you describe in the original paper, so I obtained the database and have been trying to build a kmer database for classifying samples. However, the memory required seems to be far above 128 GB, and I've killed the build script before the OOM killer activates.

Any advice on building the DB? I suppose I could do it on AWS if needed, or construct the DB manually using jellyfish.

[Breaking change] Make options and syntax more obvious

Make program flags more obvious:

-P/--prefix is awkward for setting the database. Move to -db/--database

-x Replaced with -f/--force, standard syntax for forcibly running/overwriting

Update help statements to be module specific

Optional kmer depth coverage output

Output the k-mer depth for each allele profiled. This can be useful for determining why stringMLST fails to type something -- e.g. too low or no coverage -- and for non-MLST screens (such as AMR).

Different alleles predicted despite mapping showing they are the same

Hi everyone,

I have some issue, which is... confusing.
I used stringMLST before, successfully, with a cgMLST scheme, and the results seemed to make sense.
For another project, I also ran some data with stringMLST through the same scheme. The results also kinda make sense, but not quite. Basically, we had a few strains which we are really, really certain are clonal. The result of stringMLST is that they are all slightly above the accepted threshold for clonality (which is probably not valid, but that's not the point right now).
So I had a look at the diverging allele predictions. I mapped the reads with bowtie2 to all first alleles of the cgMLST (which is like 2200), and inspected the mapping of the diverging alleles, manually, in a genome viewer, just to be certain.
And the diverging alleles are the same between at least 2 of the samples. stringMLST predicts them different, but the mapping is the same.
The diverging allele predictions are not tremendously off (like 1 or 2 bp), but they are still wrong.
The affected alleles are affected over multiple samples, so this also seems to be systematic.

Any idea what could be the cause of this?

This affects version 0.6.2, in case this matters.

Regards,
Bastian

EDIT: The very samples I am taking about are ERR2232520 and ERR2232524 .
I used this scheme https://www.cgmlst.org/ncs/schema/3560802/ , and the affected alleles for these 2 samples are 03340, 04010, 08730, 08930, 10560, 13310, 16340 .
e.g. allele 08930 is predicted to be different over 5 samples, which we expect to be the same (different prediction in all 5 samples), so this is probably systematic.

Given 'random' ST type when no k-mers match any allele

When running the Strep. pneumoniae sample against the Listeria monocytogenes database to see what a failure looks like, I get the following result.

No k-mer matches were found for the sample . strain_R1_001.fastq and strain_R2_001.fastq. Probable cause of the error: low quality data/too many N's in the data
Sample ST Time
strain 344 15.88

Noticed in the code that the "No k-mer match" statement was followed by an exit statement, but it was commented out. I think reporting back to the user with N/A or no result would be best instead of just exiting. If someone is running in batch mode or combining individual results together, it's very useful to know which strain actually had no result. If they do not show up at all, there is no way to know whether the software failed or no k-mer was found.

Cannot reproduce * with -z parameter

Hi!
I tried to reproduce your script with different values of the fuzzy (10,100,1000) present in the closed issue "ar0ch commented on Oct 25, 2016":

for i in {10,100,1000}; do ~/stringMLST/stringMLST.py --predict -1 ./tests/fastqs/ERR026529_1.fastq -2 ./tests/fastqs/ERR026529_2.fastq -p -P ./tests/testdb -z $i -t; done
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180 306 612 269 277* 260 10174 11.21
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180* 306 612 269 277* 260 10174 12.00
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529_ 231 180* 306* 612 269 277* 260* 10174 12.05

This is my command with -z 1000 :

python3.6 stringMLST.py --predict -1 ./ERR026529_1.fastq.gz -2 ./ERR026529_2.fastq.gz -p -P ./neisseria/nmb -z 1000 -t
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529 231 180 306 612 269 277 260 10174 8.19

I don't see the "*" in adk, aroE pdhC and pgm.

I use the last version of stringMLST 0.5.1, bwa 0.7.16a-r1181, samtools 1-6, bedtools 2.26, pyfaidx is installed.

Am I missing something?
Thank you very much,
-A

ERROR while loading ST

Hello @ar0ch,

I am using stringMLST and get an ST assigned in the terminal, but the log shows several lines of "ERROR while loading ST".

Can anybody help?

python error with --coverage

In attempting to use '--coverage' I encounter the following python error, which I have been unable to correct. Admittedly, I am not a skilled python user.

Traceback (most recent call last):
File "/home/yrh8/Tools/stringMLST/stringMLST.py", line 1613, in
getCoverage(results)
File "/home/yrh8/Tools/stringMLST/stringMLST.py", line 790, in getCoverage
allele = gene+'_'+re.sub("*", "", str(results[sample][gene]))
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python2.7/re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat

running time

Hi Aroon,
Happy New Year.

I noticed a discrepancy between the time reported with "-t" option by stringMLST and the time measured by /usr/bin/time -v.

This is what is reported by stringMLST:
Sample abcZ adk aroE fumC gdh pdhC pgm ST Time
ERR026529 231 180* 306 612 269* 277* 260 10174 7.67

And this is reported by /usr/bin/time -v

Command being timed: "./stringMLST_iss36.py --predict -P neisseria/nmb -1 ERR026529_1.fastq -2 ERR026529_2.fastq -x -t"
User time (seconds): 19.86
System time (seconds): 1.82
Percent of CPU this job got: 89%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.28

Why are they different?

Bests,
-A

Proxy issue?

Hello,
while trying to run stringMLST on a machine behind a corporate proxy we get the following error message:

[xxx@xxx ~]$ stringMLST.py --getMLST -P datasets/ --species all
Using a kmer size of 35 for all databases.
Preparing: Achromobacter spp.
Traceback (most recent call last):
File "/usr/local/bin/stringMLST.py", line 1639, in
profileURL = get_links(key,schemes)
File "/usr/local/bin/stringMLST.py", line 263, in get_links
xml = urlopen(URL)
File "/usr/lib64/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/usr/lib64/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/usr/lib64/python2.7/urllib.py", line 359, in open_http
return self.http_error(url, fp, errcode, errmsg, headers)
File "/usr/lib64/python2.7/urllib.py", line 372, in http_error
result = method(url, fp, errcode, errmsg, headers)
File "/usr/lib64/python2.7/urllib.py", line 665, in http_error_301
return self.http_error_302(url, fp, errcode, errmsg, headers, data)
File "/usr/lib64/python2.7/urllib.py", line 635, in http_error_302
data)
File "/usr/lib64/python2.7/urllib.py", line 661, in redirect_internal
return self.open(newurl)
File "/usr/lib64/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/usr/lib64/python2.7/urllib.py", line 437, in open_https
h.endheaders(data)
File "/usr/lib64/python2.7/httplib.py", line 1013, in endheaders
self._send_output(message_body)
File "/usr/lib64/python2.7/httplib.py", line 864, in _send_output
self.send(msg)
File "/usr/lib64/python2.7/httplib.py", line 826, in send
self.connect()
File "/usr/lib64/python2.7/httplib.py", line 1236, in connect
server_hostname=sni_hostname)
File "/usr/lib64/python2.7/ssl.py", line 350, in wrap_socket
_context=self)
File "/usr/lib64/python2.7/ssl.py", line 611, in init
self.do_handshake()
File "/usr/lib64/python2.7/ssl.py", line 833, in do_handshake
self._sslobj.do_handshake()
IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:579)

All our openssl libs and libraries are installed.

Allele order not consistent when samples have different names

The order of the alleles can be different when using different strain names, e.g.:

Sample mdh gyrB recA fumC purAicd adk ST
strain1 1 1 2 3 4 5 5

Sample purAicd adk mdh gyrB recA fumC ST
strain2 4 5 4 1 1 2 3 5

Makes it difficult to merge the two individual runs together into a single report.

IndexError: child index out of range

If I use the next line:
stringMLST.py --getMLST -P Listeria --species Listeria monocytogenes

I get the following error:

Preparing: Listeria
Traceback (most recent call last):
File "/mnt/disk1/bin/miniconda3/bin/stringMLST.py", line 1562, in
profileURL, loci = get_links(dbRoot, filePrefix, species)
File "/mnt/disk1/bin/miniconda3/bin/stringMLST.py", line 268, in get_links
profileURL = child[1].text
IndexError: child index out of range

Git repo too big with example dataset

When downloading the master branch as a zip file, the total size is 160.98M which is HUGE for a code base. Perhaps replace example read files folder with a script that would download them separately instead.

It is taking over 15 minutes to clone the repo now.

How do you define an unknown allele in the profile file?

Hey there,

I'm attempting to run stringMLST for a cgMLST scheme.
I'm about to build everything to make it run, I have the allele fasta files, and will now create the profile file, where all the cgMLST types and the alleles are listed.
Since it's a cgMLST scheme, not all genes have an allele in all cases.
I have multiple profiles, where some alleles are put down as "?" in the definition at the cgMLST website.
Do I also put a ? in the profile file? Or something else?
Any advice :)?
(I'll try anyways with a ? in there)

Bastian

Cannot build db with lower case characters sequence in fasta file

Could not build the database because some of my sequences contain lower case characters. Got the following stack trace:

Info: Making DB for k = 35
Info: Making DB with prefix = LM
Traceback (most recent call last):
File "/share/apps/stringMLST/stringMLST.py", line 1079, in
makeCustomDB(config,k,dbPrefix)
File "/share/apps/stringMLST/stringMLST.py", line 671, in makeCustomDB
formKmerDB(configDict,k,output_filename)
File "/share/apps/stringMLST/stringMLST.py", line 613, in formKmerDB
string = key+'\t'+key1+'\t'+str(kmerDict[key][key1]).replace(" ","")+'\n'
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Looks like the reverseComplement function only works on upper case characters. Perhaps you should convert all input sequences to upper case before attempting the reverse complement.
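
In the meantime, a possible workaround is to uppercase the sequence lines of each allele FASTA before building; a sketch with hypothetical file names:

# Uppercase sequence lines (but not headers) in an allele FASTA
awk '/^>/ { print; next } { print toupper($0) }' locus.fa > locus.upper.fa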

Change of license

stringMLST will be relicensed into CC Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) on Friday, April 21 2017.

We feel CC BY-NC-SA 4.0 accurately reflects our original intent and that no additional restrictions or limitations are placed on users of stringMLST with this change. We hope this will clarify the licensing of stringMLST, and facilitate third-party packages and derivative works. stringMLST will remain Open Source Software and free for noncommercial use.

-z parameter says 500 is not an integer

I have been using stringMLST with a pubMLST database. However, it is missing new alleles that have been identified in the isolates using two other methods. I tried to adjust the -z parameter in case that would help, but I only got the response that the various numbers I tried - 100, 500, 400, 350 - were not integers.

Doc for output format from `stringMLST.py --predict`

Could you please document the output format of stringMLST.py --predict and how the numbers and fields relate to the database files.

For example what are the counts in this table:

Sample	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	ST
ERR026529	231	180	306	612	269	277	260	10174

Thanks,
Stephen

Enterococcus faecium issue

Dear stringMLST author,
we encountered an issue while testing your software on Enterococcus faecium. The error reported is:

./stringMLST.py --predict -1 ~/SRR980587/SRR980587_1.fastq -2 ~/SRR980587/SRR980587_2.fastq -k 21 -P ENT


Traceback (most recent call last):
  File "./stringMLST.py", line 1145, in <module>
    results = singleSampleTool(fastq1, fastq2, paired, k, results)
  File "./stringMLST.py", line 181, in singleSampleTool
    finalProfile = getMaxCount(weightedProfile, fileName)
  File "./stringMLST.py", line 334, in getMaxCount
    compare = int(re.sub("\*$", "", str(max_n[loc])))
ValueError: invalid literal for int() with base 10: '3799252.24074'

The sample SRR980587 used for testing has been downloaded from:
www.ebi.ac.uk/ena/data/view/SRR980587.
Enterococcus faecium data are not present in the datasets folder, so we added it by downloading allele files from PubMLST.

We created the config file following your README instructions:
[loci]
adk    datasets/Enterococcus_faecium/adk.fa
atpA    datasets/Enterococcus_faecium/atpA.fa
ddl    datasets/Enterococcus_faecium/ddl.fa
gdh    datasets/Enterococcus_faecium/gdh.fa
gyd    datasets/Enterococcus_faecium/gyd.fa
pstS    datasets/Enterococcus_faecium/pstS.fa
purK    datasets/Enterococcus_faecium/purK.fa
   
[profile]   
profile    datasets/Enterococcus_faecium/profile_Enterococcus_faecium.txt
