
ecoli_serotyping's People

Contributors

boothmanrylan, calarose, chadlaing, dev-ansung, dorbarker, jamez-eh, kbessonov1984, kevinkle

ecoli_serotyping's Issues

Multiple copies detected at the same locus

eci_2792_genome_sequence.zip
For the Stx1A gene (and possibly others; I haven't checked), the VF module reported two overlapping copies of the same gene at the same locus:

e.g.
Contig: https://www.github.com/superphy#00856f353e3710088ec7582e30ce8578bbeb43b1/contigs/lclECI-2792NODE_8_length_178059_cov_24.9142_ID_15
Copy1 start: 174082
Copy2 start: 174073

Only one copy should be reported for the Stx1A gene in this case (probably the longest).

I have supplied a genome file to reproduce this, although I would guess this occurs with any Stx1A-harboring genome.
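
A minimal sketch of the requested behaviour, not the VF module's actual code: hits are grouped by gene and contig, overlapping hits are treated as one locus, and only the longest hit per locus is kept. The `Hit` type and the function name are hypothetical, and the end coordinates in the usage note are made up because the report only gives start positions.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    gene: str
    contig: str
    start: int  # start coordinate on the contig
    end: int    # inclusive end coordinate on the contig


def collapse_overlapping_hits(hits):
    """Keep only the longest hit from each group of overlapping hits.

    Hits are grouped by (gene, contig); within a group, hits whose
    coordinates overlap are treated as a single locus.
    """
    grouped = defaultdict(list)
    for hit in hits:
        grouped[(hit.gene, hit.contig)].append(hit)

    kept = []
    for group in grouped.values():
        group.sort(key=lambda h: h.start)
        best = group[0]
        locus_end = best.end
        for hit in group[1:]:
            if hit.start <= locus_end:
                # Same locus: keep whichever hit is longer.
                if hit.end - hit.start > best.end - best.start:
                    best = hit
                locus_end = max(locus_end, hit.end)
            else:
                # A new, non-overlapping locus starts here.
                kept.append(best)
                best = hit
                locus_end = hit.end
        kept.append(best)
    return kept
```

With two Stx1A hits starting at 174082 and 174073 and sharing an (invented) end coordinate, only the hit starting at 174073 would survive, since it is the longer of the two.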

Paired-end reads

Hi,

I think a useful feature for ectyper would be support for paired-end reads. If I understand correctly, there is currently no paired-end support (the FASTQs need to be concatenated together). Using the paired-end information could improve the precision of the tool. Do you have any plans to implement that?

Thanks for this fast and useful tool!

Simone

Run getting halted in between with error code b

Hi, can you please help me resolve the following error?
2019-11-29 23:41:27,445 ectyper.speciesIdentification INFO GCF_001672015
2019-11-29 23:41:27,535 ectyper.subprocess_util ERROR Error in subprocess. The following command failed: ['grep', 'GCF_001672015', '/home/arya/miniconda3/lib/python3.7/site-packages/ectyper/Data/assembly_summary_refseq.txt']
2019-11-29 23:41:27,538 ectyper.subprocess_util ERROR Subprocess failed with error: b''
2019-11-29 23:41:27,539 ectyper.subprocess_util CRITICAL ectyper has stopped
subprocess failure
-Arya

Improve species identification reliability on edge cases or poor quality inputs

  • The RefSeq sketch is stale and needs to be updated more frequently to stay in sync with the NCBI RefSeq database.
  • In the species prediction module, implement a species prediction threshold via a minimum MASH score check for cases where all top hits have a MASH p-value of 1 (no certainty) because of poor-quality WGS input (e.g. a truncated library). In such cases, return species "-" instead of an erroneous prediction.
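
One way the proposed threshold could look is sketched below. This is only an illustration of the idea, not ectyper's implementation: the `MashHit` type, the function name, and both default cut-offs are assumptions chosen for the example.

```python
from typing import NamedTuple


class MashHit(NamedTuple):
    accession: str
    distance: float
    p_value: float
    shared_hashes: int  # numerator of MASH's "shared/total" hashes column


def call_species(hits, species_by_accession,
                 min_shared_hashes=100, max_p_value=0.05):
    """Return a species name, or "-" when no hit is trustworthy.

    Both thresholds are illustrative assumptions, not values taken
    from the ectyper source.
    """
    confident = [
        h for h in hits
        if h.shared_hashes >= min_shared_hashes and h.p_value <= max_p_value
    ]
    if not confident:
        # Every hit is uncertain (e.g. all p-values equal to 1).
        return "-"
    best = max(confident, key=lambda h: h.shared_hashes)
    return species_by_accession.get(best.accession, "-")
```

When every top hit has a p-value of 1, the filtered list is empty and the function returns "-" rather than naming the first hit in the list.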

Sequence not shown. --sequence flag is deprecated

2023-10-28 20:24:52,090 ectyper INFO Database structure QC is OK at /usr/local/lib/python3.8/site-packages/ectyper/Data/ectyper_alleles_db.json
2023-10-28 20:24:52,091 ectyper INFO Starting ectyper v1.0.0 running on allele database v1.0 (11-03-2020)
2023-10-28 20:24:52,091 ectyper INFO Output_directory is /content/output1
2023-10-28 20:24:52,091 ectyper INFO Command-line arguments Namespace(cores=1, dbpath=None, debug=False, input='U00095.3.fasta', output='/content/output1', percentCoverageHtype=50, percentCoverageOtype=90, percentIdentityHtype=95, percentIdentityOtype=90, refseq=None, sequence=True, verify=False)
2023-10-28 20:24:52,091 ectyper.speciesIdentification INFO RefSeq sketch (refseq.genomes.k21s1000.msh) and assembly meta data (assembly_summary_refseq.txt) is in good health and does not need to be downloaded
2023-10-28 20:24:52,092 ectyper INFO Gathering genome files
2023-10-28 20:24:52,092 ectyper.genomeFunctions INFO Using genomes in file U00095.3.fasta
2023-10-28 20:24:52,092 ectyper INFO Identifying genome file types
2023-10-28 20:24:52,246 ectyper.genomeFunctions INFO Folowing files were not found in the input:
2023-10-28 20:24:52,273 ectyper.genomeFunctions INFO Creating combined serotype and identification fasta file
2023-10-28 20:24:52,292 ectyper INFO Assembling final list of fasta files
2023-10-28 20:24:52,305 ectyper INFO Standardizing the E.coli genome headers based on file names
2023-10-28 20:25:02,847 ectyper.predictionFunctions INFO Predicting serotype from blast output
2023-10-28 20:25:02,925 ectyper.predictionFunctions INFO Serotype prediction completed
2023-10-28 20:25:02,932 ectyper INFO BLAST output file against reference alleles is written at /content/output1/blast_output_alleles.txt
2023-10-28 20:25:02,943 ectyper INFO Reporting results:
2023-10-28 20:25:02,943 ectyper.predictionFunctions INFO Name Species O-type H-type Serotype QC Evidence GeneScores AlleleKeys GeneIdentities(%) GeneCoverages(%) GeneContigNames GeneRanges GeneLengths Database Warnings
2023-10-28 20:25:02,943 ectyper.predictionFunctions INFO U00095.3 - O16 H48 O16:H48 - Based on 3 allele(s) wzx:1;wzy:1;fliC:1; O16-1-wzx-origin;O16-2-wzy-origin;H48-1-fliC-origin; 100;100;100; 100;100;100; gi;gi;gi; 2108337-2109584;2106060-2107226;2002110-2003606; 1248;1167;1497; v1.0 (11-03-2020) -
2023-10-28 20:25:02,944 ectyper INFO
ECTyper has finished successfully.

sequence is set to True (see sequence=True in the arguments above), but the sequence is not shown.

I am trying to use this tool to locate the O-antigen and H-antigen regions specifically. I would be grateful if you could help me view the sequence or export the gene range regions as a FASTA file. @kbessonov1984
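
Until the sequence is reported, the regions can be cut out of the input genome using the AlleleKeys and GeneRanges columns. The sketch below is a workaround under stated assumptions, not an ectyper feature: the function names are hypothetical, coordinates are assumed to be 1-based (the inclusive end is consistent with the reported GeneLengths, e.g. 2109584 - 2108337 + 1 = 1248), and strand is ignored because the output does not report it.

```python
def parse_semicolon_field(field):
    """Split an ectyper column such as '1-5;7-9;' into its non-empty items."""
    return [item for item in field.split(";") if item]


def slice_gene_ranges(contig_seq, gene_ranges, allele_keys):
    """Return FASTA text with one record per reported gene range.

    contig_seq  : sequence of the contig the ranges refer to
    gene_ranges : GeneRanges column, e.g. '2108337-2109584;2106060-2107226;'
    allele_keys : AlleleKeys column, used here as record names
    """
    records = []
    keys = parse_semicolon_field(allele_keys)
    ranges = parse_semicolon_field(gene_ranges)
    for key, rng in zip(keys, ranges):
        start, end = (int(x) for x in rng.split("-"))
        # Assumed 1-based inclusive coordinates -> Python slice.
        subseq = contig_seq[start - 1:end]
        records.append(f">{key} {start}-{end}\n{subseq}")
    return "\n".join(records) + "\n"
```

For U00095.3, which is a single contig, `contig_seq` would be the whole genome sequence. A gene on the reverse strand would come out reverse-complemented relative to the reference allele.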

Switch off E.coli serotype prediction for E.albertii strains

There is little value in predicting an E.coli serotype for other Escherichia species such as E.albertii. A serotype was observed to be predicted for E.albertii even when coverage of the closest reference allele was only 16%. Non-E.coli samples such as E.albertii follow a different nomenclature. I will switch off serotype prediction and reporting for non-E.coli genomes: reporting an E.coli serotype for a non-E.coli sample is confusing and brings no additional value to the end user, even though there is some degree of relatedness, as reported here:

The O-polysaccharide structure and the O-antigen gene cluster of E. albertii HK18069 are related to those of Escherichia coli O55 and E. coli O128 reported earlier.

Genome name not extracted from filepath

The genome name is not being extracted from the filepath. I'm using version 0.9.0 via Conda.

Running $ ectyper -i /path/to/file/ECR.fasta produces the following error:

Traceback (most recent call last):
  File ".../ectyper", line 13, in <module>
    ectyper.run_program()
  File ".../ectyper.py", line 101, in run_program
    args
  File ".../genomeFunctions.py", line 115, in get_genome_names_from_files
    files_dict[sample]["modheaderfile"] = r["newfile"]
KeyError: 'ECR'

The files_dict is
{'/path/to/file/ECR.fasta': {'species': 'Escherichia coli', 'filepath': '/path/to/file/ECR.fasta'}}

The expected files_dict is
{'ECR': {'species': 'Escherichia coli', 'filepath': '/path/to/file/ECR.fasta'}}
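
A minimal sketch of how the expected keys could be derived from the file paths with the standard library. The function names are hypothetical and this is not the code in genomeFunctions.py.

```python
import os


def genome_name_from_path(filepath):
    """Return the file name without directory or final extension."""
    base = os.path.basename(filepath)
    name, _ext = os.path.splitext(base)
    return name


def key_by_genome_name(files_dict):
    """Re-key a {filepath: info} dictionary by genome name."""
    return {
        genome_name_from_path(path): info
        for path, info in files_dict.items()
    }
```

Note that `os.path.splitext` removes only the last extension, so a compressed input such as ECR.fasta.gz would yield ECR.fasta and would need extra handling.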

-o argument is required

Hello,
I recommend stating in the instructions that the -o argument is required. On my system, without the output directory specified, I got the error
"TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'"
Adding the -o argument solved the issue.
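
The TypeError suggests the output value is still None when it is concatenated with a string. An alternative to documenting -o as required would be to fall back to a default directory, as in this sketch. The function name and the default naming scheme are assumptions made for illustration, not ectyper behaviour.

```python
import datetime
import os


def resolve_output_dir(output, now=None):
    """Return an absolute output directory, inventing one if none was given.

    output : value of the -o argument, or None when it was omitted
    now    : injectable timestamp, mainly to make the function testable
    """
    if output is None:
        now = now or datetime.datetime.now()
        # Hypothetical default: a timestamped folder in the working directory.
        output = "ectyper_" + now.strftime("%Y-%m-%d_%H-%M-%S")
    return os.path.abspath(output)
```

Resolving the directory once, before any path concatenation, means later code never has to handle a None value.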

Refseq currency check ignores the -r (--refseq) argument

When using the -r (or --refseq) option, the database currency check still uses the default location (__file__/Data/). Consequence: users get a write permission error and ectyper abends.

In def get_refseq_mash():
.
.
.
targetpath = os.path.join(os.path.dirname(__file__), "Data/refseq.genomes.k21s1000.msh")

if bool_downloadMashRefSketch(targetpath):
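
A possible shape for the fix is sketched below: resolve the sketch path once, preferring the user-supplied value, and pass that path to the currency check. `resolve_sketch_path` and its parameters are hypothetical; only the sketch file name and the Data folder come from the snippet above.

```python
import os

DEFAULT_SKETCH_NAME = "refseq.genomes.k21s1000.msh"


def resolve_sketch_path(refseq_arg, package_dir):
    """Pick the MASH sketch path, honouring a user-supplied location.

    refseq_arg  : value of -r/--refseq, or None when not given
    package_dir : directory of the installed package
                  (e.g. os.path.dirname(__file__))
    """
    if refseq_arg:
        return os.path.abspath(refseq_arg)
    return os.path.join(package_dir, "Data", DEFAULT_SKETCH_NAME)
```

The currency check would then receive the resolved path, so a read-only installation directory is never touched when -r points elsewhere.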

TODO: Add O-antigen subtype information to the output

There has been discussion of, and there are requests from reference and diagnostic labs for, adding O-antigen subtype information, because these subtypes are associated with different phenotypic and clinical manifestations. The subtypes of interest are O18ab/ac, O28ac/ab, O112ab/ac, O125ab/ac, O128ab/ac, O141ab/ac, O174ab/ac.

Challenge: Some of the above pairs are very similar genetically and are difficult to resolve.

Possible solution: Incorporate additional metadata into the ECTYPER alleles database. Provide a command-line option that lets the user get subtype information in the O-type field of the output (e.g. --with-O-subtypes). Add a warning message to alert the user to the possible limit of resolution caused by the high degree of similarity between subgroups.

Failed lookup of assembly GCF_000092525

It looks like assembly GCF_000092525 is found in the RefSeq MASH database but not in assembly_summary_refseq.txt.

Version 0.9.0

Stacktrace below:

2019-12-20 13:53:46,398 ectyper.genomeFunctions INFO     Creating combined serotype and identification fasta file
2019-12-20 13:55:26,455 ectyper      INFO     Assembling final list of fasta files
2019-12-20 13:55:34,780 ectyper.speciesIdentification INFO     MASH species RefSeq top hit GCF_000092525.1_ASM9252v1_genomic.fna.gz with distance 0.000830728 and shared hashes ratio 966/1000
2019-12-20 13:55:34,781 ectyper.speciesIdentification INFO     GCF_000092525
2019-12-20 13:55:34,836 ectyper.subprocess_util ERROR    Error in subprocess. The following command failed: ['grep', 'GCF_000092525', '/Galaxy/_conda/envs/[email protected]/lib/python3.7/site-packages/ectyper/Data/assembly_summary_refseq.txt']
2019-12-20 13:55:34,837 ectyper.subprocess_util ERROR    Subprocess failed with error: b''
2019-12-20 13:55:34,837 ectyper.subprocess_util CRITICAL ectyper has stopped
subprocess failure
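
grep exits with status 1 and writes nothing to stderr when it finds no match, which is consistent with the empty b'' error shown in the log. A lookup that treats a missing accession as "unknown" instead of a fatal error could look like the sketch below. The function name is hypothetical, and the column positions assume the usual NCBI layout of assembly_summary_refseq.txt (accession in column 1, organism name in column 8).

```python
def find_organism(summary_lines, accession):
    """Return the organism name for an assembly accession, or None.

    summary_lines : iterable of lines from assembly_summary_refseq.txt
    accession     : accession without version, e.g. 'GCF_000092525'
    """
    for line in summary_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        # Compare the accession with its version suffix removed.
        if fields[0].split(".")[0] == accession and len(fields) > 7:
            return fields[7]
    return None
```

The caller can then pass an open file handle for the summary file and report species "-" when the function returns None, rather than stopping the run.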

Unable to identify species, even in the representative genome of E.coli

First, I want to thank you for your work on this pipeline. I have been trying to run ECTyper since yesterday without success, and there seems to be a problem with it.

  1. I created a new conda environment with conda create --name ectyper
  2. I installed the module with conda install -c bioconda ectyper
  3. I downloaded the representative genome of E.coli (Escherichia coli O157:H7 str. Sakai), RefSeq: NC_002695.2
  4. I moved the file to a new directory and renamed it O157H7.fasta
  5. I executed ECTyper with ectyper -i O157H7.fasta --verify -o output_dir

The results indicate the following:

2022-10-05 15:00:13,591 ectyper.predictionFunctions INFO     Name	Species	O-type	H-type	Serotype	QC	Evidence	GeneScores	AlleleKeys	GeneIdentities(%)	GeneCoverages(%)	GeneContigNames	GeneRanges	GeneLengths	Database	Warnings
2022-10-05 15:00:13,591 ectyper.predictionFunctions INFO     O157H7	-	-	-	-:-	WARNING (WRONG SPECIES)	-	-							v1.0 (11-03-2020)	Sample identified as -: serotyping results are only available for E.coli samples.If sure that sample is E.coli run without --verify parameter.Sample was not identified as valid E.coli sample but as -
2022-10-05 15:00:13,591 ectyper      INFO     
ECTyper has finished successfully. 

It seems to identify the serotype correctly without the --verify argument, but I need the species to be assigned as E.coli.

Also, the problem seems to involve MASH and the database, because just before that result I get this:

2022-10-05 15:00:13,588 ectyper.speciesIdentification INFO     Following top hits returned by MASH ['GCF_000002435.1_GL2_genomic.fna.gz', 'GCF_000003955.1_ASM395v1_genomic.fna.gz', 'GCF_000005845.2_ASM584v2_genomic.fna.gz', 'GCF_000006665.1_ASM666v1_genomic.fna.gz', 'GCF_000006825.1_ASM682v1_genomic.fna.gz', '']
2022-10-05 15:00:13,589 ectyper.speciesIdentification WARNING  
Top MASH sketch hit GCF_000002435.1_GL2_genomic.fna.gz with 1/1000 shared hashes.
Could not assign species based on MASH distance to reference sketch file.
Due to either:
1. MASH sketch meta data accessions do not start with the GCF_ prefix in assembly_summary_refseq.txt or
2. Number of shared hashes to reference is less than 100 (i.e. too distant).
3. Genome coverage is very limited causing species verification to fail.
If sample is E.coli, try running without --verify parameter

However, GCF_000002435.1 is the accession of Giardia lamblia ATCC 50803.
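
The hit list in the log is in accession order and ends with an empty string, which suggests the hits may not have been ranked before the first one was taken. As an illustration only, and not ectyper's parsing code, the sketch below ranks the tab-separated output of `mash dist` (reference, query, distance, p-value, shared hashes) by shared hashes and skips blank lines. The function name is hypothetical.

```python
def parse_mash_dist(output):
    """Parse `mash dist` text output and return hits, best match first."""
    hits = []
    for line in output.splitlines():
        if not line.strip():
            continue  # ignore blank lines, e.g. a trailing newline
        ref, _query, dist, p_value, shared = line.split("\t")
        matched, total = (int(x) for x in shared.split("/"))
        hits.append({
            "reference": ref,
            "distance": float(dist),
            "p_value": float(p_value),
            "shared": matched,
            "sketch_size": total,
        })
    # Most shared hashes first; ties broken by smaller distance.
    hits.sort(key=lambda h: (-h["shared"], h["distance"]))
    return hits
```

With this ordering, a reference sharing 1/1000 hashes can only come first when nothing better exists, at which point a minimum shared-hashes check would reject it.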

I also tried the Docker version (using docker pull kbessonov/ectyper:1.0.0, because docker pull kbessonov/ectyper doesn't work), but I get the same issues.

INFO     Name	Species	O-type	H-type	Serotype	QC	Evidence	GeneScores	AlleleKeys	GeneIdentities(%)	GeneCoverages(%)	GeneContigNames	GeneRanges	GeneLengths	Database	Warnings
2022-10-05 18:22:37,697 ectyper.predictionFunctions INFO     input	-	-	-	-:-	WARNING (WRONG SPECIES)	-	-							v1.0 (11-03-2020)	File /ectyper/input.fasta not found!Sample was not identified as valid E.coli sample but as -
2022-10-05 18:22:37,698 ectyper      INFO     
ECTyper has finished successfully.

The Galaxy version seems to be working fine at least, but I need this to work locally for APECtyper

Can you please explain the QC?

00104c81-2985-4943-9502-9143a1eee412 Escherichia coli UMEA 3053-1 O75 H5 O75:H5 NA - Based on 3 allele(s) wzx:1.000;wzy:1.000;fliC:1.000; -
0013e2ce-394b-4fc5-a356-abadd635fa00 Escherichia coli 2-177-06_S4_C2 O132 H10 O132:H10 NA - Based on 3 allele(s) wzx:0.999;wzy:0.999;fliC:1.000; -

E.g. in the two records above, the QC column is NA and the confidence column is "-". The output looks sensible when compared against e.g. MLST, though. Can you explain how ECTyper produces this and what the above means?

Thanks

TODO: Make tool compatible with metadata or mixed reads datasets

Currently, version 1.0.0 of the tool can only type single-culture (pure) E.coli inputs. Since biological inputs are often contaminated, mixed-culture, or of metadata type, there is a growing need to add read binning capability to the tool.
This feature is non-trivial, especially for closely related E.coli species.
Need to:

  • explore existing solutions
  • test feasibility
