
Parse multiple Antimicrobial Resistance Analysis Reports into a common data structure

License: GNU Lesser General Public License v3.0


hamronization's Introduction


hAMRonization

This repo contains the hAMRonization module and CLI parser tools, which combine the outputs of 18 (as of 2022-09-25) disparate antimicrobial resistance gene detection tools into a single unified format.

This is an implementation of the hAMRonization AMR detection specification, which supports both gene presence/absence resistance and mutational resistance (where supported by the underlying tool).

The package also supports a variety of summary options, including an interactive summary.

hAMRonization overview

Installation

This tool requires Python >=3.7 and pandas. The latest release can be installed directly via pip, conda, docker, from this repository, or from the Galaxy toolshed:

pip install hAMRonization


Or

conda create --name hamronization --channel conda-forge --channel bioconda --channel defaults hamronization


Or to install using docker:

docker pull finlaymaguire/hamronization:latest

Or to install the latest development version:

git clone https://github.com/pha4ge/hAMRonization
pip install ./hAMRonization

Alternatively, hAMRonization can be installed and used in Galaxy via the Galaxy toolshed.

Usage

NOTE: Only the output format used in the "last updated" version of each AMR prediction tool has been tested for accuracy. Older tool versions, or newer updates that change the output format, may not work. In theory this should only be a problem with major version changes, but not all tools follow semantic versioning. If you encounter any issues with newer tool versions, please create an issue in this repository.

usage: hamronize <tool> <options>

Convert AMR gene detection tool output(s) to hAMRonization specification format

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Tools with hAMRonizable reports:
  {abricate,amrfinderplus,amrplusplus,ariba,csstar,deeparg,fargene,groot,kmerresistance,resfams,resfinder,mykrobe,pointfinder,rgi,srax,srst2,staramr,tbprofiler,summarize}
    abricate            hAMRonize abricate's output report i.e., OUTPUT.tsv
    amrfinderplus       hAMRonize amrfinderplus's output report i.e., OUTPUT.tsv
    amrplusplus         hAMRonize amrplusplus's output report i.e., gene.tsv
    ariba               hAMRonize ariba's output report i.e., OUTDIR/OUTPUT.tsv
    csstar              hAMRonize csstar's output report i.e., OUTPUT.tsv
    deeparg             hAMRonize deeparg's output report i.e.,
                        OUTDIR/OUTPUT.mapping.ARG
    fargene             hAMRonize fargene's output report i.e., retrieved-
                        genes-*-hmmsearched.out
    groot               hAMRonize groot's output report i.e., OUTPUT.tsv (from `groot
                        report`)
    kmerresistance      hAMRonize kmerresistance's output report i.e., OUTPUT.res
    resfams             hAMRonize resfams's output report i.e., resfams.tblout
    resfinder           hAMRonize resfinder's output report i.e.,
                        ResFinder_results_tab.txt
    mykrobe             hAMRonize mykrobe's output report i.e., OUTPUT.json
    pointfinder         hAMRonize pointfinder's output report i.e.,
                        PointFinder_results.txt
    rgi                 hAMRonize rgi's output report i.e., OUTPUT.txt or
                        OUTPUT_bwtoutput.gene_mapping_data.txt
    srax                hAMRonize srax's output report i.e., sraX_detected_ARGs.tsv
    srst2               hAMRonize srst2's output report i.e., OUTPUT_srst2_report.tsv
    staramr             hAMRonize staramr's output report i.e., resfinder.tsv
    tbprofiler          hAMRonize tbprofiler's output report i.e., OUTPUT.results.json
    summarize           Provide a list of paths to the reports you wish to summarize

To look at a specific tool e.g. abricate:

>hamronize abricate -h 
usage: hamronize abricate <options>

Applies hAMRonization specification to output from abricate (OUTPUT.tsv)

positional arguments:
  report                Path to tool report

optional arguments:
  -h, --help            show this help message and exit
  --format FORMAT       Output format (tsv or json)
  --output OUTPUT       Output location
  --analysis_software_version ANALYSIS_SOFTWARE_VERSION
                        Input string containing the analysis_software_version for abricate
  --reference_database_version REFERENCE_DATABASE_VERSION
                        Input string containing the reference_database_version for abricate

Therefore, hAMRonizing abricate's output:

hamronize abricate ../test/data/raw_outputs/abricate/report.tsv --reference_database_version 3.2.5 --analysis_software_version 1.0.0 --format json

To parse multiple reports from the same tool at once, just provide a list of reports as arguments; they will be concatenated appropriately (i.e., only one header for tsv):

hamronize rgi --input_file_name rgi_report --analysis_software_version 6.0.0 --reference_database_version 3.2.5 test/data/raw_outputs/rgi/rgi.txt test/data/raw_outputs/rgibwt/Kp11_bwtoutput.gene_mapping_data.txt

You can summarize hAMRonized reports regardless of format using the 'summarize' function:

> hamronize summarize -h
usage: hamronize summarize <options> <list of reports>

Concatenate and summarize AMR detection reports

positional arguments:
  hamronized_reports    list of hAMRonized reports

optional arguments:
  -h, --help            show this help message and exit
  -t {tsv,json,interactive}, --summary_type {tsv,json,interactive}
                        Which summary report format to generate
  -o OUTPUT, --output OUTPUT
                        Output file path for summary

This will take a list of reports and create a single sorted report in the specified format containing only the unique entries across the input reports. It can handle a mix of json and tsv hAMRonized report formats.

hamronize summarize -o combined_report.tsv -t tsv abricate.json ariba.tsv

The interactive summary option will produce an html file that can be opened within the browser for navigable data exploration (feature developed with @alexmanuele).

Using within scripts

Alternatively, hAMRonization can be used within scripts. The metadata dictionary must contain the mandatory fields that are not included in that tool's output; these can be checked by looking at the CLI flags in hamronize <tool> --help:

import hAMRonization
metadata = {"analysis_software_version": "1.0.1", "reference_database_version": "2019-Jul-28"}
parsed_report = hAMRonization.parse("abricate_report.tsv", metadata, "abricate")

The parsed_report is then a generator that yields hAMRonized result objects from the parsed report:

for result in parsed_report:
      print(result)

Alternatively, you can use the .write method to export all results remaining in the generator to a file (if a filepath isn't provided, this will write to stdout).

parsed_report.write('hAMRonized_abricate_report.tsv')

You can also output a json formatted hAMRonized report:

parsed_report.write('all_hAMRonized_abricate_report.json', output_format='json')

If you want to write multiple reports to one file, the .write method accepts append_mode=True to append to, rather than overwrite, the output file without repeating the header (in tsv format).

parsed_report.write('all_hAMRonized_abricate_report.tsv', append_mode=True)

Implemented Parsers

Currently implemented parsers and the last tool version for which they have been validated:

  1. abricate: last updated for v1.0.0
  2. amrfinderplus: last updated for v3.10.40
  3. amrplusplus: last updated for c6b097a
  4. ariba: last updated for v2.14.6
  5. csstar: last updated for v2.1.0
  6. deeparg: last updated for v1.0.2
  7. fargene: last updated for v0.1
  8. groot: last updated for v1.1.2
  9. kmerresistance: last updated for v2.2.0
  10. mykrobe: last updated for v0.8.1
  11. pointfinder: last updated for v4.1.0
  12. resfams: last updated for hmmer v3.3.2
  13. resfinder: last updated for v4.1.0
  14. rgi (includes RGI-BWT): last updated for v5.2.0
  15. srax: last updated for v1.5
  16. srst2: last updated for v0.2.0
  17. staramr: last updated for v0.8.0
  18. tbprofiler: last updated for v3.0.8

Implementation Details

hAMRonizedResult Data Structure

The hAMRonization specification is implemented in the hAMRonizedResult dataclass.

This is a simple data structure that uses positional and keyword arguments to distinguish mandatory from optional hAMRonization fields. It also uses type hinting to validate that the supplied values are of the correct type.
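To illustrate the idea (this is a simplified sketch, not the actual hAMRonizedResult definition, and the field names shown are only a subset), a dataclass can make mandatory fields positional and optional fields keyword arguments with defaults, and check supplied values against the type hints after initialisation:

```python
from dataclasses import dataclass, fields

@dataclass
class ExampleResult:
    # Mandatory fields: positional, no defaults
    gene_symbol: str
    analysis_software_name: str
    # Optional fields: keyword arguments with defaults
    coverage_percentage: float = None
    reference_accession: str = None

    def __post_init__(self):
        # Validate each supplied (non-None) value against its type hint
        for field in fields(self):
            value = getattr(self, field.name)
            if value is not None and not isinstance(value, field.type):
                raise ValueError(
                    f"{field.name} must be {field.type.__name__}, "
                    f"got {type(value).__name__}")
```

Constructing an instance with a wrongly-typed value then fails immediately rather than producing a malformed report row.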

Each parser follows a similar strategy, using a common interface designed to match the biopython SeqIO parse function:

>>> import hAMRonization
>>> filename = "abricate_report.tsv"
>>> metadata = {"analysis_software_version": "1.0.1", "reference_database_version": "2019-Jul-28"}
>>> for result in hAMRonization.parse(filename, metadata, "abricate"):
...    print(result)

Where the final argument to the hAMRonization.parse command is whichever tool is being parsed.

hAMRonizedResultIterator

An abstract iterator is then implemented to ingest a given AMR tool's report (via the appropriate subclassed implementation), hAMRonize the results (i.e., translate the original inputs to the fields in the hAMRonization specification), and yield a stream of hAMRonizedResult dataclasses.

This iterator also implements a write function to enable outputting the contents to an output stream or filehandle in either tsv or json format.

Tool-specific Iterators

Each tool has a specific subclass of this abstract hAMRonizedResultIterator e.g. AbricateIO.AbricateIterator.

These include an attribute containing the mapping of the tool's original output report fields to the hAMRonized specification fields (self.field_mapping), as well as handling the specification of any additional required metadata.

The parse method of these subclasses then implements the tool-specific parsing logic required. This is typically a simple csv.DictReader but can be more complex such as the json parsing of resfinder output, or the modification of output fields required to better fit some tools into the hAMRonization specification.
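The field_mapping-plus-DictReader pattern described above can be sketched as follows. This is a self-contained toy, not hAMRonization's actual class hierarchy; the class name, column names, and dict-based output are all illustrative assumptions (the real iterators yield hAMRonizedResult dataclasses):

```python
import csv
import io

class ToyIterator:
    """Toy sketch of a tool-specific iterator; names are hypothetical."""

    def __init__(self, handle, metadata):
        self.metadata = metadata
        # Map the tool's native column names onto spec field names
        self.field_mapping = {
            "GENE": "gene_symbol",
            "%COVERAGE": "coverage_percentage",
        }
        self.results = self.parse(handle)

    def parse(self, handle):
        # Tool-specific parsing logic: here, a simple tab-delimited report
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Rename mapped columns, then merge in the required metadata
            hamronized = {self.field_mapping[k]: v for k, v in row.items()
                          if k in self.field_mapping}
            hamronized.update(self.metadata)
            yield hamronized

    def __iter__(self):
        return self.results

# Example: a one-row report supplied as an in-memory file
report = io.StringIO("GENE\t%COVERAGE\noqxA\t100.0\n")
results = list(ToyIterator(report, {"analysis_software_name": "toytool"}))
```

The real subclasses follow the same shape but validate fields and emit typed dataclass instances instead of plain dicts.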

Contributing

We welcome contributions from users in any form, from GitHub issues flagging problems or feature requests to pull requests with bug fixes or new parsers.

Setting up a Development Environment

First, fork this repository and set up a development environment (replacing YOURUSERNAME with your github username):

git clone https://github.com/YOURUSERNAME/hAMRonization
conda create -n hAMRonization 
conda activate hAMRonization
cd hAMRonization
pip install pytest flake8
pip install -e .

Testing and Linting

On every commit, GitHub Actions automatically runs tests and linting to check the code. You can also run these manually in your development environment.

To run a full set of integration tests:

pushd test
bash run_integration_test.sh
popd

To run unit tests that verify parsing validity for each tool, as well as generation of valid summaries, use pytest:

pip install pytest
pushd test
pytest
popd

Finally, to run linting and check whether your code matches the project code style:

pushd hAMRonization
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 . --count --exit-zero --max-complexity=20 --max-line-length=127 --statistics
popd

Adding a new parser

If you wish to add a parser for a new tool here are the main steps required:

  1. Add an entry into _RequiredToolMetadata and _FormatToIterator in hAMRonization/__init__.py which points to the appropriate ToolNameIO.py containing the tool's Iterator subclass

  2. In ToolNameIO.py add a required_metadata list containing any mandatory fields not implemented by the tool

  3. Then add a class ToolNameIterator(hAMRonizedResultIterator) and implement the __init__ methods with the appropriate mapping (self.field_mapping), and metadata (self.metadata).

  4. To this class, add a parse method which reads an opened file stream into a dictionary per line/result (matching the keys of self.field_mapping) and yields the output of self.hAMRonize being applied to that dictionary.

  5. To add a CLI parser for the tool, create a python file in the parsers directory:

    from hAMRonization import Interfaces
    if __name__ == '__main__': 
        Interfaces.cli_parser('toolname')
    

Alternatively, the hAMRonized_parser.py can be used as a common script interface to all implemented parsers.

  6. Finally, following the template in test/test_parsing_validity.py, please generate a unit test that ensures the parser is working as you intend it to!

If you have any questions about any of this or need any help, please file an issue.

FAQ

  • What's the difference between an Antimicrobial Resistance 'Result' and 'Report'?
    • For the purposes of this project, a 'Report' is an output file (or collection of files) from an AMR analysis tool. A 'Result' is a single entry in a report. For example, a single line in an abricate report file is a single Antimicrobial Resistance 'Result'.

Known Issues

Here are some known issues that we would welcome input on trying to solve!

Limitations of specification

  • mandatory fields: gene_symbol and gene_name are confusing and not usually both present (only consistently used in AFP). This means tools either need a 1:2 mapping (a single output field maps to both gene_symbol and gene_name) OR fragile text splitting of a single field that won't be robust to database changes. The current solution is the 1:2 mapping, e.g., staramr.

  • inconsistent nomenclature of terms being used in specification fields: target, query, subject, reference. We need to stick to one name for the sequence with which the database is being searched, and one for the hit that results from that search.

  • sequence_identity: is sequence-type specific (%id over amino acids != %id over nucleotides), but does this matter?

  • coverage_depth seems to include both tool fields that are average read depth and plain overall read counts.

  • contig_id isn't general enough: in some tools this ID naturally corresponds to a read_name (deepARG), an individual ORF (resfams), or a protein sequence (AFP with protein input). Change to query_id_name or similar?

hamronization's People

Contributors

alexmanuele, antunderwood, awitney, cimendes, danymatute, dfornika, fmaguire, imendes93, jodyphelan, pvanheus, raphenya, thanhleviet


hamronization's Issues

RgiIO.py: Typo in line 79

In pha4ge/hAMRonization/blob/master/hAMRonization/RgiIO.py
Typo in line 79: 'Percentage Length of '

HTH,
Svetlana

Flag to filter report for genomic/non-genomic audience

For the interactive report or tabular report possibly have an option to just summarise results (genome, gene, tool, versions, phenotype annotation) and the full genomics results (i.e., the whole spec with start/stop contig coverage etc).

Obtain specification field data information from JSON schema

In the hAMRonizedResult class definition, the field terms for the parsers, and their value types, should be obtained from the schema JSON file and not be hardcoded into the tool. The necessary file is already provided in the schema/ directory. A parser should be included to retrieve this information directly from the file, facilitating the update of the field terms when necessary.

โฌ‡๏ธ
https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/hAMRonizedResult.py#L13-L52

[BUG] `KeyError: 'reference_database_name'` when running summarize

Describe the bug

I get the following error when running with summarize

Warning: <_io.TextIOWrapper name='WAL001-megahit.mapping.potential.ARG.deeparg.json' mode='r' encoding='UTF-8'> report is empty
Traceback (most recent call last):
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/Interfaces.py", line 299, in generic_cli_interface
    hAMRonization.summarize.summarize_reports(
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/summarize.py", line 752, in summarize_reports
    combined_reports = combined_reports.sort_values(
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in sort_values
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in <listcomp>
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/generic.py", line 1849, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'reference_database_name'

Input

hamronize \
    summarize \
    <huge_list_of_jsons> \
    -t interactive \
     \
    -o hamronization_combined_report.html

Input file
I can send a zip of the entire privately if necessary (includes unpublished data)

Error log
See above

hAMRonization Version
1.1.0

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: SUSE Linux Enterprise High Performance Computing 15 SP1
  • Version: hAMRonization 1.1.0

Additional context
Add any other context about the problem here.
If applicable, include dependency versions such as pandas version and Python version.

Galaxy implementation

To facilitate adding to production workflows we should make a galaxy tool wrapper for hAMRonization.

There are special converter type tools that should make this a little easier.

Fix issue of very similar runs falsely combining results in summary

If the exact same tool is run with different settings (but all other metadata stays the same, such as version), summarize falsely combines the results. For example, running RGI on contigs and reads, hamronizing each output, and then combining them in a summary produces an interactive summary implying that both the RGI contig and RGI-bwt results came from the same single run of RGI.

Solution: Summarize should try to treat each file separately and assign a new config # if multiple hAMRonized files are supplied

https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/summarize.py#L16

Wrong Docker link

Dear Finley,
I am Giovanni Iacono from EFSA, we had a video call the other day.
The compilation of the Dockerfile from https://github.com/pha4ge/hAMRonization_workflow fails at task 18 (RUN cd data/test && bash get_test_data.sh && cd ../..).

The reason is that in the current repository https://github.com/pha4ge/hamronization the data folder is not present. This folder is present in https://github.com/pha4ge/hAMRonization_workflow.

Also, a question: the Dockerfile in https://github.com/pha4ge/hAMRonization_workflow installs only the parsers, correct?

Output options

Currently the only output options for parsers are tsv or json printed to stdout.

While users can redirect output from the CLI, it might be nice to give an output_file option.

help understanding resfinder run

Hi devs, I'm trying to run hamronize on some results I generated from the latest resfinder docker image. Here is the command I am trying to run:

hamronize resfinder resfinder/resfinder/results/ResFinder_results_tab.txt  --reference_database_version db_v_1 --analysis_software_version tool_v_1 --output hamr_out/resfinder_out.tsv  

I get the following error:

Traceback (most recent call last):
  File "/home/ewissel/miniconda3/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 275, in generic_cli_interface
    output_format=args.format)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 115, in write
    first_result = next(self)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 75, in __next__
    return next(self.hAMRonized_results)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/ResFinderIO.py", line 48, in parse
    report = json.load(handle)
  File "/home/ewissel/miniconda3/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/ewissel/miniconda3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/ewissel/miniconda3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ewissel/miniconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The only json produced by my resfinder run is the std_format_under_development.json, and I don't think hamronize is indicating it wants to use this file over the resfinder results table. Is this an issue with my resfinder run (hamronize expects resfinder to produce different jsons), an issue with my input arguments, or something else?

AMR Variant detection - Parsers to be updated

The following parsers need to be updated to comply with the new spec:

The following parsers currently are skipping the variant detection results:

The following parsers should be added:

Using owl:equivalentClass to connect things semantically

I noticed the JSON-LD spec has things like

  "owl:equivalentClass": {
    "@id": "edam_data:1050"
  },

The cross-referencing to ontology terms is great, but I think owl:equivalentClass is too strong. It is harmless right now, but in future, if that were somehow converted into native owl context for this and other terms (as a result of data federation etc.), it brings along reasoner baggage that might not be desired. For reaching across vocabularies, how about skos:closeMatch or skos:exactMatch? In particular, very few GenEpiO terms are used in owl context as rdf:Properties, so linking to them via owl:equivalentClass might lead to misinterpretation by some brainless computer somewhere.
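Under the suggestion above, the swap would presumably look something like the following (a sketch of the proposed change, not a tested edit to the actual schema file):

```json
"skos:closeMatch": {
  "@id": "edam_data:1050"
}
```

skos:closeMatch asserts only that the two concepts are similar enough to interchange in some applications, without the logical entailments that an owl:equivalentClass axiom would carry into a reasoner.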

After another round of GenEpiO edits I'll circle back to check out the term mappings here.

Cheers!

Flag overlapping ranges in hAMRonization

If ranges of detected AMR genes overlap in genomic coords by >90%, then flag them in the summary html somehow.

Problem: indices are 1-based AND 0-based in different tools (and many tools don't have genomic coords at all)

Summary options

Create a summary report of just the AMR genes detected per genome with linked detailed reports.

One line per sample
One line per software

Global variable refactoring

Use of global variables should probably be removed, this should fall out of larger refactoring to remove code duplication.

Should tackle #22 and facilitate #23

ORF_ID missing once RGI report hAMRonized

Hi Finlay,

when running ORFs through the RGI-hARMonization pipeline, the ORF_ID (important for final hAMR report) is skipped.

Here https://github.com/SvetlanaUP/hAMRonization/blob/master/hAMRonization/RgiIO.py
you can see that I fixed ORF_ID self.field_mapping manually (Fixed 'ORF_ID': 'None', 'Contig': 'input_sequence_id' (should be opposite) ).

This solution works for us, but maybe for the future would be good to have it defined in the RgiIO.py, e.g. if there is no 'Contig' use 'ORF_ID'.

Thanks,
Svetlana

Decide on a single 'authoritative' schema format

There are several schema definition technologies available to us:

  1. JSON Schema
  2. SALAD
  3. AVRO
  4. JSON-LD

Ideally we would have a single 'authoritative' schema, and any other schema could be automatically derived from it. Which schema definition technology would make the most sense to use as the authoritative schema? Would it be possible to derive all the others from it in a robust and automated way?

Groot parser implementation

With the following single-entry output this is currently what is being parsed:

OqxA.3003470.EU370913.4407527-4408202.4553 266 657 3D648M6D

The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", 'reference_database_id': "argannot"}

This is the current output:

    assert result.input_file_name == 'Dummy'
    assert result.gene_symbol == 'OqxA'
    assert result.gene_name == 'OqxA.3003470.EU370913' 
    assert result.reference_database_id == 'argannot'
    assert result.reference_database_version == '2019-Jul-28'
    assert result.reference_accession == 'OqxA.3003470.EU370913.4407527-4408202.4553' 
    assert result.analysis_software_name == 'groot'
    assert result.analysis_software_version == '0.0.1'
    assert result.reference_gene_length == 657
    assert result.coverage_depth == 266

As you can see, the gene_symbol, gene_name and reference_accession fields are all storing the same information.

I'm having a bit of trouble mapping the fields in OqxA.3003470.EU370913.4407527-4408202.4553 to the spec. The gene_name is technically not present in the report. Should we store the same value as gene_symbol or keep it as None?
For the reference_accession, shouldn't we keep just the EU370913 value? I'm unsure of what 3003470 represents, as well as the 4407527-4408202.4553.

Any input is welcomed!

Add fARGene

This is a suggestion to add the tool fARGene to the hAMRonization tool list. fARGene detects AMR genes based on pre-defined HMM models (provided together with the tool). It would be great to have the fARGene output also standardized in form of a hAMRonization summary. The output is described in the fARGene tutorial. I attached an example output folder (command: fargene -i contigs.fasta --hmm-model class_a -o output_dir ) here: output_dir.zip

Add custom error handling

When file is passed without the expected format/fields, a generic error is thrown. This can be easily handled with a custom exception informing the user about why it's failing.
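A minimal sketch of the kind of custom exception this issue suggests (the exception name, helper function, and message wording are all hypothetical, not hAMRonization's actual code):

```python
class UnexpectedReportFormatError(KeyError):
    """Hypothetical exception: raised when an input report lacks the
    columns a parser expects, instead of a bare KeyError."""

def map_fields(row, field_mapping):
    """Rename a raw report row's keys to spec field names,
    failing with an informative message if columns are missing."""
    missing = set(field_mapping) - set(row)
    if missing:
        raise UnexpectedReportFormatError(
            f"report is missing expected columns {sorted(missing)}; "
            "check that the correct AMR tool output file was supplied")
    return {spec: row[native] for native, spec in field_mapping.items()}
```

The point is simply that the user sees which columns were expected and a hint about the likely cause, rather than a generic traceback.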

Update README

  • Improve installation instructions
  • Update parsers included
  • Add wiki with small tutorial

Simplify AntimicrobialResistanceResult.read()

The AntimicrobialResistanceGenomicAnalysisResult.read() method takes a dictionary as input and loads values into the class by matching the keys in the dictionary against the attribute names in the class.

Since each dictionary lookup could fail, each lookup is wrapped in a try: / except: block. This leads to a really verbose (and inefficient?) implementation.

There may be a simpler way to convert from a dict to our AntimicrobialResistanceGenomicAnalysisResult class via a namedtuple and/or a dataclass
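One way the dataclass-based conversion suggested above might look (a sketch under assumed names; Result stands in for AntimicrobialResistanceGenomicAnalysisResult, and the real class has many more fields):

```python
from dataclasses import dataclass, fields

@dataclass
class Result:
    # Stand-in for AntimicrobialResistanceGenomicAnalysisResult
    gene_symbol: str = None
    drug_class: str = None

def from_dict(cls, data):
    # Keep only keys matching a declared field; unknown keys are
    # silently dropped, so no per-key try/except blocks are needed
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})
```

A single dict comprehension replaces the verbose one-try-block-per-attribute pattern the issue describes.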

Extend schema(s) to include point mutation info

Our schemas currently only incorporate resistance gene detection information, but don't include fields that are relevant to point mutations. Point mutations (and other variants like indels) are important mechanisms of antibiotic resistance and several of our tools include that type of information in their output.

Gene detection information was incorporated first because it was generally deemed to be simpler and more consistent than point mutations, but our schema should support both types of information.

PyPi not updated, 1.0.4 tarball reports version 1.0.3

Hello,

We are using hAMRonization in a pipeline (nf-core/funcscan), and I saw there was a new version of the tool so I went to update it in our pipeline.

However, when I went to do so, I saw the update wasn't on bioconda, and when I tried to update the recipe, the CI test failed saying that 1.0.4 doesn't exist on PyPI.

Secondly, when I went to build the package locally myself from the tarball under the releases page, I saw that if I run hamronization --version it still reports 1.0.3.

It would be maybe good to have a release update with the correct version

(or ideally, if possible 1.0.5 and a fix #66 included. I would try to contribute this myself as it doesn't seem complicated but I'm not a python dev unfortunately)

Implementing logging and debug flag to simplify exception messages

To handle the concerns raised by @cimendes in #39, and to make issues with input selection clearer to users (e.g., #54), we should add proper use of the logging library, default to simple error messages, and add a --debug flag to argparse which displays the full traceback.

  1. Add boolean debug flag in the generic CLI parser (default False):
    https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L217

  2. Set up the logging levels based on args.debug: https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L257 (e.g., https://stackoverflow.com/questions/14097061/easier-way-to-enable-verbose-logging)

  3. Add wrong input file exception using logging library, specifically add try: and except KeyError: at
    https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L66 that explains to the user that expected input columns can't be found and to check if they are using the correct AMR prediction file.

  4. Update the validation of input fields exception at https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/hAMRonizedResult.py#L57 to use logging + debug
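Steps 1 and 2 above could be sketched roughly as follows (a minimal standalone illustration; the function name is hypothetical and the real change would live in Interfaces.py's existing argparse setup):

```python
import argparse
import logging

def setup_cli(argv):
    """Sketch: a --debug flag (default False) that controls log verbosity."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true", default=False,
                        help="show full tracebacks and debug messages")
    args = parser.parse_args(argv)
    # Debug mode shows everything; otherwise only simple error messages
    level = logging.DEBUG if args.debug else logging.ERROR
    logging.basicConfig(level=level)
    return args
```

Exception handlers would then log a short message via logging.error by default, and re-raise (or log the traceback) only when args.debug is set.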

[BUG] Generated output does not follow CSP rules

Describe the bug
The generated output HTML uses inline JavaScript code. This is a violation of CSP rules.

Input
Any input

Input file
Any

Error log
NA

hAMRonization Version
NA

Expected behavior
The generated output file follows CSP rules.

Desktop (please complete the following information):

  • OS: any
  • Browser: any
  • Version: any

Srst2 parser implementation

With the following single-entry output, this is what is currently parsed:

Sample DB gene allele coverage depth diffs uncertainty divergence length maxMAF clusterid seqid annotation
Dummy ResFinder oqxA oqxA 100.0 75.852 1snp 0.152 660 0.037 470 1995 oqxA_1_V00622; V00622; fluoroquinolone

The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", "reference_database_id": 'resfinder'}

This is the current output:

    assert result.input_file_name == 'Dummy'
    assert result.gene_symbol == 'oqxA'
    assert result.gene_name == 'oqxA'
    assert result.reference_database_id == 'ResFinder'
    assert result.reference_database_version == '2019-Jul-28'
    assert result.reference_accession == '1995'
    assert result.analysis_software_name == 'srst2'
    assert result.analysis_software_version == '0.0.1'
    assert result.coverage_percentage == 100
    assert result.reference_gene_length == 660
    assert result.coverage_depth == 75.852

My question concerns the reference_database_id field, which is currently required in the metadata but is (correctly!) parsed from the report file itself. I suggest removing it from the required metadata fields.
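To illustrate the suggestion: the database name is already present in srst2's DB column, so a parser could read it from the report instead of requiring it in the metadata dict. This is a hypothetical sketch, not the actual SRST2 parser code, using a trimmed-down version of the report above.

```python
import csv
import io

# Hypothetical trimmed srst2 report: only the columns needed for illustration.
SRST2_REPORT = "Sample\tDB\tgene\nDummy\tResFinder\toqxA\n"

def infer_reference_database_id(report_text: str) -> str:
    # Take reference_database_id from the DB column of the first data row,
    # rather than from a user-supplied --reference_database_id argument.
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return next(reader)["DB"]
```

Dropping the redundant metadata field would also remove the current mismatch where the user passes 'resfinder' but the report yields 'ResFinder'.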

staramr issue

I installed this tool with pip and am trying to run the following:
hamronize staramr staramr_out/detailed_summary.tsv --reference_database_version db_v_1 --analysis_software_version tool_v_1 --format tsv --output hamr_out/staramr_out.tsv

I get the following error:

Traceback (most recent call last):
  File "/home/ewissel/miniconda3/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 275, in generic_cli_interface
    output_format=args.format)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 115, in write
    first_result = next(self)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 75, in __next__
    return next(self.hAMRonized_results)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/StarAmrIO.py", line 40, in parse
    result['_gene_name'] = result['Gene']
KeyError: 'Gene'

I see that a new version was recently pushed. Should I use a different download method?
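A KeyError like this usually means the report's header doesn't match what the parser expects (often due to a staramr version change). A quick way to check, sketched below with an assumed, illustrative subset of the expected staramr columns:

```python
import csv
import io

# Columns assumed to be required by the staramr parser (illustrative subset).
REQUIRED_COLUMNS = {"Isolate ID", "Gene"}

def missing_columns(tsv_text: str) -> set:
    # Compare the report's header row against the expected columns, to spot
    # a version/format mismatch before hAMRonization raises a KeyError.
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return REQUIRED_COLUMNS - set(header)
```

Running this over the first line of detailed_summary.tsv would show which expected columns are absent from the file you're feeding in.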
