cidgoh / pathogen-genomics-package

This is the DataHarmonizer spreadsheet web application bundled with pathogen genomics data entry and validation templates

License: MIT License

HTML 100.00%

data-harmonization infectious-disease

pathogen-genomics-package's Issues

MPX template: remove LIMS export field "PH_INSTRUMENT_CGN"

In the CanCOGeN DH template, the sequencing instrument information is supposed to export to the PH_INSTRUMENT_CGN field, whereas the MPX template exports the sequencing instrument to PH_INSTRUMENT.

For some reason both are showing up in the MPX export. Can we turn off the "PH_INSTRUMENT_CGN" field in the MPX export?

I'm not even sure how that field is showing up since it's not in the MPX template's LIMS export column at all...

Canadian MPX template: age bin data overwritten when a saved file is re-opened in the DH

Currently in the Canadian MPX template a user can have a dataset in which the age value is a null value, but they have entered an age bin instead (so that the exact age is obfuscated but the general age range is shared). There is code in the DH that says "if there is a null value in the age field, put the same null value in the age bin field", which was created in order to automate populating associated fields and reduce data entry. If the user saves the dataset and re-opens it later, then the entered/saved age bin data gets overwritten with the same null value as in the age field.

This is erasing information that the user has entered. And we had the same issue in the CanCOGeN template.

You put a fix in place in the CanCOGeN template so that upon opening a file, whatever is in the age bin field remains untouched. But if a user is entering fresh data and enters a null value for the age field, the age bin field still autofills with the same null value (which they can edit themselves if they want).
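
The behaviour requested here can be sketched roughly as follows. This is only an illustration, not the DH code: the field names `host age` and `host age bin`, the list of null values, and the `loading_saved_file` flag are all assumptions made for the example.

```python
# Hypothetical null values; the real templates define their own list.
NULL_VALUES = {
    "Not Applicable",
    "Missing",
    "Not Collected",
    "Not Provided",
    "Restricted Access",
}


def autofill_age_bin(row, loading_saved_file):
    """Return a copy of row with the age bin autofilled where appropriate.

    row is a dict with the (assumed) keys "host age" and "host age bin".
    """
    updated = dict(row)
    if loading_saved_file:
        # Opening a saved file: leave whatever is in the age bin untouched.
        return updated
    age = updated.get("host age")
    if age in NULL_VALUES:
        # Fresh data entry: copy the null value into the age bin.
        # The user can still edit the age bin afterwards.
        updated["host age bin"] = age
    return updated
```

With this split, a saved row such as age "Not Provided" and age bin "20 - 29" keeps its age bin on re-open, while a freshly entered null age still autofills the bin.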

Can you put the same fix into the MPX template that you put in for the CanCOGeN template to address this issue?

Thanks!

CanCOGeN Template - NML LIMS Export

I believe that when the same value is entered more than once in the vaccination fields that are concatenated into the PH_VACCINATION_HISTORY field, the repeated values are being unintentionally omitted from the NML LIMS DataHarmonizer export from the CanCOGeN template.

When, for example:

Astrazeneca (Vaxzevria) is entered for more than one vaccination dose name

the result is:

<Host Vaccination Status>;Astrazeneca (Vaxzevria);2022-11-01;2023-01-01;2023-03-01;2023-06-01;<Vaccination History>

rather than the expected:

<Host Vaccination Status>;Astrazeneca (Vaxzevria);2022-11-01;Astrazeneca (Vaxzevria);2023-01-01;Astrazeneca (Vaxzevria);2023-03-01;Astrazeneca (Vaxzevria);2023-06-01;<Vaccination History>

The same is occurring when the same date is used for more than one vaccination dose.
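
To make the expected behaviour concrete, here is a small sketch of a positional join that keeps repeated values. The function names are made up for illustration and are not from the DH export code. The second function shows one possible cause (an assumption, not confirmed anywhere in this issue): removing duplicates before joining reproduces the shortened output reported above.

```python
def join_vaccination_history(status, doses, history):
    """Join every dose name and date in order, keeping repeats.

    doses is a list of (dose_name, dose_date) tuples in dose order.
    """
    parts = [status]
    for name, date in doses:
        parts.extend([name, date])
    parts.append(history)
    return ";".join(parts)


def join_with_deduplication(status, doses, history):
    """Same join, but dropping repeated values first (the suspected bug)."""
    parts = [status]
    for name, date in doses:
        parts.extend([name, date])
    parts.append(history)
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return ";".join(dict.fromkeys(parts))


doses = [
    ("Astrazeneca (Vaxzevria)", "2022-11-01"),
    ("Astrazeneca (Vaxzevria)", "2023-01-01"),
    ("Astrazeneca (Vaxzevria)", "2023-03-01"),
    ("Astrazeneca (Vaxzevria)", "2023-06-01"),
]
expected = join_vaccination_history(
    "<Host Vaccination Status>", doses, "<Vaccination History>"
)
observed = join_with_deduplication(
    "<Host Vaccination Status>", doses, "<Vaccination History>"
)
```

Here `expected` matches the expected string above and `observed` matches the reported result, which would also explain why repeated dates go missing.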

Thank you!

GRDI Template - isolated_by range is WhiteSpaceMinimizedString, but should be a menu.

The isolated_by field currently takes a string. However, the same field in the Excel template over at the GRDI_AMR_One_Health repo is limited to its own isolated_by menu. This presumably should be incorporated here as well.

There appears to be an isolated_by menu in the corresponding YAML. I might suggest that, instead of having its own menu, this field take the same menu as sample_collected_by.

wiki pages - comprehensive template resources

We need to provide a summary page that consolidates template resources for each spec (e.g. add in CanCOGeN, MonkeyPox, AMBR, etc.), kind of like:
https://github.com/cidgoh/DataHarmonizer/wiki/DataHarmonizer-Template-SOPs

Update in both the DH wiki and the pathogen-genomics-package.

Wiki page title: Pathogen Genomics Template Resources

AMBR template: new release request

Some fields have been removed, others added. New picklists have been added. Ontology IDs have been added. Guidance and examples have been updated.

Can we do a new release of the AMBR template pretty please?

I tracked the changes in the version tracker and bumped the proposed version number to 2.1.1 (in red) as there were changes to fields (x), terms (y) and guidance/defs/IDs (z).

Pathogen Genomics Package "Get latest release" needs to be updated (goes to DH repo)

If you open your instance of the PGP and click on "Get latest release" under the Help button, it takes you to the latest DH release (in the DH repo), not the latest release in the PGP repo.

e.g.
I opened the DataHarmonizer in pathogen-genomics-package-PGPv1.3.7,
clicked "Get latest release",
and it took me here: https://github.com/cidgoh/DataHarmonizer/releases
but shouldn't it go here: https://github.com/cidgoh/pathogen-genomics-package/releases

Can we update please?

Term AGRO:00000080 coded twice under different names under field environmental_material.

Pig manure [AGRO:00000080]:
- text: Pig manure [AGRO:00000080]
- is_a: Animal manure [AGRO:00000079]

and:

Poultry litter [AGRO:00000080]:
- text: Poultry litter [AGRO:00000080]

Poultry litter is the correct one: http://purl.obolibrary.org/obo/AGRO_00000080
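
A check along these lines could catch this kind of duplicate before release. It is a rough sketch only: it assumes menu labels end with the ontology ID in square brackets, as in the two entries above, and the function name is made up for the example.

```python
import re
from collections import defaultdict

# Matches a trailing ontology ID such as "[AGRO:00000080]".
ID_PATTERN = re.compile(r"\[([A-Za-z_]+:\d+)\]\s*$")


def find_duplicate_ids(labels):
    """Return {ontology_id: [labels]} for IDs used by more than one label."""
    by_id = defaultdict(list)
    for label in labels:
        match = ID_PATTERN.search(label)
        if match:
            by_id[match.group(1)].append(label)
    return {
        term_id: names for term_id, names in by_id.items() if len(names) > 1
    }


menu = [
    "Animal manure [AGRO:00000079]",
    "Pig manure [AGRO:00000080]",
    "Poultry litter [AGRO:00000080]",
]
duplicates = find_duplicate_ids(menu)
```

For the three labels above, `duplicates` contains only AGRO:00000080, listed with both the Pig manure and Poultry litter labels.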

CanCOGeN template: new field and export reqs for NML LIMS

I added a new field called "travel history availability" and values.

Can the values from this field be added to those that are concatenated in the NML LIMS field "PH_TRAVEL" in the NML LIMS export, pretty please?

And then can we do a new release (at the same time as the AMBR release maybe?)? I bumped the template version to 2.1.2 (in red) because there were changes to fields, terms and guidance/examples.

I suggested bumping the PGP version to 2.0.1 (in red) as there were no new templates added (x), no new schemas (y), but there were changes to existing templates (z).

ReadMe - more details

Can we make it so this readme includes the Stand-Alone DataHarmonizer Functionality section from the main DataHarmonizer readme? Also the information from the old "stand-alone" installation instructions?

DH modularity and 1:N wish list

A sample can contain multiple organisms, multiple kinds of the same organism (i.e. multiple isolates), and isolates may be sequenced multiple times using different protocols or instruments. This creates a 1-to-many issue, where one sample may need to be linked to multiple organisms, isolates, library IDs, associated tests (AMR drug panels from different companies) etc.

Currently the contextual data for organisms, isolates, etc. from the same sample has to be entered repeatedly, which creates a data entry burden for data providers.

Ideally, modularity could be created so that sample information could be entered once and linked to different isolates.
Similarly, isolate information could be entered once and linked to different libraries with different processing details/instruments.
Also similarly, libraries could be linked to multiple sequencing runs and/or associated tests.

To submit the data to LIMS or public repositories, every library, isolate or organism would need the metadata from the sample, so ideally upon export the DH would populate that info and present each one as a separate line in a spreadsheet.
e.g. the above situation would appear like:
sample 1 --> organism 1 --> isolate A --> library 1 --> sequence 1
sample 1 --> organism 2 --> isolate B --> library 2 --> sequence 2
sample 1 --> organism 2 --> isolate C --> library 3 --> sequence 3
sample 1 --> organism 2 --> isolate C --> library 4 --> sequence 4
sample 1 --> organism 2 --> isolate C --> library 4 --> sequence 5
*But the data provider wouldn't have to enter the different metadata multiple times.
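
The export step described above could look something like this sketch. The nested structure and the key names (`fields`, `organisms`, `isolates`, `libraries`, `sequences`) are assumptions made for illustration, not an existing DH data model.

```python
def flatten_samples(samples):
    """Expand nested sample records into one flat row per sequence.

    Each level stores its own metadata once under "fields"; on export the
    parent metadata is copied into every descendant row.
    """
    rows = []
    for sample in samples:
        for organism in sample["organisms"]:
            for isolate in organism["isolates"]:
                for library in isolate["libraries"]:
                    for sequence in library["sequences"]:
                        row = {}
                        for level in (sample, organism, isolate, library):
                            row.update(level["fields"])
                        row["sequence"] = sequence
                        rows.append(row)
    return rows


# The five-line example above, with each level entered only once.
EXAMPLE = [{
    "fields": {"sample": "sample 1"},
    "organisms": [
        {"fields": {"organism": "organism 1"}, "isolates": [
            {"fields": {"isolate": "isolate A"}, "libraries": [
                {"fields": {"library": "library 1"},
                 "sequences": ["sequence 1"]},
            ]},
        ]},
        {"fields": {"organism": "organism 2"}, "isolates": [
            {"fields": {"isolate": "isolate B"}, "libraries": [
                {"fields": {"library": "library 2"},
                 "sequences": ["sequence 2"]},
            ]},
            {"fields": {"isolate": "isolate C"}, "libraries": [
                {"fields": {"library": "library 3"},
                 "sequences": ["sequence 3"]},
                {"fields": {"library": "library 4"},
                 "sequences": ["sequence 4", "sequence 5"]},
            ]},
        ]},
    ],
}]

rows = flatten_samples(EXAMPLE)
```

`rows` then holds five flat records, one per sequence, each carrying the sample, organism, isolate and library values even though every level was entered only once.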

Can we make the DH do this modular/1:N data capture and transformation (pretty please)?

CanCOGeN Template - Request for Addition to PickList

Hi all!

Following some conversation with our data partners, we'd like to request that the option for:

"Throat"

be added to the picklist for "Anatomical part".

Thank you!

broken readme images

The images referenced in this readme don't appear to get bundled in this repo's package:

https://github.com/cidgoh/pathogen-genomics-package/tree/main/templates/canada_covid19/exampleInput

MPX template: Replace "PH_CANCOGEN_AUTHORS" with "PH_SEQUENCING_AUTHORS"

Found another CanCOGeN artifact in the NML LIMS export from the MPX template.

Can we please replace "PH_CANCOGEN_AUTHORS" with "PH_SEQUENCING_AUTHORS" after the "SUBMITTED_RESLT - Gene Target #5 CT Value" field in the NML export, pretty please?

In the DH template it's supposed to export as "PH_SEQUENCING_AUTHORS" so I'm not sure where "PH_CANCOGEN_AUTHORS" is coming from...

CanCOGeN and MPXV templates: fix rule for transforming Homo sapiens to Human in NML LIMS output

The field "host (scientific name)" outputs to 2 places in the NML LIMS export: the field PH_SPECIMEN_SOURCE, which goes into NML LIMS, and the DH field "host (scientific name)", which also appears in the export file but doesn't get uploaded to LIMS.

In the recent changes to the PGP, we lost the rule that says IF host (scientific name) is Homo sapiens THEN PH_SPECIMEN_SOURCE is Human. The NML uses "Human" instead of "Homo sapiens".

The issue we had before was that the DH was outputting the transformed value (Human) in the host (scientific name) field as well as in PH_SPECIMEN_SOURCE.
i.e. IF host (scientific name) is Homo sapiens THEN the exported host (scientific name) should be Homo sapiens and NOT Human.

In other words, we want the entered data (Homo sapiens) to be in the DH output fields (lower case after the Provenance field), but the transformed value (Human) in the NML LIMS field (PH_SPECIMEN_SOURCE, before the Provenance field).
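
As a rough sketch of the rule being asked for (not the actual export code, which has many more columns), the transformation would apply only to the LIMS column. The function name is made up for the example.

```python
def export_host_fields(row):
    """Build the two host-related export columns from one entered value.

    Only PH_SPECIMEN_SOURCE gets the NML wording; the DH field keeps
    exactly what the user entered.
    """
    host = row.get("host (scientific name)", "")
    return {
        "PH_SPECIMEN_SOURCE": "Human" if host == "Homo sapiens" else host,
        "host (scientific name)": host,
    }


exported = export_host_fields({"host (scientific name)": "Homo sapiens"})
```

In `exported`, PH_SPECIMEN_SOURCE is "Human" while host (scientific name) stays "Homo sapiens"; any other host value passes through unchanged in both columns.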

Can we do this?
The fix is needed for the NML LIMS export from both the CanCOGeN and Monkeypox templates.