
ingest

Shared internal tooling for pathogen data ingest. Used by our individual pathogen repos which produce Nextstrain builds. Expected to be vendored by each pathogen repo using git subrepo.

Some tools may only live here temporarily before finding a permanent home in augur curate or Nextstrain CLI. Others may happily live out their days here.

Vendoring

Nextstrain maintained pathogen repos will use git subrepo to vendor ingest scripts. (See discussion on this decision in #3)

For a list of Nextstrain repos that are currently using this method, use this GitHub code search.

If you don't already have git subrepo installed, follow the git subrepo installation instructions. Then add the latest ingest scripts to the pathogen repo by running:

git subrepo clone https://github.com/nextstrain/ingest ingest/vendored

Any future updates of ingest scripts can be pulled in with:

git subrepo pull ingest/vendored

If you run into merge conflicts and would like to pull in a fresh copy of the latest ingest scripts, pull with the --force flag:

git subrepo pull ingest/vendored --force

Warning: Beware of rebasing or dropping the parent commit of a git subrepo update

git subrepo relies on metadata in the ingest/vendored/.gitrepo file, which includes the hash of the parent commit in the pathogen repo. If this hash no longer exists in the commit history, future git subrepo pull commands will fail with errors.

If you run into an error similar to the following:

$ git subrepo pull ingest/vendored
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''

Check the parent commit hash in the ingest/vendored/.gitrepo file and make sure the commit exists in the commit history. Update to the appropriate parent commit hash if needed.

History

Much of this tooling originated in ncov-ingest and was passaged through mpox's ingest/. It subsequently proliferated from mpox to other pathogen repos (rsv, zika, dengue, hepatitisB, forecasts-ncov), primarily through copying. This repo was created to counter that proliferation.

Elsewhere

The creation of this repo, in both the abstract and concrete, and the general approach to "ingest" has been discussed in various internal places, including:

Scripts

Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.

  • notify-on-diff - Send Slack message with diff of a local file and an S3 object
  • notify-on-job-fail - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
  • notify-on-job-start - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
  • notify-on-record-change - Send Slack message with details about line count changes for a file compared to an S3 object's metadata recordcount. If the S3 object's metadata does not have recordcount, the script will attempt to download the S3 object and count lines locally; this fallback only supports xz-compressed S3 objects.
  • notify-slack - Send message or file to Slack
  • s3-object-exists - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
  • trigger - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
  • trigger-on-new-data - Triggers downstream GitHub Actions if the provided upload-to-s3 outputs do not contain the identical_file_message. A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.
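The fallback line count that notify-on-record-change performs when the S3 object lacks a recordcount metadata entry can be sketched in Python. This is an illustrative sketch, not the script's actual code; the function name is made up, and it assumes (as described above) that only xz compression is supported:

```python
import lzma


def count_records(xz_path: str) -> int:
    """Count newline-delimited records in an xz-compressed file.

    Mirrors the fallback described above: only xz compression is
    supported, so any other extension is rejected outright.
    """
    if not xz_path.endswith(".xz"):
        raise ValueError("only xz-compressed objects are supported")
    with lzma.open(xz_path, "rt") as fh:
        return sum(1 for _ in fh)
```

The count can then be compared against the previous recordcount to decide whether a Slack notification is warranted.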

NCBI interaction scripts that are useful for fetching public metadata and sequences.

  • fetch-from-ncbi-entrez - Fetch metadata and nucleotide sequences from NCBI Entrez and output to a GenBank file. Useful for pathogens with metadata and annotations in custom fields that are not part of the standard NCBI Datasets outputs.

Historically, some pathogen repos used the undocumented NCBI Virus API through fetch-from-ncbi-virus to fetch data. However, we've opted to drop the NCBI Virus scripts due to #18.

Potential Nextstrain CLI scripts

  • sha256sum - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
  • cloudfront-invalidate - CloudFront invalidation is already supported in the nextstrain remote command for S3 files. This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
  • upload-to-s3 - Upload file to AWS S3 bucket with compression based on file extension in S3 URL. Skips upload if the local file's hash is identical to the S3 object's metadata sha256sum. Adds the following user-defined metadata to the uploaded S3 object:
    • sha256sum - hash of the file generated by sha256sum
    • recordcount - the line count of the file
  • download-from-s3 - Download file from AWS S3 bucket with decompression based on file extension in S3 URL. Skips download if the local file already exists and has a hash identical to the S3 object's metadata sha256sum.
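The skip logic shared by upload-to-s3 and download-from-s3 boils down to comparing the local file's SHA-256 digest with the digest stored in the object's metadata. A minimal sketch of the idea (function names are illustrative, not the scripts' actual interfaces):

```python
import hashlib


def sha256sum(path: str) -> str:
    """Hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_skip(local_path: str, remote_metadata: dict) -> bool:
    """Skip the transfer when the local hash matches the stored metadata."""
    return remote_metadata.get("sha256sum") == sha256sum(local_path)
```

If the metadata is missing entirely, the comparison fails and the transfer proceeds, which is the safe default.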

Potential augur curate scripts

Software requirements

Some scripts may require Bash ≥4. If you are running these scripts on macOS, the built-in Bash (/bin/bash) does not meet this requirement. You can install a newer Bash with Homebrew (brew install bash).

Testing

Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

For more locally testable scripts, Cram-style functional tests live in tests and are run as part of CI. To run these locally:

  1. Download Cram: pip install cram
  2. Run the tests: cram tests/


Issues

Beware of rebasing/dropping parent commit of `git subrepo` automated commits

git subrepo relies on metadata in the .gitrepo file which includes the hash for the parent commit.

If this parent commit no longer exists, there can be errors in future git subrepo pull updates.
I ran into the following error when trying to update ingest/vendored in pathogen-repo-template:

$ git subrepo pull ingest/vendored -v -d
>>> git rev-parse --verify HEAD
* Assert that working copy is clean: /Users/jlee2346/Repos/nextstrain/pathogen-repo-template
* Check for worktree with branch subrepo/ingest/vendored
  * Fetch the upstream: https://github.com/nextstrain/ingest (main).
>>> git fetch --no-tags --quiet https://github.com/nextstrain/ingest main
  * Get the upstream subrepo HEAD commit.
>>> git rev-parse FETCH_HEAD^0
  * Create ref 'refs/subrepo/ingest/vendored/fetch'.
>>> git update-ref refs/subrepo/ingest/vendored/fetch 9cfed8b1a93e881d85a71aef46e66730ca660523
* Deleting old 'subrepo/ingest/vendored' branch.
* Remove worktree: 
>>> git branch -D subrepo/ingest/vendored
* Create subrepo branch 'subrepo/ingest/vendored'.
  * Check if the 'subrepo/ingest/vendored' branch already exists.
  * Subrepo parent: 0fb3be579140549bd4ef1b7b3aa22a48593911b4
  * Create new commits with parents into the subrepo fetch
>>> git rev-list --reverse --ancestry-path --topo-order 0fb3be579140549bd4ef1b7b3aa22a48593911b4..HEAD
  * Create branch 'subrepo/ingest/vendored' for this new commit set .
>>> git branch subrepo/ingest/vendored 
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''

The logs show it's referencing the parent commit hash (0fb3be579140549bd4ef1b7b3aa22a48593911b4) in the .gitrepo file. This commit was one I had made locally before rebasing and pushing the new commits to GitHub.
The actual parent commit hash should be b82d26d1252897959b42d5122aff94f20bfa05ed.

transform-field-names: Skip field if `old_field_name` == `new_field_name`

Context

Prompted by nextstrain/avian-flu#40 (comment)

The current workaround is to use the --force flag to silence the warnings, but this risks masking actual field name overwrites.

Description

If the old field name is the same as the new field name, the script should skip the rename, which would avoid the loud warning below:

for old_field, new_field in field_map.items():
    if record.get(new_field) and not args.force:
        print(
            f"WARNING: skipping rename of {old_field} because record",
            f"already has a field named {new_field}.",
            file=stderr
        )
        continue
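The proposed change would short-circuit before that warning when the old and new names are identical. A sketch of the behavior as a self-contained function (the function name and signature are illustrative, not the script's actual interface):

```python
def rename_fields(record: dict, field_map: dict, force: bool = False) -> dict:
    """Sketch of the proposed behavior.

    Identical old/new names are skipped silently; an existing destination
    field is only overwritten with force=True, otherwise the rename is
    skipped (with a warning in the real script).
    """
    record = dict(record)
    for old_field, new_field in field_map.items():
        if old_field == new_field:
            continue  # no-op rename: skip without warning
        if record.get(new_field) and not force:
            continue  # would overwrite an existing field
        if old_field in record:
            record[new_field] = record.pop(old_field)
    return record
```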

transform-strain-name: build strain name by concatenating fields

Context

Following the naming pattern set in SARS-CoV-2 sequences, strain names are usually <country>/<sample_id>/<year>. All three fields are typically available in the metadata so we can concatenate them to "build" a reasonable strain name.

Example of this feature implemented by @corneliusroemer in nextstrain/mpox@6df9dd1

We could extend the existing transform-strain-name script to accept input columns that are concatenated with a provided separator.
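A minimal sketch of that extension (the function name, column names, and default separator are illustrative):

```python
def build_strain_name(record: dict, columns: list, separator: str = "/") -> str:
    """Concatenate metadata fields into a strain name,
    e.g. country/sample_id/year -> "USA/ABC123/2023".
    """
    return separator.join(str(record[c]) for c in columns)
```

In the real script, the columns and separator would presumably be command-line arguments.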

How to vendor scripts in pathogen repos?

This issue was originally written as an overview of git subtree, but later repurposed into a discussion of how to vendor scripts, specifically choosing between git subtree and git subrepo. In the end, we settled on git subrepo noting that this is a small implementation detail that can vary by pathogen repo, and be changed in the future.

Original issue

This is a summary of how I used git subtree as part of #2. Note that different pipelines can choose different methods of vendoring scripts from this repo, but git subtree is particularly nice as it requires no knowledge of its existence from a user of the pipeline.

Helpful reading: Git Subtree basics

The script was added to this repo (nextstrain/ingest) from within nextstrain/hepatitisB using ingest as a subtree. Specifically, from the hepB repo:

# use a branch (in hepB)
git checkout -b 'vendored-scripts'

# add the ingest repo as a subtree, using the 'apply-geolocation-rules' branch
git remote add ingest-remote git@github.com:nextstrain/ingest.git
git subtree add --prefix ingest/vendored ingest-remote apply-geolocation-rules --squash
# Adds a merge commit with one parent the previous host repo HEAD commit,
#     and the other a squashed commit of the 'ingest' repo

# move the script to the subtree repo (ingest/vendored), modify the snakemake
# rules accordingly and commit changes (to the hepB repo)

# push the changes up to the subtree repo (ingest) on branch apply-geolocation-rules
git subtree push --prefix ingest/vendored ingest-remote apply-geolocation-rules
# The commit message was identical, but only the changes to ingest/vendored
# were part of the subtree commit (probably obvious!)

It was tested in monkeypox by pulling in (this branch of) the ingest repo as a subtree, and updating the transform rule accordingly.

Reflections

This approach is pretty straightforward but changing the branch of a subtree seems to pollute the git history a bit. An alternative approach would be to simply have a subtree of ingest at the main branch, push any changes to the subtree up to a branch of the subtree, merge that branch via GitHub (with code review etc), then pull down the changes once they're on the main branch (of the subtree repo).

Given a script with differences in multiple repos, the most straightforward approach may be to create a to-be-vendored version of the script locally, copy it into each repo to test, and when you are satisfied, create a PR in this ingest repo without using subtrees at all. Once it's in main, it is straightforward to pull it into each pathogen repo using git subtree pull ....

Comments / improvements welcome. At the very least this may give others a quick start guide!

Update existing pathogen repos to vendor ingest scripts

Context

Depends on #3

Once we've decided on how to vendor the centralized ingest scripts, we should update the existing pathogen builds to use the ingest scripts.

This will probably be better to do as an iterative process that happens in parallel with #1.

Description

Repos that should be updated to vendor the ingest scripts:

The following repos have WIP PRs for adding ingest to their workflows that can be updated to vendor the scripts:

`transform-genbank-location` should be noisier about violated expectations

Current Behavior

Currently, transform-genbank-location depends on the value of the database field, and silently no-ops if this field is missing or not one of the expected values.

Expected behavior

The script should emit a warning when the database field is missing or lacks an expected value, to make it clear that it is not actually changing the data.

How to reproduce

Steps to reproduce the current behavior:

  1. Run the script on a NDJSON record that doesn't have a database field
  2. Observe that the script runs and exits without error

Possible solution

Emit a warning to STDERR on the else branch of the core conditional in the main section.
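A sketch of that possible solution. The expected database values and the function shape here are assumptions for illustration, not the script's actual code:

```python
import sys

# Assumption: the database values the script actually handles.
EXPECTED_DATABASES = {"GenBank", "RefSeq"}


def transform_location(record: dict) -> dict:
    database = record.get("database")
    if database in EXPECTED_DATABASES:
        ...  # existing parsing of the location field would go here
        return record
    # New: be noisy instead of silently passing the record through.
    print(
        f"WARNING: unexpected database value {database!r}; "
        "leaving location field unchanged.",
        file=sys.stderr,
    )
    return record
```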

Add script for parsing BioSample records

There have been cases in the past where sample metadata is in a linked BioSample record instead of the GenBank record that we download from NCBI.

We currently only use the BioSample data in the ncov-ingest workflow, but we can generalize the ParseBiosample Transformer to be a portable script that can be vendored for other pathogen ingests.

Add duplicated scripts from pathogen repos

The first step in making this repository useful is to populate it with scripts that are currently manually copied around pathogen repos.

See shared GDoc for additional context and details on scripts.

Progress

This was originally created by @joverlee521 in #1 (comment).

Identical scripts (added in #6)

  • s3-object-exists
  • trigger
  • sha256sum
  • cloudfront-invalidate
  • merge-user-metadata
  • transform-authors
  • transform-field-names
  • transform-genbank-location

Diverged scripts with various different versions used across workflows
(binned into related groups):

Simple notify scripts (added in #8)

  • notify-slack
  • notify-on-job-start
  • notify-on-job-fail

S3 interaction + notify scripts that depend on S3 files (added in #12)

  • upload-to-s3
  • download-from-s3
  • notify-on-record-change
  • notify-on-diff
  • trigger-on-new-data

Genbank interactions

  • genbank-url (added in #16)
  • fetch-from-genbank (based on genbank-url) (added in #16)
  • fetch-genbank (hepatitisB) (added in #15)

Nextclade joining

  • join-metadata-and-clades (TBD) Dropping custom Python script in favor of csvtk/tsv-utils commands (nextstrain/mpox#207)

Potential augur curate scripts

  • transform-strain-names (added in #20)
  • apply-geolocation-rules: #4

Summary of differences

This is the original issue text from @jameshadfield.

Here's a quick scan of duplicated ingest scripts, using monkeypox as the "base", against 4 other ingest script directories:

Directories of scripts considered:

mpx       # monkeypox/ingest/bin at a1f0d7b
hbv       # hepatitisB/ingest/scripts at 1cdd197
rsv       # rsv/ingest/bin at ba171f4
dengue    # dengue/ingest/bin branch: new_ingest @ 247b2fd 
ncov      # ncov-ingest/bin at 88fddbe

Note that when only 1-3 lines differ, it's often just an added comment indicating where the script was copied from.

mpx/apply-geolocation-rules
        rsv/apply-geolocation-rules     IDENTICAL
        hbv/apply-geolocation-rules.py      17 lines different
        dengue/apply-geolocation-rules  IDENTICAL
mpx/cloudfront-invalidate
        rsv/cloudfront-invalidate       IDENTICAL
        dengue/cloudfront-invalidate    IDENTICAL
        ncov/cloudfront-invalidate      IDENTICAL
mpx/csv-to-ndjson
        rsv/csv-to-ndjson.py      16 lines different
        dengue/csv-to-ndjson    IDENTICAL
        ncov/csv-to-ndjson       3 lines different
mpx/download-from-s3
        dengue/download-from-s3       2 lines different
        ncov/download-from-s3       8 lines different
mpx/fasta-to-ndjson
        rsv/fasta-to-ndjson     IDENTICAL
        dengue/fasta-to-ndjson  IDENTICAL
mpx/fetch-from-genbank
        dengue/fetch-from-genbank       1 lines different
mpx/genbank-url
        rsv/genbank-url      42 lines different
        dengue/genbank-url      11 lines different
mpx/join-metadata-and-clades.py
        rsv/join-metadata-and-clades.py       3 lines different
        dengue/join-metadata-and-clades.py      IDENTICAL
        ncov/join-metadata-and-clades     114 lines different
mpx/merge-user-metadata
        rsv/merge-user-metadata IDENTICAL
        dengue/merge-user-metadata      IDENTICAL
mpx/ndjson-to-tsv-and-fasta
        rsv/ndjson-to-tsv-and-fasta     IDENTICAL
        dengue/ndjson-to-tsv-and-fasta  IDENTICAL
mpx/notify-on-diff
        dengue/notify-on-diff   IDENTICAL
mpx/notify-on-job-fail
        rsv/notify-on-job-fail       1 lines different
        dengue/notify-on-job-fail       1 lines different
        ncov/notify-on-job-fail      10 lines different
mpx/notify-on-job-start
        rsv/notify-on-job-start       3 lines different
        dengue/notify-on-job-start       3 lines different
        ncov/notify-on-job-start      30 lines different
mpx/notify-on-record-change
        rsv/notify-on-record-change       3 lines different
        dengue/notify-on-record-change       3 lines different
        ncov/notify-on-record-change       6 lines different
mpx/notify-slack
        rsv/notify-slack      15 lines different
        dengue/notify-slack     IDENTICAL
        ncov/notify-slack      16 lines different
mpx/reverse_reversed_sequences.py
        dengue/reverse_reversed_sequences.py    IDENTICAL
mpx/s3-object-exists
        rsv/s3-object-exists    IDENTICAL
        dengue/s3-object-exists IDENTICAL
        ncov/s3-object-exists       1 lines different
mpx/sha256sum
        rsv/sha256sum   IDENTICAL
        dengue/sha256sum        IDENTICAL
        ncov/sha256sum       1 lines different
mpx/transform-authors
        rsv/transform-authors   IDENTICAL
        dengue/transform-authors        IDENTICAL
mpx/transform-date-fields
        rsv/transform-date-fields       IDENTICAL
        dengue/transform-date-fields    IDENTICAL
mpx/transform-field-names
        rsv/transform-field-names       IDENTICAL
        dengue/transform-field-names    IDENTICAL
mpx/transform-genbank-location
        rsv/transform-genbank-location  IDENTICAL
        dengue/transform-genbank-location       IDENTICAL
mpx/transform-strain-names
        rsv/transform-strain-names       1 lines different
        dengue/transform-strain-names   IDENTICAL
mpx/transform-string-fields
        rsv/transform-string-fields     IDENTICAL
        dengue/transform-string-fields  IDENTICAL
mpx/trigger
        dengue/trigger  IDENTICAL
        ncov/trigger    IDENTICAL
mpx/trigger-on-new-data
        dengue/trigger-on-new-data       1 lines different
        ncov/trigger-on-new-data       6 lines different
mpx/upload-to-s3
        rsv/upload-to-s3       3 lines different
        dengue/upload-to-s3       3 lines different
        ncov/upload-to-s3       1 lines different

Remove ndjson scripts from existing pathogen repos

Follow-up to #1. Copying some useful text from the Google doc linked there:

csv-to-ndjson / tsv-to-ndjson

No longer needed as this can be done with augur curate passthru.
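For reference, the conversion these scripts performed is a thin wrapper around CSV parsing. Roughly (an illustrative sketch, not the actual script or the augur curate implementation):

```python
import csv
import json
import sys


def csv_to_ndjson(csv_lines, out=sys.stdout):
    """Emit one JSON object per CSV row (the essence of csv-to-ndjson)."""
    for row in csv.DictReader(csv_lines):
        json.dump(row, out)
        out.write("\n")
```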

We can remove this script from the following repos (found with code search) as we update their workflows to use augur curate. 

  • hepatitisB
  • monkeypox
  • nextstrain/ncov-ingest@6ef4dc0
  • Rsv
  • Dengue (won't show up in code search as it's on a branch)
  • Zika and Ebola (won't show up in code search as it's on a branch)

fasta-to-ndjson

No longer needed as this can be done with a combination of augur parse and augur curate passthru. 

We can remove this script from the following repos (found with code search) as we update their workflows.

  • monkeypox
  • Rsv
  • Dengue (won't show up in code search as it's on a branch)
  • Zika and Ebola (won't show up in code search as it's on a branch)

ndjson-to-tsv-and-fasta

No longer needed as this can be done with augur curate passthru. 

We can remove this script from the following repos (found with code search) as we update their workflows.

  • monkeypox 
  • Rsv
  • Dengue (won't show up in code search as it's on a branch)
  • Zika and Ebola (won't show up in code search as it's on a branch)

fetch-from-ncbi-virus does not include nucleotide sequences

Current Behavior

Previously, the nucleotide sequence for each record was included as sequence, since we pull the nucleotide sequence as part of the NCBI Virus URL:

('sequence', 'Nucleotide_seq'),

However, the monkeypox ingest workflow has been returning empty values for sequences.
Looking back at previous versions of s3://nextstrain-data/files/workflows/monkeypox/genbank.ndjson.xz:

2023-09-05 (version id c.cdLtg8OxV1Pyl8SSlWE1_dqKpQBT.z) - still included sequences for all records
2023-09-06 (version id PaqGNfdlQXH7eV9b.WVpaOm5ioQ1pVD2) - 240/6751 records did not include sequence
2023-09-07 (version id nImnSdA8NDGCJdVuDuMmsoFB_hveCkCC) - 6071/6762 records did not include sequence
2023-09-08 (version id UZ9VwlVMqVfAeP0sMux9qE4H1e6dGZRP) - none of the 6809 records included sequences
2023-09-09 (version id VWxHnqlAUVEGRU4_ngYsJuctK7Tftyyn) - none of the 6807 records included sequences

I had wondered if there was a bug in the centralized ingest script, but running the recently deleted monkeypox fetch-from-genbank script returns the same results without sequences.

NCBI Virus observations

The nucleotide sequence field name has not changed, since you can still download the sequences in a FASTA file using the same field name:

https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/?fq={!tag=SeqType_s}SeqType_s:("Nucleotide")&fq=VirusLineageId_ss:(10244)&cmd=download&sort=SourceDB_s desc,CreateDate_dt desc,id asc&dlfmt=fasta&fl=AccVer_s,Definition_s,Nucleotide_seq

However, downloading in CSV or JSON format results in an empty column for Nucleotide_seq.
