bulk-data-client's People

Contributors: dtphelan1, mikix, prb112, radamson, vlad-ignatov

bulk-data-client's Issues

Toggling off DocumentReference attachment processing

Looking for a configurable option to not process DocumentReference url attachments at all, regardless of size or type; I didn't see anything in the available params that does that. The params used to dictate whether to inline the content of the URL attachment come close, but in my case, even when setting inlineAttachmentSize to 0, individual bundles containing the attachment are placed in the 'attachment' folder (one for every line of the DocumentReference NDJSON file that has a URL attachment). The client seems to take a lot of time to iterate over and process large numbers of these DocumentReference.content.attachment.url elements, so I'd like to be able to toggle that functionality off.
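For what it's worth, a minimal config sketch of the behavior I'm after, assuming the existing downloadAttachments option (referenced in other issues on this page) could act as a hard off switch; this is hypothetical, since in my version attachment processing still runs regardless:

module.exports = {
    // ...existing export settings (fhirUrl, groupId, etc.)...

    // Hypothetical hard toggle: skip DocumentReference.content.attachment.url
    // processing entirely, regardless of attachment size or contentType.
    downloadAttachments: false,
    inlineDocRefAttachmentsSmallerThan: 0
};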

Thanks!

User-Agent string is awkward

Using the bulk-data-client to connect/test with FHIR $export and backend services I noticed the user-agent string.

NODE_DEBUG=* AUTO_RETRY_TRANSIENT_ERRORS=1 SHOW_ERRORS=1 node . --config config/ibm-fhir-server.js --global --_type Patient
user-agent: Bulk Data Client <https://github.com/smart-on-fhir/bulk-data-client>

It appears to come from request.js

exports.default = source_1.default.extend({
    hooks: {
        beforeRequest: [
            options => {
                options.headers["user-agent"] = "Bulk Data Client <https://github.com/smart-on-fhir/bulk-data-client>";
            }
        ]
    }
});

I've never seen this format for a user-agent. Per Mozilla, a User-Agent should follow:

User-Agent: <product> / <product-version> <comment>

I think some proxies may reject this as an injection attack or strip it away.
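A hedged example of a value in that shape (the version number here is made up), which would be a one-line change in the hook above:

options.headers["user-agent"] = "bulk-data-client/1.0.0 (+https://github.com/smart-on-fhir/bulk-data-client)";

The repository URL moves into the parenthesized <comment> token, so proxies see a conventional product/version identifier first.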

I'm bringing this up for your awareness. Thank you for the client, it's great.

`ERR_STREAM_PREMATURE_CLOSE` error on early `destroy()` when downloading ndjson

When testing the bulk data client with the SMART Bulk Data Server, I receive the following error from the CLI: Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close.

To reproduce, use the following parameters:

See the screenshot below:
[Screenshot: Screen Shot 2023-01-04 at 10 43 18 AM]

For reference, I am using Node.js version 18.12.0. I do not receive this error on Node.js version 16. I believe this issue may be related to the got dependency upgrade to version 11.8.6. I commented on the relevant open GitHub issue to bring it to the got team's attention.
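For context, a minimal sketch that reproduces the underlying Node behavior outside of got (this is my assumption about the mechanism, not the client's actual code path): destroying a pipeline source before it ends surfaces this error on Node 18.

import { PassThrough, pipeline } from "node:stream";

const source = new PassThrough();
const sink = new PassThrough();
sink.resume(); // keep the pipeline flowing

pipeline(source, sink, (err) => {
    // On Node 18 this logs: Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    console.error(err);
});

source.write('{"resourceType":"Patient"}\n');
source.destroy(); // destroyed before end() -> premature close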

Expose debugging header fields

At least in Cerner's case, there are some request headers that they like to ask for when you go to them for support. Specifically:

  • X-Request-Id
  • Cerner-Correlation-Id

If either of those headers is present, it might be nice to put them in the output or the error log or something, since they're hard to get after the fact.

I'm sure Epic has its own version of this, which would also be nice to expose, but I don't know those headers offhand.
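As a sketch of what I mean, assuming got's afterResponse hook (mirroring the beforeRequest hook the client already uses in request.js); the header list and console logging are placeholders:

import got from "got";

const DEBUG_HEADERS = ["x-request-id", "cerner-correlation-id"];

export default got.extend({
    hooks: {
        afterResponse: [
            response => {
                // Record vendor support IDs when present, since they are
                // hard to recover after the fact.
                for (const name of DEBUG_HEADERS) {
                    const value = response.headers[name];
                    if (value) console.log(`${name}: ${value}`);
                }
                return response;
            }
        ]
    }
});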

Allow an inline-only mode for DocRef attachments

User problem

I want to bulk export & archive DocumentReferences from my EHR while inlining attachments I know I'll care about in the future. This is both so that I have an archival record of the attachments I'm using right now and in case the EHR becomes unavailable in the future, I can still continue to refer back to the attachments.

However, I don't care about most mimetypes. I really only want html & text. PDFs, RTFs, etc. will not be interesting to me, and will just take a long time to download.

Proposed fix

Instead of having the "inline attachment" logic be a subset of the "download attachment" logic, make them sibling paths, i.e. allow an inline-only mode where we don't bother downloading an attachment that is marked as a PDF in the FHIR record.

How would you like to see the user configuration for such a mode done? (i.e. a new option like inlineOnly, or re-use the existing configuration and just start paying attention to inlineDocRefAttachmentsSmallerThan even if downloadAttachments is false?)

I have some code that I'm testing for my own purposes to do this, and I can propose a PR later. But it will likely need some tweaking to fit your preferred configuration changes to allow for "inline only".
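To make the sibling-path idea concrete, a sketch of the decision under the re-use-existing-config option (the mimetype allow-list is hypothetical; the two config names are the existing ones):

const INLINE_TYPES = ["text/html", "text/plain"]; // hypothetical allow-list

interface AttachmentConfig {
    downloadAttachments: boolean;               // existing option
    inlineDocRefAttachmentsSmallerThan: number; // existing option (size threshold)
}

function shouldInline(
    attachment: { contentType?: string; size?: number },
    cfg: AttachmentConfig
): boolean {
    // Inlining is decided independently of downloading, so a small
    // text attachment is still inlined when downloadAttachments is false.
    return (
        INLINE_TYPES.includes(attachment.contentType ?? "") &&
        (attachment.size ?? 0) < cfg.inlineDocRefAttachmentsSmallerThan
    );
}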

`--_since` Silently Dropped If Time Is Malformed

Summary

The _since export parameter requires times to be formatted as FHIR instant datatypes. If a time is malformed, the client will silently drop the parameter. Bulk Data exports can take a long time to run, and it's not obvious that a supplied parameter was silently dropped when running.

It would be great if malformed _since values were caught early so that users aren't surprised when the export request doesn't include the parameter.

Examples

Valid Time Format Example

The following example illustrates a valid _since parameter that is properly included; everything works as expected.

The invocation: node . --_since 2010-03-14T09:00:00.000-04:00 -f https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir -g 048d4683-703a-4311-963d-48e515a6372b

> node . --_since 2010-03-14T09:00:00.000-04:00 -f https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir -g 048d4683-703a-4311-963d-48e515a6372b
Kick-off started
Kick-off completed
Bulk Data export started
Status endpoint: https://bulk-data.smarthealthit.org/fhir/bulkstatus/9a593ac26be5e169a7aaa6cc3279eb90
Bulk Data export completed in 10 seconds

Downloading exported files: ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 100%
          Downloaded Files: 11 of 11
            FHIR Resources: 1,793
               Attachments: 0
           Downloaded Size: 125.6 kB
         Uncompressed Size: 1.9 MB
         Compression ratio: 1/16

Download completed in 1 second
Do you want to signal the server that this export can be removed? [Y/n]Y

The server was asked to remove this export!

Nothing in the CLI report indicates a _since parameter was included, but inspecting the downloads/log.ndjson reveals the full export URL which includes the _since parameter:

"https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir/Group/048d4683-703a-4311-963d-48e515a6372b/$export?_since=2010-03-14T09%3A00%3A00-04%3A00

Invalid Time Format Example

The following example uses the same invocation except the leading zero from the time zone component is missing.

The invocation: node . --_since 2010-03-14T09:00:00.000-4:00 -f https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir -g 048d4683-703a-4311-963d-48e515a6372b

node . --_since 2010-03-14T09:00:00.000-4:00 -f https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir -g 048d4683-703a-4311-963d-48e515a6372b
Kick-off started
Kick-off completed
Bulk Data export started
Status endpoint: https://bulk-data.smarthealthit.org/fhir/bulkstatus/2af9534143f2bae9522780466490bc35
Bulk Data export completed in 11 seconds

Downloading exported files: ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 100%
          Downloaded Files: 12 of 12
            FHIR Resources: 2,185
               Attachments: 0
           Downloaded Size: 178.5 kB
         Uncompressed Size: 2.7 MB
         Compression ratio: 1/16

Download completed in 1 second
Do you want to signal the server that this export can be removed? [Y/n]Y

The server was asked to remove this export!

Inspecting the downloads/log.ndjson reveals that the _since parameter was not included: "https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir/Group/048d4683-703a-4311-963d-48e515a6372b/$export
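A sketch of the kind of early check that would help, using a simplified instant pattern rather than the full FHIR regex (the function name and wiring are hypothetical):

// Rejects values like "2010-03-14T09:00:00.000-4:00" before kick-off.
const FHIR_INSTANT = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

function assertValidSince(since: string): void {
    if (!FHIR_INSTANT.test(since)) {
        throw new Error(
            `Invalid --_since value "${since}": expected a FHIR instant ` +
            `like 2010-03-14T09:00:00.000-04:00`
        );
    }
}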

Log export metadata

First cut at data we probably want to capture:

For kick-off request:

  • Timestamp (there may be value in capturing all timestamps as local time with tz offset rather than UTC, so one can easily determine whether time of day consistently impacts performance)
  • Vendor and server info (should be available via the server's capability statement: software.name, software.version, software.releaseDate, fhirVersion)
  • Server URL and/or another way to distinguish between sites
  • Request parameters (possibly exclude the Patient reference list, since this could be very large?)

For kick-off error (when applicable):

  • Timestamp
  • Error code
  • Error body

For status error (when applicable):

  • Timestamp
  • Error code
  • Error body

For completion manifest (when applicable):

  • Timestamp
  • Resource file count
  • Error file count

For each file download:

  • Timestamp (start)
  • File identifier (hash of output[].url ?)
  • Resource type (output[].type)

For failed file downloads:

  • Timestamp (end)
  • File identifier (hash of output[].url ?)
  • Error code
  • Error content

For completed file downloads:

  • Timestamp (end)
  • File identifier (hash of output[].url ?)
  • Line count
  • Size in bytes

Design thoughts

  • Option to turn logging on and off by config and command line parameter
  • Could put a file in the output directory of each export so it's easy for users to review and share
  • Probably want to generate a unique id for each export and include it in all the log items so it's easy to import them into a data store and query over them
  • ndjson would be flexible and familiar to users
  • Lines could be objects with standard properties like:
    • exportId - UUID generated per export
    • timestamp - as mentioned above, maybe ISO 8601 with tz
    • eventName - e.g., "kick_off", "kick_off_error", "status_error", "manifest", "download_start", "download_end", "download_error", etc.
    • eventData - object with event-specific parameters (see the sample lines below)
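For illustration, two hypothetical lines in that shape (all values made up):

{"exportId":"7f9c0c4e-8d2a-4b7e-9b1a-2f3e4d5c6b7a","timestamp":"2023-01-10T17:26:00-05:00","eventName":"kick_off","eventData":{"serverUrl":"https://bulk-data.smarthealthit.org/fhir","requestParameters":{"_type":"Patient"}}}
{"exportId":"7f9c0c4e-8d2a-4b7e-9b1a-2f3e4d5c6b7a","timestamp":"2023-01-10T17:28:42-05:00","eventName":"download_error","eventData":{"fileId":"a1b2c3","code":429,"body":"Too Many Requests"}}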

Open questions

  • Are there other export activities or data that it would make sense to capture?
  • Do we want to capture a set of client side errors like "error_writing_file"?
  • Does the event format above make sense?
  • How much data should we repeat in each log item for easy queryability (e.g., include resource type in all of the file related events or just in the download start and assume that users can join events on the file identifier)?

cc: @vlad-ignatov @mikix

Downloading attachments does not handle Binary references

It looks like the current attachment downloading code assumes the url of a DocumentReference attachment points at raw data. The spec seems to allow for that not to be the case.

Binary URLs

From https://www.hl7.org/fhir/datatypes.html#Attachment:
If the URL is a relative reference, it is interpreted in the same way as a resource reference (https://www.hl7.org/fhir/references.html#references)

And we often see that as a Binary reference. For example, Epic usually gives "url": "Binary/xxxxxxx".

And then you presumably request the URL, get {"contentType": "xxx", "data": "xxx"}, then do the real data extraction.

Absolute URLs

The spec does seem to imply that a full absolute URL will just point at raw data. But it appears that Cerner will give a full URL like https://fhir-ehr.cerner.com/r4/ec2458f2-1e24-41c8-b71b-0e701af7583d/Binary/XR-198381926 as seen in their docs.

So it looks like even if you are given a full URL, you have to check whether it starts with the base URL first, then do the extra Binary handling.
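Roughly, the handling I'd expect (a sketch under the spec reading above; uses Node 18's global fetch, and the Binary response shape is simplified):

async function resolveAttachment(url: string, baseUrl: string): Promise<Buffer> {
    // Relative references resolve against the FHIR base URL, per the spec.
    const absolute = new URL(url, baseUrl.replace(/\/?$/, "/")).href;

    if (absolute.startsWith(baseUrl) && /\/Binary\//.test(absolute)) {
        // Server-local Binary reference (relative, or absolute like Cerner's):
        // fetch the Binary resource and decode its base64 data element.
        const res = await fetch(absolute, { headers: { accept: "application/fhir+json" } });
        const binary = (await res.json()) as { contentType: string; data: string };
        return Buffer.from(binary.data, "base64");
    }

    // Other absolute URLs are assumed to point at raw data.
    const res = await fetch(absolute);
    return Buffer.from(await res.arrayBuffer());
}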

Non-Binary URLs

I dunno what you do with a relative FHIR URL that isn't a Binary but is something else, like Patient. Presumably that doesn't happen for DocumentReference.

Caveat

I've not done much real world testing. Just spec reading and hearing from folks that have access to Epic. I have access to a Cerner sandbox, but it errors out when I try to export DocumentReference. So... 🤷

'onCancel' handler was attached after the promise settled

Hello, I am trying to test the bulk data client with the SMART Bulk Data Server and I am receiving the following error when running the CLI: "The onCancel handler was attached after the promise settled". For context, I am using the following parameters:

The screenshot below shows the error (as well as the npm/node versions that I am working with). The error occurs when trying to download the files. (Note: I have been able to run the CLI successfully with this fhirUrl and group, so this error is not thrown consistently.)
[Screenshot: Screen Shot 2022-11-09 at 3 21 05 PM]

The onCancel handler error seems to be coming from the got dependency (based on my understanding of this GH issue in the got repository).

Are there any configuration steps that I have missed that would result in this error? Thank you for your help!

Output doesn't give a lot of details on what resourceType was exported

Output doesn't give a lot of details on what resourceType was exported:

Downloading exported files: ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0%
          Downloaded Files: 0 of 13
            FHIR Resources: 0
               Attachments: 0
           Downloaded Size: 0.0 B

It'd be great if there were an indication of which resource type each file references.

Thanks, this is a great tool.

Files Fail to Download

It appears that a recent commit caused a regression resulting in empty bulk data download files. I was able to use git bisect to trace the regression to commit 8b642be. Upon further investigation I noticed that the TypeScript is not transpiled at every commit, which means the error could have been introduced in a different commit. 3f8d2e3 needs a minor modification to transpile (adding file?: string to the LoggingOptions type), but otherwise works.

470f6f0 appears to be the true culprit.

Here is the command I used:

node . -f https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir -g 048d4683-703a-4311-963d-48e515a6372b

The console will show:

Kick-off started
Kick-off completed
Bulk Data export started
Bulk Data export completed in 10 seconds

Downloading exported files: ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 100%
          Downloaded Files: 12 of 12
            FHIR Resources: 2,185
               Attachments: 0
           Downloaded Size: 178.5 kB
         Uncompressed Size: 2.7 MB
         Compression ratio: 1/16

Download completed in 2 seconds

But, the files in the download directory will be empty:

> ls -al downloads

... 480 Jan 10 17:26 .
... 704 Jan 10 17:26 ..
... 0 Jan 10 17:26 1.CarePlan.ndjson
... 0 Jan 10 17:26 1.Claim.ndjson
...

Tested with node v17.19.1.

Download attempts should be retried

I've had the situation recently where Cerner likes to give many many tiny files as the bulk export result. Each file has no more than 200 lines and no more than one patient. (So some files are very short, like one line, if the patient doesn't have many resources.) In one recent example, the server wanted me to download over 20k files.

In this situation, the bulk-data-client starts tearing through those files as fast as it can (and by default that means 5 files at once). But that means the server quickly gets mad at it, giving 429 errors and the like.

To avoid 429 errors, I've had to reduce the parallelDownloads setting to 1. But even then, I get the occasional http hiccup (20k requests after all) like a 502 gateway error.

So my feature request is to retry download attempts in general, with some backoff for 429 results specifically.
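For concreteness, a sketch using got's built-in retry options (the limits and status codes are illustrative, not a proposal for exact defaults):

import got from "got";

const client = got.extend({
    retry: {
        limit: 5,
        // Retry transient failures, including rate limiting and bad gateways.
        statusCodes: [429, 500, 502, 503, 504],
        calculateDelay: ({ attemptCount, error, computedValue }) => {
            if (computedValue === 0) return 0; // got decided not to retry

            // Back off per Retry-After on 429s when the server provides it.
            const retryAfter = Number(error.response?.headers["retry-after"]);
            if (error.response?.statusCode === 429 && retryAfter > 0) {
                return retryAfter * 1000;
            }

            // Otherwise exponential backoff: 1s, 2s, 4s, ...
            return 1000 * 2 ** (attemptCount - 1);
        }
    }
});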

Add support for downloading & inlining DiagnosticReport.presentedForm

It's a very similar use case to DocumentReference.content.

I myself don't need this for a specific need. I'm just envisioning a future time where the Cumulus team might want to run NLP on those notes, and thus archive/inline the text when exporting.

But for now, this is a low priority wishlist request.
