smart-on-fhir / bulk-data-client
License: Apache License 2.0
I'm looking for a configurable option to skip processing DocumentReference URL attachments entirely, regardless of size or type, and didn't see anything in the available params that does that. The params that dictate whether to inline the content of a URL attachment come close, but in my case, even when setting inlineAttachmentSize to 0, individual bundles containing the attachment are placed in the 'attachment' folder (one for every line of the DocumentReference NDJSON file that has a URL attachment). The client seems to take a lot of time iterating over large numbers of these DocumentReference.content.attachment.url elements, so I'd like to be able to toggle that functionality off.
Thanks!
While using the bulk-data-client to connect/test with FHIR $export and backend services, I noticed the user-agent string.
NODE_DEBUG=* AUTO_RETRY_TRANSIENT_ERRORS=1 SHOW_ERRORS=1 node . --config config/ibm-fhir-server.js --global --_type Patient
user-agent: Bulk Data Client <https://github.com/smart-on-fhir/bulk-data-client>
It appears to come from request.js:
exports.default = source_1.default.extend({
    hooks: {
        beforeRequest: [
            options => {
                options.headers["user-agent"] = "Bulk Data Client <https://github.com/smart-on-fhir/bulk-data-client>";
            }
        ]
    }
});
I've never seen this format for a user-agent. Per Mozilla, a User-Agent should follow:
User-Agent: <product> / <product-version> <comment>
I think some proxies may reject this as an injection attack or strip it away.
I'm bringing this up for your awareness. Thank you for the client, it's great.
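For illustration, a compliant value could be built with the product/version (comment) shape described above. This is a minimal sketch; the product name and version string below are placeholders, not the client's actual identifiers:

```typescript
// Sketch: build an RFC 7231-style User-Agent of the form
// <product>/<version> (<comment>). The product name and version here
// are assumptions, not the client's real values.
function buildUserAgent(product: string, version: string, comment: string): string {
    return `${product}/${version} (${comment})`;
}

const userAgent = buildUserAgent(
    "SMART-Bulk-Data-Client",
    "1.0.0",
    "+https://github.com/smart-on-fhir/bulk-data-client"
);
// Yields something like:
// SMART-Bulk-Data-Client/1.0.0 (+https://github.com/smart-on-fhir/bulk-data-client)
```

The `+URL` comment convention is commonly used by crawlers and clients to point at a project page without breaking the product-token grammar.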
Causes an error when the link target is an AWS signed URL
When testing the bulk data client with the SMART Bulk Data Server, I receive the following error from the CLI: Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close.
To reproduce, use the following parameters:
fhirUrl: https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir
group: 048d4683-703a-4311-963d-48e515a6372b
config/defaults.js
For reference, I am using Node.js version 18.12.0; I do not receive this error on Node.js version 16. I believe this issue may be related to the got dependency upgrade to version 11.8.6. I commented on this open GitHub issue to bring it to the attention of the got team.
At least in Cerner's case, there are some request headers that they like to ask for when you go to them for support. Specifically in Cerner's case:
X-Request-Id
Cerner-Correlation-Id
If either of those headers is present, it might be nice to record them in the output or error log; they're hard to get after the fact.
I'm sure Epic has its own version of this, which would also be nice to expose, but I don't know those off hand.
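A minimal sketch of what capturing these could look like. The header names are the Cerner ones listed above; `extractCorrelationIds` is a hypothetical helper, not existing client code:

```typescript
// Sketch: pull vendor correlation headers off a response so they can be
// written to the error log. Header names here are Cerner's, per the issue;
// the headers object is assumed to use Node's lower-cased header keys.
const CORRELATION_HEADERS = ["x-request-id", "cerner-correlation-id"];

function extractCorrelationIds(
    headers: Record<string, string | string[] | undefined>
): Record<string, string> {
    const found: Record<string, string> = {};
    for (const name of CORRELATION_HEADERS) {
        const value = headers[name];
        if (typeof value === "string") found[name] = value;
    }
    return found;
}
```

Other vendors' equivalents could be added to the list, or the list could itself be a config option.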
I want to bulk export & archive DocumentReferences from my EHR while inlining attachments I know I'll care about in the future. This is both so that I have an archival record of the attachments I'm using right now and in case the EHR becomes unavailable in the future, I can still continue to refer back to the attachments.
However, I don't care about most mimetypes. I really only want html & text. PDFs, RTFs, etc will not be interesting to me, and will just take a long time to download.
Instead of having the "inline attachment" logic be a subset of the "download attachment" logic, make them sibling paths, i.e. allow an inline-only mode where we don't bother downloading an attachment that is marked as a PDF in the FHIR record.
How would you like to see the user configuration for such a mode done? (i.e. a new option like inlineOnly, or re-use existing configuration and just start paying attention to inlineDocRefAttachmentsSmallerThan even if downloadAttachments is false?)
I have some code that I'm testing for my own purposes to do this, and I can propose a PR later. But it will likely need some tweaking to fit your preferred configuration changes to allow for "inline only".
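As a sketch of the decision logic being proposed, assuming a hypothetical inlineOnly option and mimetype allow-list (neither is an existing config option):

```typescript
// Sketch of an inline-only mode. Both `inlineOnly` and `inlineMimeTypes`
// are hypothetical options for illustration, not existing client config.
interface AttachmentOptions {
    inlineOnly: boolean;
    inlineMimeTypes: string[]; // e.g. ["text/html", "text/plain"]
}

function shouldDownloadAttachment(
    contentType: string | undefined,
    opts: AttachmentOptions
): boolean {
    if (!opts.inlineOnly) return true;  // current behavior: download everything
    if (!contentType) return false;     // no declared type: skip in inline-only mode
    // Prefix match so "text/html; charset=utf-8" still counts as text/html
    return opts.inlineMimeTypes.some(t => contentType.startsWith(t));
}
```

The filter can run on `DocumentReference.content.attachment.contentType` before any request is made, which is what avoids the slow PDF/RTF downloads described above.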
The _since export parameter requires times to be formatted as FHIR instant datatypes. If a time is malformed, the client will silently drop the parameter. Bulk Data exports can take a long time to run, and it's not obvious that a chosen parameter has been silently dropped. It would be great if malformed _since values were caught early so that users aren't surprised when the export request doesn't include the parameter.
The following example illustrates a valid _since parameter which is properly included and everything works as expected. The invocation:
> node . --_since 2010-03-14T09:00:00.000-04:00 -f https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir -g 048d4683-703a-4311-963d-48e515a6372b
Kick-off started
Kick-off completed
Bulk Data export started
Status endpoint: https://bulk-data.smarthealthit.org/fhir/bulkstatus/9a593ac26be5e169a7aaa6cc3279eb90
Bulk Data export completed in 10 seconds
Downloading exported files: ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 100%
Downloaded Files: 11 of 11
FHIR Resources: 1,793
Attachments: 0
Downloaded Size: 125.6 kB
Uncompressed Size: 1.9 MB
Compression ratio: 1/16
Download completed in 1 second
Do you want to signal the server that this export can be removed? [Y/n]Y
The server was asked to remove this export!
Nothing in the CLI report indicates a _since parameter was included, but inspecting downloads/log.ndjson reveals the full export URL, which includes the _since parameter:
"https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir/Group/048d4683-703a-4311-963d-48e515a6372b/$export?_since=2010-03-14T09%3A00%3A00-04%3A00
The following example uses the same invocation, except the leading zero of the time zone offset is missing. The invocation:
> node . --_since 2010-03-14T09:00:00.000-4:00 -f https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir -g 048d4683-703a-4311-963d-48e515a6372b
Kick-off started
Kick-off completed
Bulk Data export started
Status endpoint: https://bulk-data.smarthealthit.org/fhir/bulkstatus/2af9534143f2bae9522780466490bc35
Bulk Data export completed in 11 seconds
Downloading exported files: ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 100%
Downloaded Files: 12 of 12
FHIR Resources: 2,185
Attachments: 0
Downloaded Size: 178.5 kB
Uncompressed Size: 2.7 MB
Compression ratio: 1/16
Download completed in 1 second
Do you want to signal the server that this export can be removed? [Y/n]Y
The server was asked to remove this export!
Inspecting downloads/log.ndjson reveals that the _since parameter was not included: "https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir/Group/048d4683-703a-4311-963d-48e515a6372b/$export
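The early validation being requested could look roughly like this. The regex is a simplified form of the FHIR instant pattern (the official one also constrains field ranges), and `assertValidSince` is a hypothetical helper:

```typescript
// Sketch: validate a --_since value against the FHIR instant shape
// before kick-off, failing fast instead of silently dropping it.
// Simplified from the FHIR R4 instant datatype; the real pattern also
// restricts month/day/hour ranges.
const FHIR_INSTANT = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

function assertValidSince(since: string): void {
    if (!FHIR_INSTANT.test(since)) {
        throw new Error(
            `Invalid _since value "${since}": expected a FHIR instant ` +
            `like 2010-03-14T09:00:00.000-04:00`
        );
    }
}
```

With this in place, the second invocation above (offset "-4:00" instead of "-04:00") would abort immediately with a clear message rather than exporting without the filter.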
For kick-off request:
For kick-off error (when applicable):
For status error (when applicable):
For completion manifest (when applicable):
For each file download:
For failed file downloads:
For completed file downloads:
cc: @vlad-ignatov @mikix
It looks like the current attachment downloading code assumes the url of a DocumentReference attachment points at raw data. The spec seems to allow for that not to be the case.
From https://www.hl7.org/fhir/datatypes.html#Attachment:
If the URL is a relative reference, it is interpreted in the same way as a resource reference
(https://www.hl7.org/fhir/references.html#references)
And we often see that be a Binary reference: for example, Epic usually gives "url": "Binary/xxxxxxx".
And then you presumably request the URL, get {"contentType": "xxx", "data": "xxx"}, then do the real data extraction.
The spec does seem to imply that a full absolute URL will just point at raw data. But it appears that Cerner will give a full URL like https://fhir-ehr.cerner.com/r4/ec2458f2-1e24-41c8-b71b-0e701af7583d/Binary/XR-198381926, as seen in their docs.
So it looks like even if you are given a full URL, you have to check whether it starts with the base URL first, then do the extra Binary handling.
I don't know what you'd do with a relative FHIR URL that isn't a Binary but something like a Patient; presumably that doesn't happen for DocumentReference.
I've not done much real world testing. Just spec reading and hearing from folks that have access to Epic. I have access to a Cerner sandbox, but it errors out when I try to export DocumentReference. So... 🤷
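The branching described above could be sketched like this. `classifyAttachmentUrl` is a hypothetical helper, and treating anything outside the FHIR base as raw bytes is the assumption made in this thread, not confirmed behavior:

```typescript
// Sketch: decide whether an Attachment.url should be handled as a
// Binary resource read (relative reference, or absolute URL under the
// FHIR base that ends in /Binary/{id}) or as a direct raw download.
function classifyAttachmentUrl(url: string, fhirBaseUrl: string): "binary" | "raw" {
    // Resolve relative references against the FHIR base, per the
    // Attachment datatype rules quoted above.
    const absolute = new URL(url, fhirBaseUrl + "/").href;
    if (absolute.startsWith(fhirBaseUrl) && /\/Binary\/[^/]+$/.test(absolute)) {
        // Read the Binary resource, then base64-decode its data element.
        return "binary";
    }
    // Assume the URL points directly at the attachment bytes.
    return "raw";
}
```

This covers both the Epic-style relative "Binary/xxxxxxx" and the Cerner-style absolute Binary URL with one rule: resolve first, then test against the base.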
Hello, I am trying to test the bulk data client with the SMART Bulk Data Server, and I am receiving the following error when running the CLI: "The onCancel handler was attached after the promise settled". For context, I am using the following parameters:
fhirUrl: "https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir"
group: "048d4683-703a-4311-963d-48e515a6372b" (the Neighborhood Health Plan group from the SMART Bulk Data Server)
config/defaults.js
The screenshot below shows the error (as well as the npm/node versions that I am working with). The error occurs when trying to download the files. (Note: I have been able to run the CLI successfully with this fhirUrl and group, so the error is not thrown consistently.)
The onCancel handler error seems to be coming from the got dependency (based on my understanding of this GH issue in the got repository).
Are there any configuration steps that I have missed that would result in this error? Thank you for your help!
Per https://datatracker.ietf.org/doc/html/rfc6750#section-2.1, the Bearer scheme in the Authorization header is expected to be capitalized as "Bearer":
b64token = 1*( ALPHA / DIGIT /
"-" / "." / "_" / "~" / "+" / "/" ) *"="
credentials = "Bearer" 1*SP b64token
In BulkDataClient.ts line 205, the scheme is lower-case ("bearer").
This caused unexpected behavior with Keycloak, where the header was not parseable as a token.
I'm raising this as it may be something to be aware of.
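A one-line sketch of the compliant form, with the capitalized scheme from the RFC 6750 grammar quoted above (`authorizationHeader` is a hypothetical helper, and the token is a placeholder):

```typescript
// Sketch: build the Authorization header with the capitalized "Bearer"
// scheme required by RFC 6750 ("Bearer" 1*SP b64token). Servers that
// parse the scheme case-sensitively, like the Keycloak case above,
// reject a lower-case "bearer".
function authorizationHeader(accessToken: string): string {
    return `Bearer ${accessToken}`;
}
```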
Output doesn't give a lot of details on what resourceType was exported:
Downloading exported files: ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0%
Downloaded Files: 0 of 13
FHIR Resources: 0
Attachments: 0
Downloaded Size: 0.0 B
It'd be great if there was an indication of which resourceType each file references.
Thanks, this is a great tool.
It appears that a recent commit caused a regression resulting in empty bulk data download files. I was able to use git bisect to trace the regression to commit 8b642be. Upon further investigation I noticed that the TypeScript is not transpiled at every step, which means the error could be in a different commit. 3f8d2e3 needs a minor modification to transpile (adding file?: string to the LoggingOptions type), but otherwise works.
470f6f0 appears to be the true culprit.
Here is the command I used:
node . -f https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1IjozLCJkZWwiOjB9/fhir -g 048d4683-703a-4311-963d-48e515a6372b
The console will show:
Kick-off started
Kick-off completed
Bulk Data export started
Bulk Data export completed in 10 seconds
Downloading exported files: ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 100%
Downloaded Files: 12 of 12
FHIR Resources: 2,185
Attachments: 0
Downloaded Size: 178.5 kB
Uncompressed Size: 2.7 MB
Compression ratio: 1/16
Download completed in 2 seconds
But the files in the downloads directory will be empty:
> ls -al downloads
... 480 Jan 10 17:26 .
... 704 Jan 10 17:26 ..
... 0 Jan 10 17:26 1.CarePlan.ndjson
... 0 Jan 10 17:26 1.Claim.ndjson
...
Tested with node v17.19.1.
I've had the situation recently where Cerner likes to give many many tiny files as the bulk export result. Each file has no more than 200 lines and no more than one patient. (So some files are very short, like one line, if the patient doesn't have many resources.) In one recent example, the server wanted me to download over 20k files.
In this situation, the bulk-data-client starts tearing through those files as fast as it can (and by default that means 5 files at once). But that means the server quickly gets mad at it, giving 429 errors and the like.
To avoid 429 errors, I've had to reduce the parallelDownloads setting to 1. But even then, I get the occasional HTTP hiccup (20k requests, after all), like a 502 gateway error.
So my feature request is to retry download attempts in general, with some backoff for 429 results specifically.
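The requested behavior could be sketched as follows. `download` and `getStatus` are hypothetical hooks, and the transient status-code set is an assumption; a fuller version would also honor a 429 Retry-After header and add jitter:

```typescript
// Sketch: retry a file download on transient HTTP errors (429/502/503)
// with exponential backoff. Both callbacks are hypothetical hooks for
// illustration, not existing client APIs.
async function downloadWithRetry(
    download: () => Promise<void>,
    getStatus: (err: unknown) => number | undefined,
    maxRetries = 5,
    baseDelayMs = 1000
): Promise<void> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await download();
        } catch (err) {
            const status = getStatus(err);
            const transient = status === 429 || status === 502 || status === 503;
            if (!transient || attempt >= maxRetries) throw err;
            // Exponential backoff: base, 2x, 4x, ... A real implementation
            // would prefer the server's Retry-After value on 429s.
            await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
        }
    }
}
```

Combined with a low parallelDownloads setting, this would let a 20k-file export survive the occasional 502 without a manual restart.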
It's a very similar use case to DocumentReference.content.
I myself don't need this for a specific need. I'm just envisioning a future time where the Cumulus team might want to run NLP on those notes, and thus archive/inline the text when exporting.
But for now, this is a low priority wishlist request.