smart-on-fhir / fhir-bulk-data-docs

Documentation and issue tracking for the emerging FHIR bulk data implementation guide

fhir-bulk-data-docs's Introduction

FHIR Bulk Data Access

Note

This repository is no longer being used for official development of the specification. Please see:

https://hl7.org/fhir/uv/bulkdata for the formal publication (first published August 2019)
https://github.com/hl7/bulk-data for the GitHub repository in active use, within the HL7 org
https://build.fhir.org/ig/HL7/bulk-data for the up-to-the-minute continuous integration build of the spec

fhir-bulk-data-docs's Issues

Diagram error

This is just a reminder that the auth diagram is still wrong (showing "iss": "https://{app url}").

Use case :: Resource Mappings (e.g., for Common Clinical Data Set and for Financial Data)

We've had some discussion about which resources we'd expect as a "minimum bar" for a specific use case.

In the Argonaut context, we'd want to spell out a use case focused on the Common Clinical Data Set, focused on the resources that have been profiled by Argonaut to meet this data set: http://www.fhir.org/guides/argonaut/r2/

In the Financial context, it would be good to spell out what bulk financial data would look like, e.g. in a way that's compatible with the resources specified in the CMS Blue Button 2.0 API: https://bluebutton.cms.gov/developers/#fhir-data-model

It would be good to create a use-cases.md file with details for this.

Using the Group resource

How might we effectively use the Group resource to support bulk export data? The example here seems to pertain only to the Patient resource. What if we consider a use case where we are attempting to get a slice of lab results based on some grouping criteria?

The FHIR specification indicates that the Group resource has a member attribute, which I would assume is the location where an array of the actual data of interest would be housed (for example, a list of Observation.)

https://www.hl7.org/fhir/group.html

The issue I see here is that member.entity represents data by reference - this would equate to a relative FHIR URL for each value.

For example, given the following GET call:

http://example.org/Group/12345/Observations

We'd most likely get the following (some attributes omitted for brevity):

{
    "resourceType": "Group",
    "identifier": [
        {
            "system": "http://example.org",
            "value": "12345"
        }
    ],
    "member": [
        {
            "entity": {
                "reference": "http://example.org/Observations/1"
            }
        },
        {
            "entity": {
                "reference": "http://example.org/Observations/2"
            }
        },

    ...
    ]
}

Should we consider using the Bundle resource instead, which contains Bundle.entry.resource, allowing for an actual representation of the data?

Clear the authorization docs

While trying to implement the latest auth spec, I encountered several problems. They might be an issue or maybe the descriptions are just not clear enough. I will submit those as comments below:

[Question] How do you surface the markdown files?

I am wondering how you display the source controlled markdown files (that will eventually make up a full fledged implementation guide) contained in this repo? I think source controlling them here is a great idea - curious to know if there is a web hook or similar process to ingest and host them once changes are merged, etc.

What are the requirements around the issuer URL, what is it used for?

Are there requirements we should define for the issuer URL? Initial thoughts:

https?
Recommended methods to confirm domain is owned by the consuming application?
absolute URL?

This is also a bit artificial right now - the issuer doesn't currently serve a purpose if the client id is going to be assigned. We could use the issuer URL to help bootstrap trust. EG: A model where the consumer / client offers an https:// endpoint as their “issuer URL”, the EHR then discovers their JSON Web Keys from that URL (for example, at ./.well_known/smart_client_keys). The issuer could replace the client id concept (or it could exist in the background but may not be used in practice) and no longer require registering public keys (and headaches associated with rotation and revocation…).

One of the concerns we've discussed with this approach is how this would work with off the shelf OAuth servers.

Tools and References sections?

We should move the tools out of the core part of the specification. We could do this a few ways, by providing references to them in a summary or abstract, or by having a section that references them.

We should have a references section that links to prerequisite specifications (OAuth 2, JWT, etc).
In addition, we should try to defer to these specifications when discussing specifics (eg: how to validate a JWT) since they're the authoritative source.

Define system-wide $export

Our Argonaut use cases have focused on exporting data about groups of patients. There are also bulk export use cases for non-patient-related resources like:

Export all resources from a server (e.g., for backup / snapshots)
Export all resources of a particular type (e.g., "terminology data" export)

We can handle these nicely by defining a system-level export like:

GET [base]/$export?_type=CodeSystem,ValueSet
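
As a rough illustration only, a client might assemble such a request URL along these lines. The `build_export_url` helper and the example base URL are hypothetical and not part of the spec; the sketch assumes `_type` stays a comma-separated list as in the example above.

```python
from urllib.parse import urlencode


def build_export_url(base, resource_types=None):
    """Return a system-level $export URL, optionally restricted by _type."""
    url = base.rstrip("/") + "/$export"
    if resource_types:
        # _type is a comma-separated list; keep "," unescaped for readability.
        query = urlencode({"_type": ",".join(resource_types)}, safe=",")
        url += "?" + query
    return url


# Hypothetical base URL, used only for illustration.
print(build_export_url("https://example.org/fhir", ["CodeSystem", "ValueSet"]))
```

Omitting `resource_types` yields a bare `[base]/$export`, which would correspond to the "export all resources" case.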

Token Expiration too long

The current specification calls out a 15 minute expiry, though OAuth 2 recommends that tokens live no longer than 10 minutes. In addition, the SMART specification even recommends 5 minutes. I assume we would want 10 or less before expiry.

Define expectations for rate limiting on polling requests

From May Connectathon:

Throttling while polling
- we should have consistent error codes (e.g. 429 status code for "too many requests")
- We need to test the currently defined approach with "Retry-After" header containing a http date or a delay time in seconds
- Should servers prevent clients from starting new exports while a previous request is running?
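
For the "Retry-After" point, here is a minimal client-side sketch of handling both forms of the header (an HTTP date or a delay in seconds). The `retry_after_seconds` name and the injectable `now` argument are assumptions made for illustration, not anything the spec defines.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to a non-negative delay in seconds."""
    value = value.strip()
    if value.isdigit():
        # Form 1: a delay expressed as a whole number of seconds.
        return float(value)
    # Form 2: an HTTP date; compute how far in the future it is.
    now = now or datetime.now(timezone.utc)
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A polling client could sleep for the returned number of seconds after a 429 before retrying the status request.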

Within JWKS, do we expect certs, bare keys, or both?

When creating a JWKS, each key can be represented in a few ways:

x5c property (a certificate chain, with each certificate as base64-encoded DER)
x5u property (a URL leading to a PEM-encoded certificate chain)
"bare key" with properties like
- n and e (modulus and exponent, for RSA keys)
- crv, x, and y (curve, x-coordinate, and y-coordinate, for EC keys)

We want to avoid a situation where some servers expect x5u values and other servers expect bare keys, without a client being able to tell the difference.

From the perspective of interop, I'd like to propose that a client should always include bare key values, and a server should always be able to process bare key values.

@daliboz @kpshek @isaacvetter and others, I would love feedback. Especially

Do you agree we should say something about this?
Do you agree with making bare keys the "common denominator" for implementations?
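
To make the proposal concrete, here is a small sketch of how a server or test tool might flag keys in a JWKS that lack bare key values. The function names are hypothetical, and the check only looks for the presence of the parameters listed above; it does not validate the key material.

```python
# Public parameters that make up a "bare key", per key type.
BARE_KEY_PARAMS = {
    "RSA": ("n", "e"),
    "EC": ("crv", "x", "y"),
}


def has_bare_key(jwk):
    """True if the JWK carries the bare public-key parameters for its kty."""
    required = BARE_KEY_PARAMS.get(jwk.get("kty"))
    return required is not None and all(jwk.get(p) for p in required)


def keys_missing_bare_values(jwks):
    """Return the kid of every key that only offers x5c/x5u (or nothing usable)."""
    return [k.get("kid") for k in jwks.get("keys", []) if not has_bare_key(k)]
```

A key that carries both x5c and bare values would pass this check, which matches the "common denominator" idea: certificates are allowed, but bare values are always present.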

Binary resource?

How can we handle resources referencing binary data, like this report:

{
        "resourceType": "DiagnosticReport",
        "id": "7765466",
        . . . 
        "subject": {
          "reference": "Patient/4342009",
          "display": "SMART, NANCY"
        },
        "encounter": { "reference": "Encounter/4277906" },
        "effectiveDateTime": "2018-03-01T17:07:08.000Z",
        "issued": "2018-03-01T17:11:10.000Z",
        "performer": {
          "reference": "Practitioner/4474007",
          "display": "Pickering, Kathy"
        },
        "request": [ { "reference": "ProcedureRequest/23441893" } ],
        "presentedForm": [
          {
            "contentType": "text/html",
            "url": "https://fhir-myrecord.sandboxcerner.com/dstu2/0b8a0111-e8e6-4c26-a91c-5069cbc6b1ca/Binary/TR-7765466"
          },
          {
            "contentType": "application/pdf",
            "url": "https://fhir-myrecord.sandboxcerner.com/dstu2/0b8a0111-e8e6-4c26-a91c-5069cbc6b1ca/Binary/XR-7765466"
          }
        ]
      }

(Coming from Cerner)

Binary resource cannot be exported in ndjson files

Can a service have registration-time JWKS URI but not provide a jku header?

In this section https://github.com/smart-on-fhir/fhir-bulk-data-docs/blob/master/authorization.md#server-obligations-for-signature-verification, it is said that:

"If jku is absent, create a set of potential key sources consisting of: all keys found by dereferencing the registration-time JWKS URI + any keys supplied in the registration-time JWKS. Proceed to step 3."

In my implementation, an app is registered either with JWKS or with JWKS URL (but not both). If the app has a JWKS URI, its token header contains jku. Otherwise it does not and uses JWKS instead.

It appears to me that the two bold parts above should be joined with or instead of +, because I can't have both of them? I am not sure about this though. Perhaps I have not understood it correctly?

Rename status response "secure" to be more precise

The status response currently includes a boolean secure element, which is intended to indicate to the client a requirement to authenticate with their backend services access_token when accessing a specific file of FHIR resources.

The backend services access_token is intended to be short-lived and may have expired before files are ready to be downloaded. If an access_token is required, the client would need to re-request it.

While it's probably valuable to inform the client that they do need to use an access_token to access these files, there are a number of alternative security mechanisms in wide-spread use that would make the download of bulk PHI "secure" that would not involve an OAuth2 access_token. (AWS's signed urls are a good example).

For these reasons, we should look at renaming this element from secure to a word that more precisely describes the requirement for the use of an http Authorization header access_token.

Typo in Bulk Data Status Request section

The text says "Clients should follow the an exponential backoff..."
We should remove "the".

Simplification of Verification algorithm and related JOSE header parameters

The Registration section of authorization.md states a service must either register a JWKS or a JWKS URL, i.e. one or the other. So, the optional jku header adds no value since the current Verification algorithm requires that the server first confirm the jku value matches the preregistered URL value. It seems to me that step 1 of Verification algorithm could be eliminated entirely and step 2 shortened, changing the "+" to "or", since only one of those two alternatives should apply.
If step 1 is retained, there is a typo in 1a (should be "is not whitelisted")
(minor) the typ header is not needed, as only JWTs are allowed for the relevant request parameters and no type disambiguation is needed (ref: RFC7519 section 5.1)
(minor) consider rephrasing step 3 in the positive instead of the negative, i.e. select the keys where alg and kid match... (and changing "remaining" in the next step to match)
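
A sketch of what the simplified algorithm suggested above might look like, assuming a client registers either a JWKS or a JWKS URL (never both) and that step 3 is phrased positively. The `registration` dictionary layout and the `fetch_jwks` callable are hypothetical; signature verification itself is out of scope here.

```python
def select_verification_keys(registration, header, fetch_jwks):
    """Pick candidate keys for verifying an authentication JWT.

    registration: dict holding either "jwks_uri" or "jwks" (assumed layout).
    header: the decoded JOSE header of the JWT.
    fetch_jwks: callable that dereferences a JWKS URL and returns a dict.
    """
    # Step 2 with "or" instead of "+": only one key source applies.
    if registration.get("jwks_uri"):
        keys = fetch_jwks(registration["jwks_uri"])["keys"]
    else:
        keys = registration["jwks"]["keys"]
    # Step 3, stated in the positive: keep keys whose kid and alg match.
    return [
        k for k in keys
        if k.get("kid") == header.get("kid") and k.get("alg") == header.get("alg")
    ]
```

The server would then attempt signature verification with each returned key and reject the request if the list is empty.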

OpenId token in backend services auth profile

Grahame suggests we add language to indicate that if the client that is connecting to a server using back end services spec is acting on behalf of a human user, it should identify the user to the back end server with an openid token

Move download links from header to body?

Currently, generated file links are passed in an http Link header, per https://tools.ietf.org/search/rfc5988#page-6 . Some servers have an 8k limit on headers which may lead to failure in the case of long link urls and/or many links.

Should we instead pass the links in the response body as ndjson? Perhaps something like:

{
  "_type": "Observation", 
  "href": "https://data/file/location/0001.Observation.ndjson"
}

This would respect the initial Accept: application/fhir+ndjson header by unifying the content types for the file list and the files themselves. It would also let us include additional metadata down the road.
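
For a sense of what consuming such a body could look like on the client, here is a minimal sketch. It assumes one JSON object per line with the `_type` and `href` fields proposed above; the `parse_link_lines` helper is hypothetical.

```python
import json


def parse_link_lines(body):
    """Group download links from an ndjson body by resource type."""
    links = {}
    for line in body.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        links.setdefault(entry["_type"], []).append(entry["href"])
    return links
```

Extra metadata fields added later would simply be ignored by a client written this way.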

Renamed requiresAccessToken to requiresAccessToken

At the very last line of https://github.com/smart-on-fhir/fhir-bulk-data-docs/blob/master/export.md (the change log) it says "Renamed requiresAccessToken to requiresAccessToken". The requiresAccessToken word is the same in both places.

Non-Patient Compartment Resources

Do we need to specify which non-patient resources (like Practitioner and Organization) should be included in the $export requests if the _type parameter isn't supplied, or should we leave this to the discretion of the server implementation (which may be more realistic in real world implementations)?

Comments on authorization.md

Should the "JWKS URL (preferred)" URL be specified at a .well-known URL?
Perhaps the recommended token expiry time should be 10 minutes? 10 seems a little more common and provides a bit more leeway if system clocks are a little off.

Allow clients to rotate keys?

If we want to allow clients to rotate their own keys, we should describe:

How a client can include the "key ID" in its authentication JWTs
How an EHR can resolve a set of keys for a client based on the issuer URL

Missing referenced resources

More a question than an issue: is the expectation that all referenced resources should be part of an export set? Or is it admissible that some exported resources reference resources that are not in the export set?

If it is the latter, what is the client supposed to do with those references? Try to access the server directly to get that data? That could generate a ton of traffic.

(I am asking because the SMART test server exports Encounter resources referencing Organization resources that are not in the export set, and the Cerner server exports Condition resources referencing Encounter resources that are not in the export set)

Specify "Accept: application/json" for polling request

Should clarify that a client should supply Accept: application/json for a polling request.

See discussion at https://chat.fhir.org/#narrow/stream/95-bulk-data/subject/Cologne.20Connectathon/near/148990

File name format

Currently we have some example file names like 0001.Patient.ndjson. It is probably worth documenting the file name format better. Possible questions are:

The number prefix is designed to control how the client OS sorts the files. Does it have to be placed before the resource type (e.g. 0001.Patient.ndjson), or after it (Patient.0001.ndjson)?
Also, do these numbers really need to be prefixed with zeroes and if so, to what length?
The resourceType portion - should it be capitalized to match the FHIR conventions or lowercased to improve cross-OS compatibility?
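
On the zero-padding question, a quick sketch shows why the padding matters for plain lexical sorting. The `export_file_name` helper and the default width of 4 are assumptions taken from the 0001.Patient.ndjson example, not a documented format.

```python
def export_file_name(index, resource_type, width=4):
    """Build a name like 0001.Patient.ndjson (format assumed from the examples)."""
    return f"{index:0{width}d}.{resource_type}.ndjson"


# Without padding, file 10 sorts before file 2 in a lexical listing.
unpadded = sorted(f"{i}.Patient.ndjson" for i in (1, 2, 10))
# With padding, lexical order matches numeric order.
padded = sorted(export_file_name(i, "Patient") for i in (1, 2, 10))

print(unpadded)
print(padded)
```

The chosen width caps how many files sort correctly (9999 for a width of 4), which is one reason the length question above is worth answering explicitly.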

Use of "Expires" header in Complete Status response

Hey folks, I was wondering if it's really appropriate to use the "Expires" HTTP header in the "Complete" status response to indicate when the results will expire.

After digging into this a bit for our implementation of a FHIR Bulk Data service, I realized the "Expires" header is actually part of HTTP caching (alongside Cache-Control), which deals with caching of web responses.

I won't proclaim to be an expert in HTTP but it seems like we're misusing this header for something it wasn't quite intended for and instead, it may be better to actually put the expiration date of the files within the actual JSON body of the response (along with the file URLs, etc.)

Does anyone have thoughts on this one way or the other?

Where is server's nominated transaction time returned?

This is the time that can be used as the start search parameter in future requests. Should we return a bundle as one of the links in every request to host this metadata or add a custom header? If we end up moving the links to the body, we could potentially include it there per #1 . Other approaches?

Allow clients to specify output target as part of export request

Currently, the server controls where to store export files after they have been generated. However, we (myself and the other members of the GCP FHIR team) believe users and clients may want to control the destination target for the generated files. Specifically, controlling the destination allows users/clients to implement export retention policies, localization policies, additional access control policies, and so on.

So, for example, a client could say something like:

GET [fhir base]/Group/[id]/$export?destination=sftp://1.2.3.4/folder

Or...

GET [fhir base]/Group/[id]/$export?destination=d://path/to/network/folder

Or, in our case,

GET [fhir base]/Group/[id]/$export?destination=gs://bucket/folder

The FHIR Server would be responsible for describing and implementing supported destinations, including whether or not there is a default destination (e.g. local) and the retention policy of that default.

What do others think? I tried to preserve backcompat with the default but perhaps that's less of an issue given the version status of the spec.

Handling responses with no data

If the bulk data query doesn't match any data (for example, if no records have been created or updated since the start date), how should this be handled? One option would be for the progress endpoint to return "204 No Content".

How to handle the potential duplicates generated in the execution?

I am thinking from an implementer's point of view: the server may want to parallelize the execution, for example, dispatching one job per patient compartment. These patient compartments, however, may include shared resources. In the end, it may be too expensive to aggregate and dedup. I am curious to learn your thoughts. Is it expected that the server has to do a "MapReduce"?

Thank you very much!
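
For discussion, here is a minimal sketch of the aggregation step in question, assuming each parallel job yields parsed resources and that a resource is identified by its resourceType and id. The `merge_without_duplicates` name is hypothetical.

```python
def merge_without_duplicates(*shards):
    """Yield resources from all shards, skipping ones already emitted."""
    seen = set()
    for shard in shards:
        for resource in shard:
            key = (resource["resourceType"], resource["id"])
            if key not in seen:
                seen.add(key)
                yield resource
```

The `seen` set grows with the number of distinct resources in the export, which is exactly the cost being asked about: a server that skips this step pushes deduplication onto the client.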

URL Operation Conflict

Patient/$everything is already an established operation.

How about an alternative of /$bulkdata (a system-level operation) or if you want the operation "compartmentalized" within established FHIR compartments: Group/[id]/$bulkdata

References between resources

I don't see anything in the specs about references between resources.

The test data from the SMART reference server has resource ids set to GUIDs and the corresponding references using urn:uuid:<guid> - is this normative?

requiresAccessToken

Shouldn't be a better name than requiresAuthorizationToken

How to retrieve data not tied to an individual patient

How should we handle data for global entities like organizations?

Support bulk data protocols on other FHIR requests

The Async mechanism has advantages also for other requests and operations. Therefore I suggest that we broaden the scope of bulk data to include also other requests on the FHIR server. This will allow it to be also used in other time consuming or data heavy requests. Examples of those are retrieval of all resources of a type, evaluating measures and applying plan definitions.

Based on the current specification, you could split the spec up into the (semi-independent) patterns presented below.

$export operation
The current bulk data spec uses the parameters _type and _sync, … These should become a regular operation on Patient.

Trigger async response
This pattern triggers an async response to a GET request on the FHIR repo.
The trigger for this would be the header: Prefer = respond-async.
The result will follow the process as explained in the spec with the alterations discussed in the sections below.

Retrieval format
The current spec only recognizes application/fhir+ndjson, I suggest we also support the normal json and xml export.

The last three concepts can be applied to all GET requests in FHIR and are, so far as I can see, compatible with the current FHIR spec.

authorization: clarify server obligation for jti uniqueness validation is limited to 5 min+sub

In consultation with @lmtthws, we'd like to suggest that the jti nonce uniqueness server obligation be further specified.

The Server Obligations section of the authorization spec includes the requirement that the FHIR server:

check that this is not a jti value seen before (prevention of replay attacks)

To prevent the possibility of replay attacks, the server must validate that it hasn't seen this jti value within the maximum lifetime of the requesting JWT that contains the jti nonce. The actual obligation within the specification is less detailed and therefore suggests that the jti should be unique for longer than the 5 minute lifetime of the JWT (per the exp value), even though uniqueness within that lifetime is all that's necessary to actually prevent a replay attack. Further, the actual uniqueness validation logic could be enhanced by tracking these jti nonces per issuer and/or sub (client_id) as well. I think that client_id will end up being more controlled than issuer, so I'd recommend checking per client_id / sub.

How about changing this phrase from the above, to:

check that this is not a jti value previously encountered for the given sub for the maximum possible expiration length (5 minutes). This check prevents replay attacks.

Isaac
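
A minimal in-memory sketch of the proposed check, tracking jti values per sub for the 5 minute maximum lifetime. The `JtiReplayCache` class and the injectable clock are illustrative assumptions; a real server would likely need shared storage across instances.

```python
import time


class JtiReplayCache:
    """Remember (sub, jti) pairs only for the maximum JWT lifetime."""

    def __init__(self, max_lifetime=300, clock=time.monotonic):
        self.max_lifetime = max_lifetime  # 5 minutes, per the proposal above
        self.clock = clock
        self._seen = {}  # (sub, jti) -> time after which the entry can be dropped

    def check_and_store(self, sub, jti):
        """Return True if this jti is new for this sub, False if it is a replay."""
        now = self.clock()
        # Forget entries older than the maximum lifetime.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        key = (sub, jti)
        if key in self._seen:
            return False
        self._seen[key] = now + self.max_lifetime
        return True
```

Keying on (sub, jti) means two different clients can use the same jti value without colliding, and memory use is bounded by the request volume of the last five minutes.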

Notes about the JWKS structure

Perhaps we can add a sentence or two to state that:

Services should not include private keys in hosted JWKS.
Public and private keys (pairs) in JWKS should have the same kid.

It is already mentioned between the lines but it takes some time to "discover" it:

kid	required	The identifier of the key-pair used to sign this JWT. This identifier MUST be unique within the backend services's JWK Set.

jku	optional	The URL to the JWK Set containing the public key(s). When present, this should match a value that the backend service supplied to the EHR at client registration time.

https://tools.ietf.org/html/rfc7517#section-4.5 is also not very clear about 2.

I understand that this is not exactly in the scope of this spec, but it would make a big difference for the reader.

Recommendations around error handling

The current backend services specification doesn't call out how servers should handle errors. We should add some requirements around what types of errors are expected and any recommendations. For example, we've used the error_uri (from OAuth) in our responses, which helps provide human readable error responses (and allows for localization).

How should a server represent errors to the client?

For the bulk data API, let's consider a request for resources for 10k patient records.

Naturally, some errors may occur during data retrieval. We want to communicate these errors to the bulk client.

One way to report errors, where a small subset of resources are unavailable, could be to return an OperationOutcome .ndjson file with OperationOutcome.issue.location used to identify the patient whom this applied to.

Isaac
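
As a sketch of the idea, one line of such an error file could be produced like this. The `operation_outcome_line` helper is hypothetical, and writing a "Patient/[id]" string into issue.location is only an assumption about how the patient might be identified; the severity and code values are placeholders.

```python
import json


def operation_outcome_line(patient_id, message):
    """Serialize one OperationOutcome as a single ndjson line."""
    outcome = {
        "resourceType": "OperationOutcome",
        "issue": [
            {
                "severity": "error",
                "code": "processing",
                "diagnostics": message,
                # Assumption: identify the affected patient by reference string.
                "location": [f"Patient/{patient_id}"],
            }
        ],
    }
    return json.dumps(outcome)


print(operation_outcome_line("4342009", "Observation data unavailable"))
```

The server would write one such line per failed patient into an error .ndjson file listed alongside the data files.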

Should we specify algorithms for asymmetric signatures in authentication JWTs?

Right now we're silent on this. Should we pick one or more algorithms as "must support" if we want consistency? e.g. RS256?

What are the EHR's obligations in validating an authorization JWT?

Currently the spec leaves a TODO to define a server's obligations in validating a client's JWT. We should flesh this list out.

Servers SHALL

validate the signature on the JWT
check that the JWT exp is valid
check that the JWT aud matches the server's OAuth token URL (the URL to which the token was POSTed)
check that this is not a jti value seen before (prevention of replay attacks)
ensure that the client_id provided is known and associated with the supplied iss
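
The claim checks in this list (everything except signature validation) could be sketched roughly as follows. All names here are hypothetical, the client_id is assumed to travel in the sub claim, and "exp is valid" is interpreted simply as "exp lies in the future"; signature verification is assumed to have happened already with a JOSE library.

```python
import time


def check_claims(claims, token_url, known_clients, seen_jtis, now=None):
    """Return a list of problems found in already signature-verified claims.

    known_clients maps client_id -> registered iss (assumed registry shape).
    seen_jtis is a set of jti values the server has already accepted.
    """
    now = time.time() if now is None else now
    problems = []
    if claims.get("exp", 0) <= now:
        problems.append("exp is missing or in the past")
    if claims.get("aud") != token_url:
        problems.append("aud does not match the token URL")
    if claims.get("jti") is None or claims["jti"] in seen_jtis:
        problems.append("jti is missing or was seen before")
    sub = claims.get("sub")
    if sub not in known_clients or known_clients[sub] != claims.get("iss"):
        problems.append("client_id is unknown or not associated with iss")
    return problems
```

An empty list means the request may proceed; the server would then record the jti so a replayed JWT is rejected.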