inadarei / rfc-healthcheck

Health Check Response RFC Draft for HTTP APIs

Home Page: https://inadarei.github.io/rfc-healthcheck/

License: MIT License

rfc-healthcheck's Introduction

api-healthcheck

Published RFC Draft: https://tools.ietf.org/html/draft-inadarei-api-health-check

Workspace Setup

> git clone https://github.com/inadarei/rfc-healthcheck.git
> sudo -H gem install kramdown-rfc2629
> sudo -H easy_install pip # optional, if you don't already have it
> sudo -H pip install xml2rfc
> .githooks/install.sh # to enable automated rebuilds on git push

Using

Edit draft.md
To regenerate the latest version of XML/TXT/HTML:
```
make latest
```

Known Implementations

Node.js: https://github.com/inadarei/maikai
Golang: https://github.com/nelkinda/health-go
.NET: https://github.com/RockLib/RockLib.HealthChecks
Python: https://github.com/Colin-b/healthpy

References

In creating this RFC, the following existing standards were reviewed and taken into account:

rfc-healthcheck's People

dret randallsquared miqui taliaga peteraritchie nelkinda hgsgtk kalexmills oscaredel brenoinojosa bfriesen michel-zimmer

rfc-healthcheck's Issues

Status update

Hi, @inadarei.
I came across your RFC while analyzing health check guidance and best practices, and would like to ask what its current status is, as the draft seems to have expired.
Cheers
Carlos Souto

There are a few tiny mistakes, such as "Calrify" instead of "Clarify" and, in one place, "observedUnit" where the text clearly means "observedValue". There are also one or two grammar issues ("each" is singular, and in one case it is followed by plural words).

status code definition

I don't fully agree with the status codes definition:

For “pass” and “warn” statuses HTTP response code in the 2xx - 3xx range MUST be used. for “fail” status HTTP response code in the 4xx - 5xx range MUST be used. In case of the “warn” status, additional information SHOULD be provided, utilizing optional fields of the response.

It seems strange to use a 4xx when the request is correct (e.g. well-formed and properly authenticated) and the health resource does exist.
A 5xx should be reserved for when the health resource itself is not operating correctly.
My initial reaction would be to always use a 200 when the status response correctly represents the state of the system, even if that state is fail. I know that it is common practice to use a 5xx status to represent a failure system status, however that information should be on the resource representation, via this media type, and not on the response status.
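
To make the quoted rule concrete, here is a minimal sketch of a consistency check between the status in the body and the HTTP response code. The function name `code_matches_status` is hypothetical; only the ranges come from the draft text quoted above.

```python
def code_matches_status(status: str, http_code: int) -> bool:
    """Check an HTTP code against the ranges required by the quoted draft text."""
    if status in ("pass", "warn"):
        return 200 <= http_code <= 399   # 2xx-3xx range
    if status == "fail":
        return 400 <= http_code <= 599   # 4xx-5xx range
    raise ValueError(f"unknown status: {status!r}")
```

Under the rule as written, the combination argued for here (a 200 response whose body says "fail") would be rejected: `code_matches_status("fail", 200)` is `False`.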

Version number considerations

General practice disagrees with a statement made about version numbers in the (otherwise great, I already recommend this to projects!) draft. The draft states:

in well-designed APIs, backwards-compatible changes in the service should not update a version number.

However, a lot of people follow semantic versioning, which has the benefit that it is possible to tell, by comparing two version numbers, whether the difference is bugfixes (1.2.7 to 1.2.8), new features (1.2.7 to 1.3.0), or compatibility-breaking changes (1.2.7 to 2.0.0).

I would like to interpret version to be a version that ideally would follow semantic versioning, and releaseID can be something quite customized, like a git commit hash or build number.

Therefore I suggest changing the statement "in well-designed APIs, backwards-compatible changes in the service should not update a version number" to "in well-designed APIs, it can be told from the update of a version number whether the changes are backwards-compatible".

In case readers haven't heard of semantic versioning, here's a link: https://semver.org/
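
As a rough illustration of the comparison described above, the sketch below classifies the difference between two version strings. It assumes plain MAJOR.MINOR.PATCH versions with no pre-release or build metadata, and the name `classify_change` is made up for this example.

```python
def classify_change(old: str, new: str) -> str:
    """Classify the difference between two MAJOR.MINOR.PATCH version strings."""
    old_parts = tuple(int(p) for p in old.split("."))
    new_parts = tuple(int(p) for p in new.split("."))
    if len(old_parts) != 3 or len(new_parts) != 3:
        raise ValueError("expected MAJOR.MINOR.PATCH")
    if new_parts[0] != old_parts[0]:
        return "breaking"    # e.g. 1.2.7 to 2.0.0
    if new_parts[1] != old_parts[1]:
        return "feature"     # e.g. 1.2.7 to 1.3.0
    if new_parts[2] != old_parts[2]:
        return "bugfix"      # e.g. 1.2.7 to 1.2.8
    return "same"
```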

Fix "notes" and "output" fields in example

The current version of the RFC has an example that contains the following:

  "notes": [""],
  "output": "",

In general, empty strings and arrays are omitted (treated the same as nil/null), so it would probably be better to add example values to these items.
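
For context, this is roughly what a serializer that omits empty values does. The helper `prune_empty` is a hypothetical sketch, not something the draft defines.

```python
def prune_empty(value):
    """Recursively drop entries whose values are None, "", [] or {}."""
    empty = (None, "", [], {})
    if isinstance(value, dict):
        pruned = {k: prune_empty(v) for k, v in value.items()}
        return {k: v for k, v in pruned.items() if v not in empty}
    if isinstance(value, list):
        items = [prune_empty(v) for v in value]
        return [v for v in items if v not in empty]
    return value
```

Applied to the snippet above, both "notes" and "output" disappear from the output entirely, which is why example values would be more instructive.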

Should this only detail the media type and content?

The RFC details "making an(sic) health check endpoint available"; but should this only be a description of the content, and not detail a unique endpoint? e.g. to support:

GET /dbnode/dfd6cf2b HTTP/1.1
Host: api.example.com
Accept: application/health+json
User-Agent: MyHealthMonitor

Align API Health Response and Checks objects

As we've adopted the micro-service mind-set, we find ourselves with a larger and larger hierarchy of service calls being made from "aggregator" top-level services. For instance, since we use OAuth2 for authentication and Fortress (ANSI RBAC) for authorization, every one of these services will depend on those two. It's pretty easy to quickly have 4-5 micro-services in the hierarchy and in extreme cases these calls "fan-out" to 15 or 20 other systems.

I'd like to propose that the Checks object be aligned with the (top-level) API Health Check object and that an additional value of healthcheck be added to the pre-defined ComponentTypes listed in the specification. I'm a bit conflicted about whether another pre-defined MeasurementName (such as health) would be beneficial or if the MeasurementName should just be left off.

I do realize that it's still up to the service to determine how the healthcheck status of a subsystem affects its top-level status - some out-right failures of a subsystem will cause a degradation while others are more catastrophic - but I think there's a huge advantage in knowing that the format of the incoming data is another healthcheck.

Add node to "known" check fields

In #58 it's noted that node is used in the example without being defined in the specification. While I agree that the functionality requested in #58 may be important, I'd suggest that node or some other unique identifier is needed.

The specification states:

Since each sub-component may be backed by several nodes with varying health statuses, these keys point to arrays of objects. In case of a single-node sub-component (or if presence of nodes is not relevant), a single-element array SHOULD be used as the value, for consistency.

Since an array SHOULD be used, I feel like this specification should also identify the field within the check that identifies which instance of the component the enclosed component represents. The words node or nodeId make sense for clustered operation, but perhaps instance or instanceId would be more generic.

Add health-go to the list of Known Implementations

I've developed an implementation of this upcoming RFC for Golang under the name "health-go" and suggest that it be included in the list of Known Implementations. It is available at https://github.com/nelkinda/health-go

make format open/extensible by using a registry?

it seems like there is a good set of initial values, but that set may evolve over time to cover additional concepts. one popular model to cover this is to have a registry of values, with those defined in the original specification as the initial registry contents (https://tools.ietf.org/html/draft-wilde-registries-01). it all depends on how much this format is expected to evolve, and how easy that evolution should be. i am just pointing this out as an alternative design option for the format. i'd be more than happy to help with the specifics of establishing a registry.

Minor feedback

Hi, nice initiative!

Fwiw, here are some minor observations from my experience trying your proposal:

- There's a typo in section 4 (value eside).
- When talking about status:
  - 'warn' is mentioned with a MUST and later with a SHOULD.
  - It may save the reader time to give some specific examples of which HTTP codes could be used in the responses. It could say 207 for 'warn' and 424 for 'fail', maybe? It's only an example of course, but it saves having to go and read all the HTTP codes.
- In details, take "cassandra:connections" for example. It may be overkill to use arrays. I believe that if this proposal promotes "hierarchies" of health reports, then "cassandra:connections" should be responsible for aggregating the health of its upstream dependencies. Otherwise it may bring duplication and/or risk inconsistencies, wouldn't it?

Cheers!

Fix tiny typo? terminius

There may be a tiny typo, 'terminius' instead of 'terminus', in draft-inadarei-api-health-check-02.html#status.

ref: https://github.com/godaddy/terminus
ref: https://tools.ietf.org/id/draft-inadarei-api-health-check-02.html#status

Must links also return application/health responses?

Some discussion about recursively performing healthcheck of downstream components has come up. We would like to clarify whether links mentioned in responses of type application/health are expected to return application/health content or not.

If they do, it could enable automated crawling of a system's health.
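
A sketch of what such crawling could look like, assuming that linked resources do return health documents, that sub-components live under a top-level "checks" key, and that each check exposes its own health document under links.self. All three are assumptions, and the last point is exactly the question asked here. The `fetch` callable is a stand-in for whatever HTTP client is used.

```python
from typing import Callable, Dict, Optional, Set


def crawl_health(url: str,
                 fetch: Callable[[str], dict],
                 seen: Optional[Set[str]] = None) -> Dict[str, str]:
    """Follow "self" links found in checks and collect each document's status."""
    seen = set() if seen is None else seen
    if url in seen:              # guard against link cycles
        return {}
    seen.add(url)
    doc = fetch(url)
    statuses = {url: doc.get("status", "unknown")}
    for entries in doc.get("checks", {}).values():
        for entry in entries:
            link = entry.get("links", {}).get("self")
            if link:
                statuses.update(crawl_health(link, fetch, seen))
    return statuses
```

Passing `fetch` in as a parameter keeps the traversal independent of any particular HTTP library and makes it easy to exercise with canned documents.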

cassandra:connections[0] uses 'type' instead of 'componentType'

It seems to be allowed to include unspecified keys (according to the use of data in other example checks) but according to the spec, componentType SHOULD be included whenever componentId is included, so this appears to be a typo.

Relevant snippet from draft.md included below.

    "cassandra:connections": [
      {
        "componentId": "dfd6cf2b-1b6e-4412-a0b8-f6f7797a60d2",
        "type": "datastore",
        "observedValue": 75,
        "status": "warn",
        "time": "2018-01-17T03:36:48Z",
        "output": "",
        "links": {
          "self": "http://api.example.com/dbnode/dfd6cf2b/health"
        }
      }
    ],

Should be health+json?

rfc-healthcheck/draft-inadarei-api-health-check-01.xml, line 386 in 82cf972:

Since Hyper+JSON can carry wide variety of data, some data may require privacy

Ability to provide HTTP verb for affectedEndpoints

A single /test endpoint may be available under more than one HTTP verb (GET, POST, PUT, DELETE and PATCH most of the time), and the check may relate to only a subset of those verbs.

As of now there is no way to provide the HTTP verb(s).

observedUnit and connections

In the cassandra:connections there is an observedValue but no observedUnit. If that's implied, what does a connection metric value of 75 signify?

Is links a JSON object or an array?

In version 03, section 3.7 states links (optional) is an array of objects, but in the example, it is described as a JSON dictionary as follows:

"links": {
       "about": "http://api.example.com/about/authz",
       "http://api.x.io/rel/thresholds":
         "http://api.x.io/about/authz/thresholds"
}

Guarantee structure of health/check object relationship

The specification states:

Since each sub-component may be backed by several nodes with varying health statuses, these keys point to arrays of objects. In case of a single-node sub-component (or if presence of nodes is not relevant), a single-element array SHOULD be used as the value, for consistency.

Many languages use binding to convert from JSON to their internal object structure. Typed languages have an especially hard time with the idea that a sub-component might sometimes contain a check[] and that other sub-components might contain a check instead.

I'd propose that the above paragraph should be amended to read (emphasis added):

Since each sub-component may be backed by several nodes with varying health statuses, these keys point to arrays of objects. In case of a single-node sub-component (or if presence of nodes is not relevant), a single-element array ~~SHOULD~~ MUST be used as the value, for consistency.
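
Until the wording is tightened, a consumer has to tolerate both shapes. A minimal sketch of that defensive normalization follows; the function name `normalize_checks` is hypothetical.

```python
def normalize_checks(checks: dict) -> dict:
    """Wrap any bare check object in a single-element list."""
    return {key: value if isinstance(value, list) else [value]
            for key, value in checks.items()}
```

With a MUST in the specification this step becomes unnecessary, and typed bindings can declare every value as a list of check objects.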

In Details, report threshold

The Details structure could report an optional threshold to show at what level the status changes from pass to warn.

This would change the second sentence of 4.4 to "Clarifies the unit of measurement in which observedValue and threshold are reported, [...]".

This would add a section to chapter 4. Details:
"4.X thresholdValue
thresholdValue: (optional) could be any valid JSON value, such as: string, number, object, array or literal. This value is used to tell the value above or below which the observedValue would change the status from pass to warn."

I had the idea when looking at the cpu utilization in the example and thought of implementing it.
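
A small sketch of how a producer might derive the status from the proposed field, assuming numeric values. The `higher_is_worse` flag is an invention of this example to cover the "above or below" wording; it is not part of the proposal.

```python
def status_for(observed: float, threshold: float,
               higher_is_worse: bool = True) -> str:
    """Return "warn" once observed crosses threshold, otherwise "pass"."""
    crossed = observed > threshold if higher_is_worse else observed < threshold
    return "warn" if crossed else "pass"
```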

Canary proposal

I am currently deploying canary health check endpoints that indicate baseline info about the systems, and would like to propose this as a simplified interface. Based on the open discussion regarding the details, I think there is value in having a simple output. Example:

Suggested interface:
GET /canary?key=API_KEY&details=true

Not sending details in the query parameter will omit the details from the output.

HTTP Response: 200 OK when topmost status is true, 503 Service Unavailable otherwise.

{
  "name": "api-1",
  "status": true,
  "metrics": {
    "uptime": "P4DT12H30M5.123456S"
  },
  "details": {
    "database": {
      "status": true,
      "metrics": {
        "connections": 25,
        "latency": "PT0.000456S"
      }
    },
    "api-2": {
      "status": false,
      "details": {
        "s3": {
          "status": true,
          "metrics": {
            "latency": "PT0.000123S"
          }
        }
      }
    },
    "redis": {
      "status": true
    },
    "email": {
      "status": false,
      "critical": false
    },
    "node-1": {
      "status": true,
      "metrics": {
        "DiskAvailable": {
          "value": 14,
          "unit": "Gb"
        },
        "DiskUse": {
          "value": 31,
          "unit": "%"
        },
        "MemoryAvailable": {
          "value": 536,
          "unit": "Mb"
        },
        "time": "2018-07-31T01:07:51+00:00"
      }
    }
  }
}

If any critical services respond with status false, the parent service must also be set to status false. Services are critical by default, unless "critical": false is defined.
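
The propagation rule can be sketched as a short recursive function. This is one reading of the rule, using the field names from the example; the function name `effective_status` is hypothetical.

```python
def effective_status(service: dict) -> bool:
    """Return False if the service itself or any critical sub-service is failing."""
    if not service.get("status", False):
        return False
    for child in service.get("details", {}).values():
        # Sub-services are critical unless they say "critical": false.
        if child.get("critical", True) and not effective_status(child):
            return False
    return True
```

Note that under this reading a payload shaped like the example evaluates to false at the top level, because "api-2" reports false and is not marked "critical": false.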

Let me know your thoughts.

Align example with defined specification

The specification defines JSON attributes such as:

3.3. releaseId
3.8. serviceId

and uses them in the examples as follows:

{
"status": "pass",
"version": "1",
"releaseID": "1.2.2",
"notes": [""],
"output": "",
"serviceID": "f03e522f-1f44-4062-9b55-9587f91c9c41",

As per my understanding JSON is case-sensitive and the example should reflect the specification.
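
A quick demonstration of why the mismatch matters to a consumer, using Python's standard json module and the key spellings from the example above:

```python
import json

doc = json.loads(
    '{"releaseID": "1.2.2", "serviceID": "f03e522f-1f44-4062-9b55-9587f91c9c41"}'
)

# Keys are matched exactly, so the spelling used in the section titles is not found.
print(doc.get("releaseId"))   # None
print(doc.get("releaseID"))   # 1.2.2
```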

Align with Microprofile Health Check spec

Find ways in which this RFC can better align with the microprofile spec for health checks: https://microprofile.io/project/eclipse/microprofile-health

Git commit hash and service build time information

We from Mainflux have been following the latest "Health Check Response Format for HTTP APIs" draft in order to standardize our healthchecks: https://github.com/mainflux/mainflux/pull/1541.

However, we have a few inquiries regarding the available fields to hold useful information about service source code and build.

Through our practice, we have found the Git commit hash and the time of the build to be very useful when included in service info, as they complement version information. We do not, however, find adequate fields in the /health response JSON structure where we can put this information. The closest existing field we have found for the Git commit hash is releaseId, but the example shows a different usage of this field. For build time we did not find anything useful. So currently we have added 2 fields to our response, commit and build_time, but they make our response JSON non-conformant with the current standard.

What is the best way to proceed? Should this information be added to some of the existing fields somehow, or should the standard be extended?

Clarify supported HTTP methods

What HTTP methods should be supported? I think this should be clarified.

My opinion is that only GET and HEAD should be supported, and other methods should trigger a 405 Method Not Allowed response with an Allow header: Allow: GET, HEAD

This issue is somewhat related to #8.
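
The behaviour suggested above, sketched as a framework-independent helper. The function name and return shape are choices made for this example; only the 405 status and the Allow header value come from the suggestion.

```python
from typing import Dict, Optional, Tuple

ALLOWED_METHODS = ("GET", "HEAD")


def reject_if_not_allowed(method: str) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return a (status, headers) rejection, or None if the method is allowed."""
    if method in ALLOWED_METHODS:
        return None
    return 405, {"Allow": ", ".join(ALLOWED_METHODS)}
```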

resource for clients that ignore the response body

One thing that seems to be missing is a reference to monitoring systems that rely only on the response status code to make decisions. Common load balancers are one such client: they completely ignore the response body and look only at the response status code to decide whether or not a node should remain in the pool.

To avoid making the current resource diverge from the correct usage of status codes (see also #4), one option is to have a specific resource to handle this behaviour and that returns a 200 OK in case the service is healthy, or a 5xx when it is failing.

In my view this is very important as one of the main reasons people create healthchecks in the first place is to have integration with these systems. Even though the systems are clearly limited in their ability to read HTTP responses correctly, they should be supported.

`health` link relation

It would be interesting to also propose a health link relation (or similar). I imagine a scenario where service responses include a Link header with it:

Link: <https://example.com/monitoring/health>; rel="health"

This link relation could also be used in application/json-home representations:

{
  "api": {
      
  },

  "resources": {
    "health": {
      "href": "/monitoring/health"
    }      
  }
  
}

This would be a nice way to improve discoverability.
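
On the client side, discovery through the header could look roughly like this. The parsing is deliberately simplified and is not a complete Link header parser, and "health" is the relation name proposed here, not a registered one.

```python
import re
from typing import Optional

# One link-value: <target> followed by zero or more ;param segments.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')
_REL = re.compile(r'\brel\s*=\s*"?([^";,]+)"?', re.IGNORECASE)


def find_health_link(header: str) -> Optional[str]:
    """Return the target of the first link whose rel includes "health"."""
    for target, params in _LINK.findall(header):
        match = _REL.search(params)
        if match and "health" in match.group(1).split():
            return target
    return None
```

For the header shown above, `find_health_link` returns `https://example.com/monitoring/health`.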

release_id

Is there a reason release_id is snake_case, and everything else camelCase? If not, let's change it.

Support ASP.NET Core healthchecks

Hi @inadarei,

Folks from Microsoft are developing their own solution for health-checks (see https://docs.microsoft.com/en-us/dotnet/standard/microservices-architecture/implement-resilient-applications/monitor-app-health).

But there is a problem: the format of their status field isn't compatible (as usual :-) with this RFC draft. They use "healthy/unhealthy" values for statuses. It would be nice to support these values as optional too.

Thank you.
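
In the meantime, a consumer could bridge the two vocabularies itself. The mapping below is an assumption made for illustration (healthy as pass, unhealthy as fail); neither the draft nor the Microsoft documentation defines it.

```python
# Hypothetical alias table; extend it as needed.
STATUS_ALIASES = {"healthy": "pass", "unhealthy": "fail"}


def normalize_status(value: str) -> str:
    """Map alias status values onto the draft's vocabulary, case-insensitively."""
    lowered = value.lower()
    return STATUS_ALIASES.get(lowered, lowered)
```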

is details a really good name?

I really cringe when I read details in the response, I feel it should be named services or components. In the RFC itself there is description of details:

details: (optional) an object representing status of sub-components of the service in question

Emphasis mine

Overall, I would expect details to pair a human-readable reason with a machine-readable reason (a code) explaining why something is not in pristine condition, possibly with links to documentation for that code. I tried to compose a minimal example, so the links are not included:

{
  "status": "fail",
  "services": {
    "cassandra": [
      {
        "type": "datastore",
        "status": "fail",
        "details": {
          "reason": "Connection error.",
          "code": 10061
         }
      }
    ]
  },
  "details": {
    "reason": "A critical service is not working.",
    "code": 123
  }
}

And of course each sub-component / service (not a detail) shall have its own details.

Also, the RFC is really pushing metrics everywhere. I think a health check is a simple boolean kind of endpoint that answers only one question:

Can I use the API right now?

If not, where can I find more details?

I quickly googled around and found an example

source

output in checks should also be omitted for healthy status

In the output section of The Checks Object, it should explicitly state that the field should be omitted for healthy status. I'd propose the following change:

output (optional) has the exact same meaning as the top-level “output” element, but for the sub-component/downstream dependency represented by the details object. This field SHOULD be omitted for “pass” states.

Introduce componentType that refers a health check service for recursive health checking

Usually, our service architecture depends recursively on several levels of services and components. For every service, you could try to write down the services of its dependent components, but some of them depend on the same resources, and some components are not directly related but sit behind other servers. Therefore, I'd like to be able to specify components of type "service" (or similar) that recursively refer to a health check structure with its own components, and so on.

Do statuses have a 1:1 relationship to HTTP response codes?

I'm currently writing a health check for an application, but I'm a little unclear on the precise relationship between statuses and response codes. The relationship is described as "tightly coupled" but with no further explanation or examples.

Do statuses have a 1:1 relationship to HTTP response codes? I ask as I have cases where it may be useful to have finer grain responses.

pass = 200
warn = 302, a sub-service returns a warning state
warn = 307, a sub-service returns an error state
fail = 404, running but unavailable
fail = 503, dead

In different scenarios, we may want a load balancer to treat different response code ranges as healthy, e.g. [200:302] or [200:307]. Assume the load balancer can only monitor the response code. I know we could always put a customizable filter in front of the health check that decides what to tell the load balancer based on the JSON.

Thanks for the draft; it has been most helpful and timely.

Fix text describing the checks object key

In the The Checks Object section, the second paragraph reads:

The key identifying an element in the object SHOULD be a unique string within the details section. It MAY have two parts: “{componentName}:{measurementName}”, in which case the meaning of the parts SHOULD be as follows:

The first sentence of this paragraph should be changed to read (emphasis only to highlight the changes):

The key identifying an element in the object ~~SHOULD~~MUST be a unique string within the ~~details~~checks section.

The JSON structure that this specification describes will fail in many linters and languages if the keys are not unique. The linters I tried all show a duplicate key error. When marshaling and unmarshaling in Go or Java, there doesn't seem to be validation to reject the duplicate keys so the first item with the duplicate key is simply overwritten when the second duplicate key is encountered. I know the JSON specification doesn't explicitly state that duplicate keys aren't allowed but even in the underlying JavaScript, duplicate keys are almost certainly an error.
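
The same silent overwrite can be reproduced with Python's standard json module, which also offers a hook for rejecting duplicates. The document below is a made-up example, and `reject_duplicates` is a helper written for this illustration.

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that raises on repeated keys instead of overwriting."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


text = '{"checks": {"uptime": [{"status": "pass"}], "uptime": [{"status": "fail"}]}}'

# Default behaviour: the later value silently replaces the earlier one.
print(json.loads(text)["checks"]["uptime"][0]["status"])   # fail

# With the hook, the same document is rejected.
try:
    json.loads(text, object_pairs_hook=reject_duplicates)
except ValueError as err:
    print(err)   # duplicate key: 'uptime'
```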

I'm currently working on implementing this specification in Go and the bigger issue is with the second sentence - there is no way to tell a {componentName}:{measurementName} from a simple unique string. The following ambiguities occur:

1. Since both componentName and measurementName are optional, you might end up with a composite key of either "" or ":". Both an empty string and a lone colon are valid keys in JSON, but I think those should be avoided.
2. If only a componentName is present, should the colon be appended to this value?
3. If only a measurementName is provided, should the colon be prepended to this value?
4. If the answer to ambiguities 2 and 3 is no, how do I programmatically tell the difference between a simple string and a composite key that only contains a componentName or measurementName?
5. How do I programmatically tell the difference between a simple string that happens to contain a colon (or more than one colon) and a composite key?

In general, I think I like the idea of a key that has a semantic meaning so I'm in favor of continuing with the idea rather than just stating the key is a unique string.
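
The ambiguities show up as soon as one tries to write the parser. The sketch below picks one arbitrary convention (split on the first colon, treat empty parts as absent); nothing in the draft mandates it, and the name `split_check_key` is hypothetical.

```python
from typing import Optional, Tuple


def split_check_key(key: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a checks key into (componentName, measurementName) on the first colon."""
    component, sep, measurement = key.partition(":")
    if not sep:
        # No colon: cannot tell a plain string from a lone component or measurement.
        return key or None, None
    return component or None, measurement or None
```

With this convention, "cassandra:connections" splits cleanly, "uptime" is reported as a component even though it may be a measurement, ":" yields two absent parts, and "a:b:c" keeps everything after the first colon as the measurement name.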

initializing status?

I think I really want a status for initialising, ie "not yet healthy but so far can't see any reason why I won't be at some point". Or should I use a WARN state for that? These states can be quite long (eg database recovery is the classic example) and are not healthy in that requests can be served but they are not failed.

AffectedEndpoints should be optional

The affectedEndpoints section of the specification states:

A typical API has many URI endpoints. Most of the time we are interested in the overall health of the API, without diving into details. That said, sometimes operational and resilience middleware needs to know more details about the health of the API (which is why “checks” property provides details). In such cases, we often need to indicate which particular endpoints are affected by a particular check’s troubles vs. other endpoints that may be fine. The affectedEndpoints property is a JSON array containing URI Templates as defined by [RFC6570].

Each of the other sections starts with <name> (optional) is ..., but from the paragraph above it's unclear whether this value is optional or required. The example clearly shows that it's optional and that if the array is empty, the whole field can be omitted. This paragraph should also state, as the output section does, that the field should be omitted for the “pass” state.

Perhaps the existing paragraph (which describes why the field exists) could be preceded by a paragraph that states what the field is (like the other fields). I'd propose the following text:

affectedEndpoints (optional) contains the URI for unhealthy endpoints. This field SHOULD be omitted for “pass” status.

Status code definition

There was a previous issue related to the use of a 4xx status code for fail which has now been closed. There seemed to be agreement that returning the status code of a dependent service could be useful but I'd like to agree with the original premise of that issue - an "HTTP Client Error" should not be returned from a healthcheck. The offending text is in section 3.1 and reads:

For "fail" status, HTTP response code in the 4xx-5xx range MUST be used.

The RFC that defines HTTP status codes contradicts the use of a 4xx as an indication of a server error in section 10.4 - a 4xx is probably still appropriate for actual client errors (e.g. the client trying to access the healthcheck doesn't have permission).

Ecosystem of tools using RFC

If you have built a library or a tool that supports the RFC, please share it in this issue queue so others can easily find it.

"Additional Keys" instead of `details` is difficult for Java

The "additional keys" concept is more difficult to implement in Java than the previous details field of component-details.

Java frameworks don't consistently allow unknown fields to appear in JSON objects without special configurations or overrides.

A free-form details or data field of type Map<String, Object> is preferable in more statically typed languages like Java and in the strictly schema-validating frameworks common in the Java ecosystem.

Add structured "impacts" field for graceful degradation

Many micro-services have a few dependencies and it's pretty obvious that when the back-end's database connection has failed, the micro-service is completely down. Things aren't nearly so clear when the hierarchy of micro-services is more than one layer deep. I've added #51 in an effort to allow components to incorporate the top-level health response object as a check, but this hierarchy obviously also impacts how you calculate the top-level status. The failure of some components might only result in a top-level warn status. Or maybe a full-text indexer's failure wouldn't change a pass status since you could conceivably catch up later (and operate without it until then).

This is even more obvious when you consider the recommended UI practice of graceful degradation. For instance, our user account management system relies on about 20 different back-end systems but the UI only outright fails for a few of them. In such cases, we need a way for the UI's healthcheck end-point to report on the impact that each degraded or failed component has on the UI's back-end. We'd propose something like:

  ...
  "impacts": [
    {
      "impactId": <uuid string>,
      "checkKey": <string>,
      "impactDetail": <string>,
      "recommendsStatus": <status string>
    },
    ...
  ],
  ...

Three important notes about the format above:

The impactId field is primarily for debugging and log analysis. While a random UUID is suggested, the important aspect of this ID is that it's a unique, constant string.
The impactDetail field is a human-readable description that details what is currently non-functional.
The recommendsStatus field is NOT the status returned by the component but is rather the status that will be used to calculate the top-level status. This is a subtle difference but in many cases allows the top-level status to be calculated as the most severe of the impact recommendsStatuses.

Using the account UI healthcheck as an example, the JSON resulting from a Kerberos outage and an SMS outage would produce the following impacts section:

  ...
  "impacts": [
    {
      "impactId": "47619208-2556-41a4-a72c-801209b8ed9e",
      "checkKey": "kerberos:connection",
      "impactDetail": "The user will be unable to change their password",
      "recommendsStatus": "warn"
    },
    {
      "impactId": "85ad165d-9edf-4da5-8d95-93d299673680",
      "checkKey": "sms:connection",
      "impactDetail": "The user will be unable to perform self-service account recovery",
      "recommendsStatus": "warn"
    }
  ],
  ...

Receiving these impacts allows the UI to adopt a couple very useful behaviors:

The UI can use the impactDetail information to tell the user precisely which functions are not available (as simple as putting a toast at the top of a screen).
The UI can use the unique, constant impactId value to conditionally disable or hide the control elements for those functions.

Two more benefits of this format are:

The top-level status field can be calculated as warn based on the severity of the two underlying failures.
A human looking at the health response object can determine which checks contributed to the calculation of the top-level status.
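
The "most severe of the impact recommendsStatuses" calculation can be sketched as follows. The severity ordering pass < warn < fail and the default of "pass" for an empty impacts list are assumptions of this example.

```python
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def top_level_status(impacts: list) -> str:
    """Return the most severe recommendsStatus, or "pass" when there are no impacts."""
    return max((impact["recommendsStatus"] for impact in impacts),
               key=SEVERITY.__getitem__, default="pass")
```

For the Kerberos and SMS example above, both impacts recommend warn, so the top-level status comes out as warn.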

Suggest default path `/health`

While the example suggests to the observant reader that it is a good idea to use /health as the default path to reach the health check endpoint, this is not mentioned anywhere in the text. May I suggest that the following paragraphs are added:

For interoperability, discovery, and ease of setup of tools that consume such an endpoint, health check endpoints SHOULD be reachable via a URL path that ends on /health. Preferably that path is reachable from the toplevel.

I'm sure the wording can be improved, but I guess you get what I mean.

HTTP status code for warn status

From the current draft:

In case of the “warn” status, endpoints MUST return HTTP status in the 2xx-3xx range, and additional information SHOULD be provided, utilizing optional fields of the response.

Just adding this to the discussion: the Nagios check_http plugin doesn't work this way. For a warning to be triggered, the HTTP status code must be in the 4xx range.
A Nagios warning is defined like this:

The plugin was able to check the service, but it appeared to be above some "warning" threshold or did not appear to be working properly

I think it would be interesting to see what other plugins do and also having a look at other monitoring systems.

"affectedEndpoints" field should be optional

Currently, every other field in the checks object is optional. As there are many types of sub-component that don't have affectedEndpoints, it makes sense to mark this field as optional as well.

Explicitly allow or disallow extra keys in "component details" objects

Concerning the component details object, the current draft of this standard reads as:

On the value side of the equation, each "component details" object in the array MAY have one of the following object keys:

The definition of this object does not make explicit mention of any additional keys which may or may not be present. However, the example seems to include additional keys named node in the component detail object under checks['cpu:utilization']. It would be preferred to have an explicit statement allowing these keys, to ensure that implementations remain cross-compatible.

response considerations

Make it explicit that details objects are unstandardized?

Is cpu the percentage of the allowed CPU for the logical node that the application is using? the percentage of the allowed CPU that is in use? something else? Not sure how unclear this actually is, but maybe it needs some thought. Things that would be useful to, say, a systems admin, would be "how much CPU is in use", "how much CPU is this app I'm checking using", "what is the load average of the logical node", and maybe other things.

Memory has some of the same issue, and also the issue that in 10 years, the absolute numbers will likely be much larger, since it's tied to kilobytes. Maybe memory_unit that could be one of "perc", "kB", "KiB", "MB", "MiB", "GB", "GiB"?

Rename additional-keys additional-properties

Presumably the additional-keys allowed in the component-details object would also have values. Calling these arbitrary key-value pairs additional-properties aligns with the definition provided by JSON Schema.

Should additional-properties be allowed in the health object?

The component-details objects provided in the Checks arrays allow additional-properties (additional-keys). Should arbitrary JSON key-value properties also be allowed in the top-level health object?

media type name stability

as with the infamous X-... naming patterns for header fields (https://tools.ietf.org/html/rfc6648), the same applies to media types: assigning a name in the draft and then planning to change it is not the best strategy. pick a name you'd like to keep, and then stick to it. otherwise you build fragmentation into the development/adoption process, which makes it harder for people to understand which identifier to use.