Base64 encoding can only elide padding when the size of encoded data is known (coze issue, closed)

cyphrme commented on July 18, 2024

Base64 encoding can only elide padding when the size of encoded data is known

from coze.

Comments (7)

zamicol commented on July 18, 2024

How is the URI encoding non-standard?

There are some general things about RFC base64 that should be said first: Padding
characters help satisfy length requirements and carry no other meaning. It's
always possible to determine the length of the input unambiguously from the
length of the encoded sequence. Since Coze does not concatenate unpadded base64
strings, Coze does not need base64 padding. Coze only concatenates the binary
form, not base64.

For RFC base64, /w== and /w are both equal to the single byte 11111111. The
padding doesn't do anything. Also, omitting padding saves precious message
space.
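
To make the /w example concrete, here is a minimal Go sketch. It assumes Go's standard encoding/base64 package, and the helper name decodeBoth is made up for illustration. It decodes the padded form with base64.StdEncoding and the unpadded form with base64.RawStdEncoding, and both yield the same byte.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeBoth is a hypothetical helper: it decodes the padded form with
// StdEncoding and the unpadded form with RawStdEncoding.
func decodeBoth(padded, unpadded string) ([]byte, []byte, error) {
	p, err := base64.StdEncoding.DecodeString(padded)
	if err != nil {
		return nil, nil, err
	}
	u, err := base64.RawStdEncoding.DecodeString(unpadded)
	if err != nil {
		return nil, nil, err
	}
	return p, u, nil
}

func main() {
	p, u, err := decodeBoth("/w==", "/w")
	if err != nil {
		panic(err)
	}
	// Both forms decode to the same single byte, 11111111.
	fmt.Printf("%08b %08b\n", p[0], u[0])
}
```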

To summarize, there are a few reasons why padding is not needed:

1. The purpose of padding is to explicitly denote "empty bytes", which is
   already redundant. Padding is never needed to correctly decode a base64
   message as long as the whole message was transported.
2. It's in JSON. Coze doesn't have to worry about base64 concatenation since
   base64 is in JSON, and the double quote serves as the base64 string
   terminator.
3. Digests and cryptographic signatures. If something went wrong with encoding
   or transport, Coze verification simply fails, since the cryptographic
   functions also serve as integrity checks. As long as a Coze message
   verifies, it's also integral.
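
For context on why concatenation is the case where padding matters, here is a small Go sketch. It is an illustration only, not something Coze does, and the helper name concatUnpadded is made up. It joins two unpadded base64url strings and decodes the result as one string. The decoder reports no error, yet the output is not the two original inputs joined together, which is why a terminator such as the JSON double quote is needed.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// concatUnpadded is a hypothetical helper: it encodes a and b without
// padding, joins the two strings, and decodes the joined string as one value.
func concatUnpadded(a, b []byte) ([]byte, error) {
	enc := base64.RawURLEncoding
	joined := enc.EncodeToString(a) + enc.EncodeToString(b)
	return enc.DecodeString(joined)
}

func main() {
	got, err := concatUnpadded([]byte("a"), []byte("a"))
	// The result is three bytes rather than "aa", and err is nil.
	fmt.Printf("%q %v\n", got, err)
}
```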

As an aside, JOSE already does the same thing, and it's widely used in
industry.

As a historical anecdote, Coze used to encode with Hex because it is more human
readable, is always exactly twice as large as the binary form, and doesn't need
padding. On the other hand, b64ut does not have a static multiplicative
relationship with binary, is less human readable, and for a few edge cases
padding can be useful. After considering the message size savings, we dropped
Hex in favor of RFC b64ut.
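
The size trade-off can be checked with a short Go sketch. It assumes the standard encoding/hex and encoding/base64 packages, and the helper name encodedSizes is made up for illustration. Hex always doubles the byte count, while the unpadded base64url length depends on the byte count modulo 3.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// encodedSizes is a hypothetical helper: it returns the Hex length and the
// unpadded base64url length for n bytes of binary data.
func encodedSizes(n int) (hexLen, b64Len int) {
	return hex.EncodedLen(n), base64.RawURLEncoding.EncodedLen(n)
}

func main() {
	for _, n := range []int{1, 2, 3, 32, 64} {
		h, b := encodedSizes(n)
		fmt.Printf("%2d bytes: hex=%3d b64ut=%2d\n", n, h, b)
	}
}
```

For a 32 byte digest this gives 64 Hex characters versus 43 unpadded base64url characters.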

Satoshi had the same concerns with base64 and thus used base 58. We decided
that RFC base64 is good enough and that implementing an alternative base
conversion system would be more trouble than it's worth. That has not stopped
others from doing so. (Keybase's solution is linked, with others, at the bottom
of the base conversion tool.) If we had chosen a different base conversion
method, I would have liked to use a higher base (like a base 91 alphabet),
which results in shorter messages. However, character escaping then becomes an
issue. At that point, a purely binary form of Coze would be better. Base64 is
"right sized": it has enough characters to make messages reasonably short,
while not having so many that it requires an excessive amount of escaping for
various applications.

from coze.

zamicol commented on July 18, 2024

I don't mean "close the issue" for no more feedback, but I don't believe this is a concern. (I'm a bit of a GitHub dunce, please forgive any of my social blunders done by clicking green buttons.)

I appreciate you reading Coze and poking holes into it. This is exactly what needs to be done, and I want to motivate skepticism as much as I can.

from coze.

peterbourgon commented on July 18, 2024

How is the URI encoding non-standard?

You use base64.URLEncoding. That is described as

URLEncoding is the alternate base64 encoding defined in RFC 4648. It is typically used in URLs and file names.

whereas base64.StdEncoding is described as

StdEncoding is the standard base64 encoding, as defined in RFC 4648.
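
As a rough illustration of the difference between the two encodings, the following Go sketch encodes the same two bytes with both. The helper name encodeBoth is made up for illustration. The output differs only in the last two alphabet characters: + and / in the standard alphabet, - and _ in the URL alphabet.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeBoth is a hypothetical helper: it encodes data with the standard
// alphabet and with the URL-safe alphabet, both with padding.
func encodeBoth(data []byte) (std, url string) {
	return base64.StdEncoding.EncodeToString(data),
		base64.URLEncoding.EncodeToString(data)
}

func main() {
	std, url := encodeBoth([]byte{0xfb, 0xff})
	fmt.Println(std, url) // prints "+/8= -_8="
}
```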

Padding characters help satisfy length requirements and carry no other meaning. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.

As far as I can tell, this is a mis-reading of the relevant requirements, and not correct. It is only possible to unambiguously decode a base64 encoded string in isolation if padding characters are included. If padding characters are elided, then it is only possible to unambiguously decode that string if the length is communicated out-of-band.

Quoting https://www.rfc-editor.org/rfc/rfc4648#section-3.2

   In some circumstances, the use of padding ("=") in base-encoded data
   is not required or used.  In the general case, when assumptions about
   the size of transported data cannot be made, padding is required to
   yield correct decoded data.

   Implementations MUST include appropriate pad characters at the end of
   encoded data unless the specification referring to this document
   explicitly states otherwise.

from coze.

zamicol commented on July 18, 2024

base64.StdEncoding

It's a matter of semantics. base64.URLEncoding is standardized by the same RFC. We've dubbed it more specifically b64ut. Even though it's not what the RFC names as the standard alphabet, URI encoding is standardized formally by that RFC 4648.

We especially felt the need to dub it b64ut to avoid confusion with the generalized arbitrary base 64 which uses the "iterative divide by radix" method and is sometimes equal to RFC base64.

The JOSE JWS RFC specifically addresses that:

As per the example code above, the number of '=' padding characters
that needs to be added to the end of a base64url-encoded string
without padding to turn it into one with padding is a deterministic
function of the length of the encoded string. Specifically, if the
length mod 4 is 0, no padding is added; if the length mod 4 is 2, two
'=' padding characters are added; if the length mod 4 is 3, one '='
padding character is added; if the length mod 4 is 1, the input is
malformed.
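
The rule quoted above can be sketched in a few lines of Go. This is an illustrative sketch of that rule, not the code from the RFC's appendix, and the function name addPadding is made up.

```go
package main

import (
	"errors"
	"fmt"
)

// addPadding is a hypothetical helper: it restores '=' padding to an unpadded
// base64url string using only the string's length, per the quoted rule.
func addPadding(s string) (string, error) {
	switch len(s) % 4 {
	case 0:
		return s, nil // already a whole number of 4-character groups
	case 2:
		return s + "==", nil
	case 3:
		return s + "=", nil
	default:
		return "", errors.New("malformed base64url: length mod 4 is 1")
	}
}

func main() {
	for _, s := range []string{"_w", "YWI", "YWJj", "Y"} {
		padded, err := addPadding(s)
		fmt.Printf("%q -> %q (err: %v)\n", s, padded, err)
	}
}
```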

And that is correct. Padding is always deterministically recreatable as long as the original message is given.

RFC 4648 is basically referring to streaming, where the original message may not be given, and I think it's one of the more confusingly worded sections. If a stream ends mid stream, without padding it may not be known whether the stream ended or there was an error. For batch processing, this isn't relevant. First, the transport (TCP) will most likely error, then the JSON itself will be malformed, the digest will be bad, and the cryptographic signature will not be valid. There are many layers of defense against having to worry about padding in Coze.

from coze.

peterbourgon commented on July 18, 2024

It's a matter of semantics. base64.URLEncoding is standardized by the same RFC. We've dubbed it more specifically b64ut. Even though it's not what the RFC names as the standard alphabet, URI encoding is standardized formally by that RFC 4648.

"Standard" does not mean "any one of the alphabets defined by the authoritative RFC", it means "the specific alphabet denominated as Standard by the authoritative RFC", which is explicitly not the URI encoding.

The JOSE JWS RFC

Where is this RFC referenced?

Padding is always deterministically recreatable as long as the original message is given . . . RFC 4648 is basically referring to streaming, where the original message may not be given, and I think it's one of the more confusingly worded sections. If a stream ends mid stream, without padding it may not be known that the stream ended or if there's an error.

None of these claims are correct. Padding is not deterministically re-createable, because "the original message" is not guaranteed to be knowable by a given recipient. Neither does RFC 4648 apply only to "streaming" use cases.

I'll disengage at this point.

from coze.

zamicol commented on July 18, 2024

The JOSE JWS RFC is RFC 7515, which also doesn't use padding. See in particular Appendix C.

is not guaranteed to be knowable by a given recipient

We wrote the Coze spec assuming that systems can calculate the length of the digests given in messages, but even if that capability is not present in a particular system, that system can still use JSON validation, digests, or cryptographic verification to ensure messages are well-formed. So even for systems that for some reason cannot calculate the length of the base 64 messages, padding is still not needed.

The only time padding cannot be deterministically reconstructed is if the base 64 payload is malformed or if the receiving end doesn't have the capability to calculate the length of the payload, which is a weird and easily solvable problem on modern systems. Perhaps there's a technical edge case where this is a concern for minimal hardware systems that are implementing Coze? If you have something in particular in mind, I'd like to know more about those technical constraints. Go Coze and JavaScript Coze have no issue implementing this constraint. If the length of the payload can be known, it's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
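
The length relationship described here can be sketched in Go as follows. The function name decodedLen is made up for illustration, and the sketch assumes the whole unpadded string is available. Each base64 character carries 6 bits, so n characters hold n*6/8 whole bytes, and a length with n mod 4 equal to 1 can never occur. The result is checked against the standard library's base64.RawURLEncoding.DecodedLen.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// decodedLen is a hypothetical helper: it returns how many bytes an unpadded
// base64 string of n characters represents.
func decodedLen(n int) (int, error) {
	if n < 0 || n%4 == 1 {
		return 0, errors.New("malformed base64: impossible encoded length")
	}
	return n * 6 / 8, nil // 6 bits per character, 8 bits per byte
}

func main() {
	for _, n := range []int{2, 3, 4, 43, 86} {
		got, err := decodedLen(n)
		if err != nil {
			panic(err)
		}
		// Compare against the standard library's calculation.
		fmt.Println(n, got, got == base64.RawURLEncoding.DecodedLen(n))
	}
}
```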

A good argument is made by Appendix C. It appears to me that there's no need to be concerned about padding since the code needed to reconstruct padding, if needed, is minimal and straightforward.

from coze.

zamicol commented on July 18, 2024

I opened a new issue that's related to base 64 encoding: #18

from coze.
