nullus157 / cbor-diag-rs
Support for parsing/encoding CBOR diagnostic notation and annotated hex
Home Page: https://cbor.nemo157.com
License: Apache License 2.0
https://www.rfc-editor.org/rfc/rfc8742.html
For example, if support for cbor sequences were present, I believe the following should parse:
{
echo '{"a":1}' | cbor-diag --to=bytes ;
echo '{"b":2}' | cbor-diag --to=bytes ;
} | cbor-diag --to=annotated
Ideally the following would work as well (accept jsonl/ndjson and other representations of sequences of documents):
{ echo '{"a":1}' ; echo '{"b":2}' ; } | cbor-diag --to=annotated
The CLI should support colorizing the output data to make it easier to read, both diagnostic and annotated hex encoding.
This will probably have to be generated by the library. The easiest approach might be to inject ANSI escape codes into the output strings, which could probably also be used on the website with something like https://www.npmjs.com/package/ansi-to-html. (How to support Windows terminals? Hopefully there's something that can parse ANSI escapes and generate the correct commands, or maybe there's a library that gives higher-level colorized strings via annotated spans that can be converted into different formats as needed.)
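A minimal sketch of the "annotated spans" idea, assuming hypothetical `Style` and `Span` types (neither exists in cbor-diag today): the library would return styled spans, and each frontend (terminal, website) would pick its own renderer. The colors and class names are arbitrary.

```rust
/// Hypothetical style classes for tokens of diagnostic notation.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Style {
    Plain,
    Number,
    Text,
}

/// A piece of output text tagged with a style (hypothetical type).
struct Span {
    style: Style,
    text: String,
}

/// Render spans for a terminal that understands ANSI escape codes.
fn to_ansi(spans: &[Span]) -> String {
    spans
        .iter()
        .map(|s| match s.style {
            Style::Plain => s.text.clone(),
            Style::Number => format!("\x1b[36m{}\x1b[0m", s.text), // cyan
            Style::Text => format!("\x1b[32m{}\x1b[0m", s.text),   // green
        })
        .collect()
}

/// Escape the characters that are significant in HTML element content.
fn escape(text: &str) -> String {
    text.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}

/// Render the same spans as HTML, leaving the colors to a stylesheet.
fn to_html(spans: &[Span]) -> String {
    spans
        .iter()
        .map(|s| {
            let text = escape(&s.text);
            match s.style {
                Style::Plain => text,
                Style::Number => format!("<span class=\"number\">{}</span>", text),
                Style::Text => format!("<span class=\"text\">{}</span>", text),
            }
        })
        .collect()
}
```

With this split, a Windows console backend would be just another renderer over the same spans, with no ANSI parsing needed.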
An example of what jq does for colorizing JSON, which the diagnostic colorization can probably be based on:
I'm undecided whether this is worthwhile, but the AST could have nodes added to retain comments when parsing extended diagnostic notation. These could then be included when outputting either diagnostic notation, or as part of the annotated hex output.
This wouldn't really help with what I see as the main use cases though: converting EDN -> bytes for tooling that wants human-writable input, or converting bytes -> {EDN, annotated hex} for tooling that wants to show human-readable output.
Would it be possible to add a json output? Basically similar to the diag/compact form but without things like tags that throw off json parsing?
Probably should be done before #17 in case the errors change.
proptest depends on rand, which currently fails to build with minimal versions: rust-random/rand#741
Once that's resolved, we should try to change the minimal CI job from just building to actually running the tests.
Currently the cbor2diag tool seems to treat whitespace within hex-encoded byte strings differently than within base64-encoded byte strings. An example of the bad behavior is below. It seems that where the space is in the base64 sequence doesn't matter; any space causes this issue.
Section G.1 of RFC 8610 indicates that all whitespace is ignored. Its examples use hex encoding, but my interpretation is that it should also apply to base64 encoding.
[STRIP ~]$ echo "h'6865 6C6C 6F'" | diag2pretty.rb
45 # bytes(5)
68656c6c6f # "hello"
[STRIP ~]$ echo "b64'aGVsbG8='" | diag2pretty.rb
45 # bytes(5)
68656c6c6f # "hello"
[STRIP ~]$ echo "b64'aGVs bG8='" | diag2pretty.rb
*** can't parse b64'aGVs bG8='
*** Expected one of [0-9a-zA-Z_\-+/], [=], "'" at line 1, column 9 (byte 9) after b64'aGVs
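For illustration, a whitespace-tolerant base64 decoder could look like the sketch below. The function name `decode_b64_lenient` is made up, and this is not how any of the tools above are implemented. It accepts both the standard and the URL-safe alphabet (matching the character set in the error message) and skips ASCII whitespace wherever it appears.

```rust
/// Decode base64 (standard or URL-safe alphabet), skipping ASCII whitespace
/// anywhere in the input. Returns `None` on any other unexpected character.
/// Simplification: `=` padding is ignored wherever it appears, not validated.
fn decode_b64_lenient(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut acc: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of valid bits currently in `acc`
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' | b'-' => 62,
            b'/' | b'_' => 63,
            b'=' => continue, // padding carries no data
            c if c.is_ascii_whitespace() => continue,
            _ => return None,
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            // A full byte is available: emit it and keep the leftover bits.
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1;
        }
    }
    Some(out)
}
```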
Some very quick testing via cargo-wasi shows that everything except colored output appears to work (probably need to add some support for something in termcolor). From a release build, wasmtime appears to need ~7s on a MacBook Air to JIT the optimized wasm blob on first run, and it's pretty much instantaneous on future runs.
How to actually do this? 🤷‍♂️
One additional thing I would like to do with this is to sandbox the tool more. The only thing it needs access to is stdin+stdout+args; all network and filesystem access should be disabled. It doesn't look like running from wasmtime as a CLI tool supports this, so maybe cbor-diag-cli should use wasmtime as a library if there's some way to do so there.
Byte strings have two widespread serialization variants: 'text' and h'74657874' (plus the rarer b32, h32 and b64 prefixes, which I personally don't care about, but hey, they're there). It would be nice if this could be preserved, maybe as an extra Option property of ByteString.
RFC 8610 Appendix G (Extended Diagnostic Notation) provides even more options, including internal whitespace and embedded CBOR. They are more complex and not really on my wish list, but it might be good to be aware of them when implementing this, so as not to duplicate work if they become relevant later.
This would be especially convenient when building a diagnostic notation programmatically.
This would probably share patterns with #117, in that it is a property that is set when coming from DN, but unavailable when coming from CBOR. Filling out those gaps when going from arbitrary CBOR to DN could be done by the user at the AST stage by applying arbitrary heuristics, some of which may be provided by cbor-diag-rs, but that's ultimately application specific. (For example, a simple universal heuristic would be taking the ratio of printable ASCII characters; a more application specific choice might be guided by CDDL).
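The "ratio of printable ASCII characters" heuristic could be sketched as follows. `BytesEncoding` and `guess_encoding` are hypothetical names, not part of cbor-diag-rs, and the 90% threshold is an arbitrary assumption.

```rust
/// Hypothetical marker for how a byte string should be written in
/// diagnostic notation: as '...' text or as h'...' hex.
#[derive(Debug, PartialEq)]
enum BytesEncoding {
    Text,
    Hex,
}

/// Guess a notation for `bytes` from the share of printable ASCII in it.
/// The 90% threshold is arbitrary; an application could pick its own.
fn guess_encoding(bytes: &[u8]) -> BytesEncoding {
    if bytes.is_empty() {
        return BytesEncoding::Hex;
    }
    let printable = bytes
        .iter()
        .filter(|b| b.is_ascii_graphic() || **b == b' ')
        .count();
    // Integer arithmetic for `printable / len >= 0.9`.
    if printable * 10 >= bytes.len() * 9 {
        BytesEncoding::Text
    } else {
        BytesEncoding::Hex
    }
}
```

A user could run something like this over the AST after decoding CBOR and before emitting DN, which keeps the policy out of the library.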
Given how public the core DataItem type is, I think that issues #117, #132 and #138 will all need API changes: Neither can DataItem gain another variant CommentedItem { comment_before: String, comment_after: String, item: Box<DataItem>}, nor can ByteString gain a field that tells whether it's a '', a h'' or a b64'' string.
I've briefly tried sprinkling in a few #[non_exhaustive] (which are a breaking API change), but they had the side effect of breaking construction as DataItem::Integer { value: 10, bitwidth: 0 } as that may have missed fields.
I suggest that a single API change be made in which a lot of API is made private (possibly DataItem even turns into a struct so it can have hidden internals). Maybe the breaking change would not even add the features, just change the types so that extensibility is possible. After, things could look like constructors that are more focused on the values:
let item = DataItem::integer(42);
We could still allow setting bit widths etc, but I don't think that it's practically needed often. (It gets set through internal access anyway when deserializing CBOR, but when constructing diagnostic notation programmatically, I don't think it should be the focus of ergonomics).
Thing is: I don't know the typical use of the crate well enough to make a full proposal yet -- I mainly use it to do a full conversion without ever touching intermediate artifacts (yet -- once #132 is in, I'll start reaching in more).
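To make the shape of that proposal concrete, here is a rough sketch. Everything in it is hypothetical (`Inner`, `with_bitwidth`, `as_integer` and the `Option<u8>` bit width are stand-ins, not the crate's current API); the point is only that a struct with private internals can grow new fields or variants without another breaking change.

```rust
/// Public handle whose representation is private, so it can change freely.
#[derive(Clone, Debug, PartialEq)]
pub struct DataItem {
    inner: Inner,
}

/// Private representation (hypothetical; the real AST has many more cases).
#[derive(Clone, Debug, PartialEq)]
enum Inner {
    Integer { value: u64, bitwidth: Option<u8> },
    Text(String),
}

impl DataItem {
    /// Value-focused constructor: the encoding width is left undecided.
    pub fn integer(value: u64) -> Self {
        DataItem { inner: Inner::Integer { value, bitwidth: None } }
    }

    pub fn text(value: &str) -> Self {
        DataItem { inner: Inner::Text(value.to_string()) }
    }

    /// Opt-in control over the encoded width for the rare cases that need it.
    pub fn with_bitwidth(mut self, bits: u8) -> Self {
        if let Inner::Integer { bitwidth, .. } = &mut self.inner {
            *bitwidth = Some(bits);
        }
        self
    }

    pub fn as_integer(&self) -> Option<u64> {
        match &self.inner {
            Inner::Integer { value, .. } => Some(*value),
            _ => None,
        }
    }

    pub fn bitwidth(&self) -> Option<u8> {
        match &self.inner {
            Inner::Integer { bitwidth, .. } => *bitwidth,
            _ => None,
        }
    }
}
```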
Needed for CLI, would be useful for generating links in webpage as well
Affects chrono < 0.4.20
The current "pretty" format uses an estimation of when a value is "trivial" in order to print it on one line when it would normally be split to multiline, e.g. an array containing a single trivial element, a map containing a trivial key-value pair:
This has issues, e.g. { "hello": "world" } will be kept on one line, but [1, 2, 3, 4] will be split over 6 lines despite being a similar size/complexity.
A better heuristic would be to perform an estimate of how much space the value will take to print, and set an arbitrary upper limit on that. For most data the estimate can probably be perfect, but some like floating point numbers are probably not worth calculating exactly. Some data types like maps might apply an extra "cost" to their contained values because of their relative complexity.
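A sketch of such a width estimate, using a simplified stand-in `Value` type instead of the crate's real AST. The `MAP_ENTRY_COST` constant and the caller's line budget are arbitrary assumptions. Text width ignores escaping, so it is only an estimate.

```rust
/// Simplified stand-in for the AST; the real type has many more variants.
enum Value {
    Integer(u64),
    Text(String),
    Array(Vec<Value>),
    Map(Vec<(Value, Value)>),
}

/// Arbitrary extra "cost" charged per map entry for its relative complexity.
const MAP_ENTRY_COST: usize = 2;

/// Spend `budget` columns on printing `value` on one line. Returns the
/// columns left over, or `None` as soon as the value no longer fits.
fn fit(value: &Value, budget: usize) -> Option<usize> {
    match value {
        Value::Integer(n) => budget.checked_sub(n.to_string().len()),
        // Two quotes plus the characters; escapes are not accounted for.
        Value::Text(s) => budget.checked_sub(s.chars().count() + 2),
        Value::Array(items) => {
            let mut left = budget.checked_sub(2)?; // "[" and "]"
            for (i, item) in items.iter().enumerate() {
                if i > 0 {
                    left = left.checked_sub(2)?; // ", "
                }
                left = fit(item, left)?;
            }
            Some(left)
        }
        Value::Map(entries) => {
            let mut left = budget.checked_sub(4)?; // "{ " and " }"
            for (i, (key, val)) in entries.iter().enumerate() {
                if i > 0 {
                    left = left.checked_sub(2)?; // ", "
                }
                left = fit(key, left)?;
                left = left.checked_sub(2)?; // ": "
                left = fit(val, left)?;
                left = left.checked_sub(MAP_ENTRY_COST)?;
            }
            Some(left)
        }
    }
}

/// True if `value` is estimated to fit within `max` columns on one line.
fn is_one_line(value: &Value, max: usize) -> bool {
    fit(value, max).is_some()
}
```

Because every subtraction is checked, the walk stops as soon as the budget is exhausted, so large values are rejected without being traversed in full.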
For interactive editing (highlighting cursor positions in a two-paned hex and diagnostic view), or for debugging (implementing pd-body-error-position), it would be cool to match ranges of bytes encoded in CBOR to ranges of bytes encoded in diagnostic notation -- similar to how a compiler outputs debug information matching instructions to source lines.
This is tangentially related to #20, as it would pave the way to color-highlighting hex output.
One thing that'll make this relatively hard for this crate is that it's interconverting via a mutable AST (which on its own is great, just needs some more effort here). A relatively easy API would be to turn a CBOR byte string into a DN text string (or vice versa), and also produce a source map as a list of corresponding (frequently nested) ranges. There's probably a design pattern by which the AST can keep cursors in two serializations, but I don't know how to make a pretty API out of it, or how to do it with neither pinning nor Rc'ing nor indices for which it isn't completely clear which slice they relate to.
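As a strawman for the "list of corresponding (frequently nested) ranges" output, the source map could be a small tree that is separate from the AST. The `Mapping` type and its method below are hypothetical, and the sketch sidesteps the harder question of keeping such ranges valid while the AST is mutated.

```rust
use std::ops::Range;

/// One entry of a hypothetical source map: a byte range in the CBOR
/// encoding, the matching byte range in the diagnostic notation text,
/// and the mappings of any nested items.
struct Mapping {
    cbor: Range<usize>,
    diag: Range<usize>,
    children: Vec<Mapping>,
}

impl Mapping {
    /// Find the innermost item covering the CBOR byte at `offset` and
    /// return the range of diagnostic notation it corresponds to.
    fn diag_for_cbor_offset(&self, offset: usize) -> Option<Range<usize>> {
        if !self.cbor.contains(&offset) {
            return None;
        }
        for child in &self.children {
            if let Some(range) = child.diag_for_cbor_offset(offset) {
                return Some(range);
            }
        }
        // No child covers it: the offset belongs to this item's own head.
        Some(self.diag.clone())
    }
}
```

For the CBOR bytes `82 01 02` and the text `[1, 2]`, the root would map 0..3 to 0..6, with children mapping 1..2 to 1..2 and 2..3 to 4..5.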
With edn-literals recently adopted by the CBOR WG, I'd love to see support for that in here. Implementing cri"" won't make much sense yet because CRI is not finished (when it is, I can provide the workhorse crate), but dt"" seems stable enough.
Like EDN embedded CBOR, this can be ingested without any extra processing, but producing it will need decisions (with "we don't produce it" being a viable first step). The same mechanism that would resolve #132 could be used to eventually produce them -- it'd just need generalization (because now suddenly a float or integer can either be encoded directly with a bitwidth, or the dt EDN literal).
This issue can be reproduced ONLY when compiling the CLI in Debug; it does NOT occur when compiling the CLI in Release.
Run
$ cargo build -p cbor-diag-cli
then run
$ echo a300470128bf0000002c01c11a62978f4d02a100820400 | ./target/debug/cbor-diag --to diag
thread 'main' panicked at 'attempt to subtract with overflow', src/encode/diag.rs:42:42
The CBOR diagnostic notation of the hex string is
{
0: h'0128bf0000002c',
1: 1(1654099789_2),
2: {0: [4, 0]},
}
The crash occurs because the len variable is hardcoded to 4 and the max parameter given to the estimate() function is less than 4, overflowing the unsigned subtraction max - len.
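The debug-only behavior fits Rust's default profiles: debug builds enable integer overflow checks and panic, while release builds wrap silently. One way to handle the subtraction in both profiles is `checked_sub`, sketched below. `LEN` and `remaining_budget` are stand-ins for illustration; I have not checked what the actual code in src/encode/diag.rs looks like beyond the description above.

```rust
/// Stand-in for the fixed length mentioned in the report.
const LEN: usize = 4;

/// Budget left after spending `LEN`, or `None` if `max` is too small.
/// Unlike `max - LEN`, this never panics in debug builds and never wraps
/// around to a huge value in release builds.
fn remaining_budget(max: usize) -> Option<usize> {
    max.checked_sub(LEN)
}
```

If "does not fit" should simply be treated as a zero budget, `max.saturating_sub(LEN)` would be the alternative.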
Thanks a lot for making and sharing this library!
When I copy and paste the example from crates.io, the output is "<< was unexpected at this time."
None of the examples listed work for me....
Should probably change the current --to diag output to output prettified diagnostic notation and add another --to compact or something to output a compact form.
Revert code changes from Nemo157@77d606c once strum 0.16 is released (hopefully soon Peternator7/strum#69).
Building nom with a recent Rust raises warnings about future compatibility:
warning: the following packages contain code that will be rejected by a future version of Rust: nom v5.1.2
It appears that nom uses some macro syntax that was only accidentally valid.
I just briefly tried bumping the nom version to 7, but while I could do the change from separated_list to separated_list0 easily (nom 6's changelog says it was effectively renamed), other errors seem to require understanding what it does, for which unfortunately I don't have the time right now.
CLI should support streaming in undelimited CBOR data items. Will need some kind of update to the library to support reading a prefix of an input and returning where it ended.
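A rough sketch of what "reading a prefix and returning where it ended" could look like, independent of the crate's parser. `read_head`, `item_len` and `split_sequence` are made-up names. The code only measures items and does not validate them (for example, chunk types in indefinite-length strings are not checked), and the recursion depth is unbounded.

```rust
use std::convert::{TryFrom, TryInto};

/// Parse an item head: (major type, additional info, argument, head length).
fn read_head(data: &[u8]) -> Option<(u8, u8, u64, usize)> {
    let first = *data.first()?;
    let (major, info) = (first >> 5, first & 0x1f);
    let (arg, len) = match info {
        0..=23 => (u64::from(info), 1),
        24 => (u64::from(*data.get(1)?), 2),
        25 => (u64::from(u16::from_be_bytes(data.get(1..3)?.try_into().ok()?)), 3),
        26 => (u64::from(u32::from_be_bytes(data.get(1..5)?.try_into().ok()?)), 5),
        27 => (u64::from_be_bytes(data.get(1..9)?.try_into().ok()?), 9),
        31 => (0, 1),     // indefinite length (or the break code)
        _ => return None, // 28..=30 are reserved
    };
    Some((major, info, arg, len))
}

/// Length in bytes of the first data item in `data`, or `None` if the
/// input is truncated or malformed.
fn item_len(data: &[u8]) -> Option<usize> {
    let (major, info, arg, head) = read_head(data)?;
    let indefinite = info == 31;
    match major {
        // Integers, floats and simple values: the head is the whole item.
        0 | 1 | 7 if !indefinite => Some(head),
        // Definite-length byte/text strings: head plus `arg` payload bytes.
        2 | 3 if !indefinite => {
            let end = head.checked_add(usize::try_from(arg).ok()?)?;
            if end <= data.len() { Some(end) } else { None }
        }
        // A tag wraps exactly one item.
        6 if !indefinite => Some(head + item_len(&data[head..])?),
        // Indefinite-length strings, arrays and maps: items until 0xff.
        2..=5 if indefinite => {
            let mut pos = head;
            while *data.get(pos)? != 0xff {
                pos += item_len(&data[pos..])?;
            }
            Some(pos + 1)
        }
        // Definite-length arrays (`arg` items) and maps (`arg` pairs).
        4 | 5 => {
            let count = if major == 5 { arg.checked_mul(2)? } else { arg };
            let mut pos = head;
            for _ in 0..count {
                pos += item_len(data.get(pos..)?)?;
            }
            Some(pos)
        }
        _ => None,
    }
}

/// Split a concatenation of data items (a CBOR sequence) into its items.
fn split_sequence(mut data: &[u8]) -> Option<Vec<&[u8]>> {
    let mut items = Vec::new();
    while !data.is_empty() {
        let (item, rest) = data.split_at(item_len(data)?);
        items.push(item);
        data = rest;
    }
    Some(items)
}
```

For a true streaming CLI, a `None` caused by truncation would need to be distinguishable from malformed input so the reader knows to wait for more bytes; this sketch conflates the two.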
Hi,
I am new to Rust but want to use the cbor-diag-cli. When using the cargo command to install, I get a bunch of errors of the same category:
error[E0632]: cannot provide explicit generic arguments when `impl Trait` is used in argument position
--> /home/fmetz/.cargo/registry/src/github.com-1ecc6299db9ec823/cbor-diag-0.1.11/src/encode/hex.rs:506:27
|
506 | typed_array::<1>(context, value, "unsigned", |[byte]| byte.to_string())
| ^ explicit generic argument not allowed
|
= note: see issue #83701 <https://github.com/rust-lang/rust/issues/83701> for more information
Any help ?
Best Regards
Fréd