Coder Social home page Coder Social logo

Comments (3)

GreyCat avatar GreyCat commented on June 10, 2024 1

Hey, @generalmimon, as always, this is very very thorough and I can only thank you for doing all this work. This indeed looks like a magnificent improvement and fixes quite a lot of bugs that have pestered us and our users for years!

I agree with all your assessments. For YAML 1.1 vs YAML 1.2 — I believe ultimately it we'll have to settle on certain pattern of treatment these values holistically — e.g.:

  • either go "all strict everywhere", demanding all values coming in to be exactly strings (i.e. no foo: true or foo: 123 allowed, it must be always foo: "true" or foo: "123")
  • or go "we never demand a specific type when we need value, we're ok to be flexible to accept certain numbers and booleans as values, and we're ok for YAML parser to do it, otherwise our own parse will take it".

So far, looks like most of the places when we had it leans towards the latter (it makes it very useful for authoring YAML by hand).

from kaitai_struct_webide.

generalmimon avatar generalmimon commented on June 10, 2024

So basically both nodeca/js-yaml and eemeli/yaml look great. However, there are a few practical differences after all.

YAML 1.1 vs. YAML 1.2

eemeli/yaml

More focused on the compliance with YAML specs. It claims to pass all of tests in the YAML test suite (https://github.com/eemeli/yaml/blob/b7696fc001837a2e9d66ad78d7c04f47943daeca/README.md?plain=1#L7):

It allows choosing between YAML 1.1 and YAML 1.2 mode. The differences are listed at https://yaml.org/spec/1.2.2/ext/changes/#changes-in-version-12-revision-120-2009-07-21. These are relevant to our purposes (I'll add numbering for easier reference):

  1. Only true and false strings are parsed as booleans (including True and TRUE); y, yes, on, and their negative counterparts are parsed as strings.
  2. Underlines _ cannot be used within numerical values.
  3. The binary and sexagesimal integer formats have been dropped.

Basically, we want number 1 quite badly, because otherwise seq: [{id: x}, {id: y}] turns to {"seq": [{"id": "x"}, {"id": true}]} (this would obviously affect .ksy files that contain id: y with y unquoted, which is... all of them: https://github.com/search?q=repo%3Akaitai-io%2Fkaitai_struct_formats+%22id%3A+y%22&type=code).

(BTW, SnakeYAML used in KSC builds for JVM targets YAML 1.1, so it parses yes and on as booleans, but apparently makes an exception for y such that y is parsed as a string, not as a boolean.)

However, at the moment, KSC code implicitly relies on the YAML parser to behave according to YAML 1.1 in points 2 and 3 (i.e. recognize underscores _ in numbers and support binary notation). Otherwise, you couldn't use numbers with underscores and binary notation at places where KSC expects to get a number from the YAML parser, i.e.:

  1. enum values

    Details
    • underscores _

      enums:
        fruit:
          '0x12_34': apple
      enum_test.ksy: /enums/fruit:
              error: unable to parse `0x12_34` as int
      
    • binary notation 0b...

      enums:
        fruit:
          '0b11': banana
      enum_test.ksy: /enums/fruit:
              error: unable to parse `0b11` as int
      
  2. terminator and pad-right (though this probably wouldn't be a big deal)

This is quite easy to fix on the KSC side, though. Even if the YAML library parsed all integers as strings (this would be the case if we'd ever configure the YAML parser to use failsafe schema, BTW), it wouldn't be a huge problem, but it would require some additional code in KSC.

nodeca/js-yaml

Although it claims to follow YAML 1.2 (https://github.com/nodeca/js-yaml/blob/2cef47bebf60da141b78b085f3dea3b5733dcc12/README.md?plain=1#L1):

JS-YAML - YAML 1.2 parser / writer for JavaScript

... in practice it still supports some features of YAML 1.1 removed in YAML 1.2. In particular underscores _ in numeric literals and binary notation 0b.... This is convenient for us at the moment, as it means these features will be supported in enum values (and terminator, pad-right) without any changes to KSC.

Error messages

I personally find error messages of nodeca/js-yaml nicer and more understandable, even though eemeli/yaml's errors are somewhat more descriptive instead of just "bad indentation" (but that's probably enough in many cases).

Details

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L43-L52

--- 1/results/yaml-1.2.txt
+++ 2/results/js-yaml.txt
@@ ... @@
-> YAMLParseError: Nested mappings are not allowed in compact mappings at line 3, column 12:
->
->     value: condition ? 4 : 8
->            ^
+> YAMLException: bad indentation of a mapping entry in "input.ksy" (3:26)
 >
+>  1 | instances:
+>  2 |   foo:
+>  3 |     value: condition ? 4 : 8
+> ------------------------------^

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L136-L145

-> YAMLParseError: Nested mappings are not allowed in compact mappings at line 2, column 7:
->
->   id: yaml_1
->       ^
+> YAMLException: bad indentation of a mapping entry in "input.ksy" (3:8)
 >
+>  1 | meta:
+>  2 |   id: yaml_1
+>  3 |     seq:
+> ------------^

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L195-L208

-> YAMLParseError: All mapping items must start at the same column at line 8, column 1:
->
->       doc: An animal species
->     seq:  # line 8, but yaml.js says line 4!
-> ^
+> YAMLException: bad indentation of a mapping entry in "input.ksy" (8:5)
 >
+>  5 | types:
+>  6 |   animal:
+>  7 |       doc: An animal species
+>  8 |     seq:  # line 8, but yaml.js says ...
+> ---------^
+>  9 |       - id: species
+>  10 |         type: s4

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L157-L170

-> YAMLParseError: Map keys must be unique at line 6, column 1:
->
->     type: u1
-> seq:
-> ^
+> YAMLException: duplicated mapping key in "input.ksy" (6:1)
 >
+>  3 | seq:
+>  4 |   - id: foo
+>  5 |     type: u1
+>  6 | seq:
+> -----^
+>  7 |   - id: bar
+>  8 |     type: u1

nodeca/js-yaml also seems to have better detection of duplicate keys - it's able to handle even this case (this is rather an edge case, but still):

https://github.com/generalmimon/js-yaml-parsers-test/blob/94d79aab838a174668c7077f6e483221f867fda1/results-diffs/yaml-1.2_vs_js-yaml.diff#L172-L181

2: "bar"
"2": "foo"
--- 1/results/yaml-1.2.txt
+++ 2/results/js-yaml.txt
@@ ... @@
-{ '2': 'foo' }
+ERROR:
+> YAMLException: duplicated mapping key in "input.ksy" (2:1)
+>
+>  1 | 2: "bar"
+>  2 | "2": "foo"
+> -----^

Ease of integration

nodeca/js-yaml is easier to integrate in the Web IDE, because there's already a pre-built minified JS file: https://github.com/nodeca/js-yaml/blob/4.1.0/dist/js-yaml.min.js

eemeli/yaml doesn't provide a single packaged minified file, see eemeli/yaml#480 (comment). When you install it from npm, there's just the node_modules/yaml/browser folder with 75 unminified files, which are 268 KiB in total:

pp@DESKTOP-89OPGF3 MINGW64 /c/temp/js-yaml-parsers-test/node_modules/yaml/browser (master)
$ find . -type f -exec wc -c {} +
...
274427 total

In comparison, nodeca/js-yaml has a .min.js file, which is 38.5 KiB in size:

pp@DESKTOP-89OPGF3 MINGW64 /c/temp/js-yaml-parsers-test/node_modules/js-yaml/dist (master)
$ wc -c js-yaml.min.js
39430 js-yaml.min.js

That's almost 7 times smaller, and it's only 1 file instead of 75 individual ones, meaning the browser has to send only 1 request instead of 75, so it should all be faster. (Yes, technically it's not really fair to compare unminified files to a minified one, but unfortunately no proper minification/packaging is part of our existing Web IDE infrastructure, and I don't want to spend time on it at this point.)

Conclusion

In the end I chose nodeca/js-yaml for the reasons I stated above.

It wouldn't be a big problem to switch to eemeli/yaml if we ever wanted to, but it would require some changes to KSC and the Web IDE infrastructure (add minification step etc.). For now, nodeca/js-yaml is easier to switch to, and I suppose it will work a bit better for us than eemeli/yaml would.

Obviously, it would be best to use a YAML parser in Scala that works in all environments (kaitai-io/kaitai_struct#229), but until that becomes a reality, I believe switching to nodeca/js-yaml is much better than staying with jeremyfa/yaml.js.

from kaitai_struct_webide.

generalmimon avatar generalmimon commented on June 10, 2024

@GreyCat Thanks for your comment!

  • either go "all strict everywhere", demanding all values coming in to be exactly strings (i.e. no foo: true or foo: 123 allowed, it must be always foo: "true" or foo: "123")
  • or go "we never demand a specific type when we need value, we're ok to be flexible to accept certain numbers and booleans as values, and we're ok for YAML parser to do it, otherwise our own parse will take it".

So far, looks like most of the places when we had it leans towards the latter (it makes it very useful for authoring YAML by hand).

Yeah, the latter option sounds more reasonable of the two. I don't see a reason for option one, that sounds too radical and unnecessary (but at least using two pairs of quotes to express a string literal in a KS expression as in https://doc.kaitai.io/user_guide.html#_switching_over_strings wouldn't feel weird anymore 🙂).

Instead, as you've already mentioned in kaitai-io/kaitai_struct#229 (comment):

What we're generally ok to drop

Some YAML compatibility, i.e. we're generally ok to implement a smaller YAML subset. For example, we don't need:

... we can eventually configure our YAML parsers to use failsafe schema, which basically disables scalar interpretation done by the YAML parser. In this mode, there are only 3 data types the YAML parser recognizes: mapping, sequence and string. This means that it no longer knows anything about null, booleans, integers or floats - you'll get any representation of these types (that would normally be mapped to the appropriate type) as a plain string.

For KSC purposes, I suppose this would be pretty much perfect and all we really need. In our case, having the YAML parser produce all these different data types is more annoying than helpful.

For example, right now if you want to have a value instance in the Web IDE with an integer literal that requires more than 53 bits of precision to store exactly, you have to wrap it into quotes like value: '9007199254740993' so that the number is only parsed from string by our expression parser in KSC (which uses BigInts), rather than by the JavaScript YAML parser (as with unquoted value: 9007199254740993), which uses the native JS number type (a.k.a. double, i.e. a 64-bit IEEE 754 floating-point number). However, this inevitably leads to losing precision, since double can only represent 9007199254740992 or 9007199254740994, but nothing in between.

But even on the JVM with SnakeYAML that reads big integers as java.math.BigInteger, the KSC code then converts it back to string, so the string-integer conversion done by SnakeYAML isn't much appreciated either :) (see https://github.com/kaitai-io/kaitai_struct_compiler/blob/3af7a08e0f67212f42a4f8b97a0b25a72395a2c4/jvm/src/main/scala/io/kaitai/struct/formats/JavaKSYParser.scala#L99-L100):

      case _: java.math.BigInteger =>
        src.toString

But at least there's no loss of precision.


Looking at https://doc.kaitai.io/ksy_diagram.html, most of YAML keys that we recognize in .ksy specs are either pure strings (e.g. id, doc) or expressions (if, size), where it's perfectly fine to always receive them as strings as well.

There are a few pure boolean keys (see also https://github.com/search?q=repo%3Akaitai-io%2Fkaitai_struct_compiler%20getOptValueBool&type=code):

  • meta/ks-debug
  • meta/ks-opaque-types
  • size-eos
  • consume
  • include
  • eos-error

and pure integer keys (see https://github.com/search?q=repo%3Akaitai-io%2Fkaitai_struct_compiler+getOptValueInt&type=code):

  • terminator
  • pad-right

And as already mentioned, the integer keys of enum entries would have to be parsed too.

But it's not a big problem to adapt these in KSC to accept strings instead and do the appropriate parsing.

Perhaps the only potential problem is that this change will create some slight incompatibilities with other standalone tools that consume .ksy files and don't use failsafe schema (most likely some situations that were allowed with a YAML parser using implicit typing won't be allowed anymore and vice versa). This is because with failsafe schema, we simply always get strings, but we have no way of knowing whether a scalar was unquoted (meaning that it may be subject to implicit typing in YAML parsers that don't use failsafe schema) or not. For example, size-eos: 'true' or terminator: '0x00' would now be allowed, even though they weren't before. On the other hand, size-eos: yes or size-eos: on would now be probably disallowed (unless we decided to allow these YAML 1.1 representations of true, but I'd rather not to - just literal true and false are enough IMO), although they worked before. But this is probably not a big deal.

Both nodeca/js-yaml and eemeli/yaml that I've talked about in this issue support failsafe schema. SnakeYAML we're currently using doesn't seem to, but SnakeYAML Engine claiming to be a YAML 1.2 processor (as opposed to SnakeYAML, which targets YAML 1.1) apparently supports it (see https://bitbucket.org/snakeyaml/snakeyaml-engine/src/f4b8cdb1e846d5354fb4791916ba3c5932496770/src/test/java/org/snakeyaml/engine/schema/FailsafeTest.java).

from kaitai_struct_webide.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.