lann / wave Goto Github PK

Web Assembly Value Encoding

License: Apache License 2.0

Rust 100.00%

wave's Introduction

WAVE: Web Assembly Value Encoding

WAVE is a human-oriented text encoding of WebAssembly Component Model values. It is designed to be consistent with the WIT IDL format.

Type	Example Values
Bools	`true`, `false`
Integers	`123`, `-9`
Floats	`3.14`, `6.022e+23`, `nan`, `-inf`
Chars	`'x'`, `'☃︎'`, `'\''`, `'\u{0}'`
Strings	`"abc\t123"`
Tuples	`("abc", 123)`
Lists	`[1, 2, 3]`
Records	`{field-a: 1, field-b: "two"}`
Variants	`days(30)`, `forever`
Enums	`south`, `west`
Options	`"flat some"`, `some("explicit some")`, `none`
Results	`"flat ok"`, `ok("explicit ok")`, `err("oops")`
Flags	`{read, write}`, `{}`

Usage

use wasmtime::component::{Type, Val};

let val: Val = wasm_wave::from_str(&Type::String, "\"👋 Hello, world! 👋\"").unwrap();
println!("{}", wasm_wave::to_string(&val).unwrap());

→ "👋 Hello, world! 👋"

Encoding

Values are encoded as Unicode text. UTF-8 should be used wherever an interoperable binary encoding is required.

Whitespace

Whitespace is insignificant between tokens and significant within tokens: keywords, labels, chars, and strings.

Comments

Comments start with // and run to the end of the line.

Keywords

Several tokens are reserved WAVE keywords: true, false, inf, nan, some, none, ok, err. Variant or enum cases that match one of these keywords must be prefixed with %.
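
As a rough illustration of the `%`-prefix rule, here is a minimal sketch of an encoder helper. The names `KEYWORDS` and `encode_case_label` are hypothetical and are not part of the `wasm_wave` crate.

```rust
/// The reserved WAVE keywords listed above.
const KEYWORDS: [&str; 8] = ["true", "false", "inf", "nan", "some", "none", "ok", "err"];

/// Encodes a variant or enum case label, adding the `%` prefix only when
/// the label would otherwise be read as a keyword.
fn encode_case_label(label: &str) -> String {
    if KEYWORDS.contains(&label) {
        format!("%{}", label)
    } else {
        label.to_string()
    }
}
```

With this helper, `encode_case_label("err")` yields `%err`, while `encode_case_label("not-found")` is returned unchanged.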

Labels

Kebab-case labels are used for record fields, variant cases, enum cases, and flags. Labels use ASCII alphanumeric characters and hyphens, following the Wit identifier syntax:

Labels consist of one or more hyphen-separated words.
- one, two-words
Words consist of one ASCII letter followed by any number of ASCII alphanumeric characters.
- q, abc123
Each word can contain lowercase or uppercase characters but not both; each word in a label can use a different (single) case.
- HTTP3, method-GET
Any label may be prefixed with %; this is not part of the label itself but allows for representing labels that would otherwise be parsed as keywords.
- %err, %non-keyword
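
The label rules above can be checked with a few lines of code. This is a sketch only: `is_valid_label` and `is_valid_word` are hypothetical names, and the function expects the label without any `%` prefix.

```rust
/// Returns true if `word` starts with an ASCII letter, contains only ASCII
/// alphanumerics, and does not mix lowercase and uppercase letters.
fn is_valid_word(word: &str) -> bool {
    let starts_with_letter = word
        .chars()
        .next()
        .map_or(false, |c| c.is_ascii_alphabetic());
    let all_alphanumeric = word.chars().all(|c| c.is_ascii_alphanumeric());
    let has_lower = word.chars().any(|c| c.is_ascii_lowercase());
    let has_upper = word.chars().any(|c| c.is_ascii_uppercase());
    starts_with_letter && all_alphanumeric && !(has_lower && has_upper)
}

/// Returns true if `label` is one or more valid words separated by hyphens.
/// An empty label, or one with an empty word (e.g. "a--b"), is rejected.
fn is_valid_label(label: &str) -> bool {
    label.split('-').all(is_valid_word)
}
```

For example, `method-GET` passes because each word uses a single case, while `Http` fails because it mixes cases within one word.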

Bools

Bools are encoded as one of the keywords false or true.

Integers

Integers are encoded as base-10 numbers.

TBD: hex/bin repr? e.g. 0xab, 0b1011

Floats

Floats are encoded as JSON numbers or one of the keywords nan (not a number), inf (infinity), or -inf (negative infinity).
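
A minimal sketch of a float encoder following this rule is shown below. `encode_float` is a hypothetical name; the sketch relies on Rust's `Display` output for finite floats, which is plain decimal notation and therefore fits JSON number syntax.

```rust
/// Encodes an f64 as a WAVE float: a keyword for the special values,
/// otherwise a decimal number.
fn encode_float(x: f64) -> String {
    if x.is_nan() {
        "nan".to_string()
    } else if x == f64::INFINITY {
        "inf".to_string()
    } else if x == f64::NEG_INFINITY {
        "-inf".to_string()
    } else {
        // Finite values: Display never prints "NaN" or "inf" here.
        format!("{}", x)
    }
}
```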

Chars

Chars are encoded as '<char>', where <char> is one of:

a single Unicode Scalar Value
one of these escapes:
- \' → '
- \" → "
- \\ → \
- \t → U+9 (HT)
- \n → U+A (LF)
- \r → U+D (CR)
- \u{···} → U+··· (where ··· is a hexadecimal Unicode Scalar Value)

Escaping newline (\n), \, and ' is mandatory for chars.
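
The escape table above could be decoded roughly as follows. This is a sketch under assumptions: `decode_escape` is a hypothetical helper that receives the text following the backslash (for example `n` or `u{2603}`) and returns `None` for anything that is not a listed escape.

```rust
/// Decodes the part of an escape sequence that follows the backslash.
fn decode_escape(body: &str) -> Option<char> {
    match body {
        "'" => Some('\''),
        "\"" => Some('"'),
        "\\" => Some('\\'),
        "t" => Some('\t'),
        "n" => Some('\n'),
        "r" => Some('\r'),
        _ => {
            // \u{...}: hexadecimal digits naming a Unicode Scalar Value.
            let hex = body.strip_prefix("u{")?.strip_suffix('}')?;
            if hex.is_empty() || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
                return None;
            }
            let code = u32::from_str_radix(hex, 16).ok()?;
            // char::from_u32 rejects surrogates and out-of-range values.
            char::from_u32(code)
        }
    }
}
```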

Strings

Strings are encoded as a double-quote-delimited sequence of <char>s (as for Chars).

Escaping newline (\n), \, and " is mandatory for strings.
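
For illustration, a string encoder that applies the mandatory escapes might look like the sketch below. `encode_string` is a hypothetical name; it also escapes tab and carriage return, which is a readability choice of this sketch rather than something the text above requires.

```rust
/// Encodes a Rust string as a double-quoted WAVE string.
fn encode_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            // Mandatory escapes for strings.
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            // Optional escapes, chosen here for readability.
            '\t' => out.push_str("\\t"),
            '\r' => out.push_str("\\r"),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}
```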

Multiline Strings

A multiline string begins with """ followed immediately by a line break (\n or \r\n) and ends with a line break, zero or more spaces, and """. The number of spaces immediately preceding the ending """ determines the indent level of the entire multiline string. Every other line break in the string must be followed by at least this many spaces which are then omitted ("dedented") from the decoded string.

Each line break in the encoded string except for the first and last is decoded as a newline character (\n).

Escaping \ is mandatory for multiline strings. Escaping carriage return (\r) is mandatory immediately before a literal newline character (\n) if it is to be retained. Escaping " is mandatory where necessary to break up any sequence of """ within a string, even if the first " is escaped (i.e. \""" is prohibited).
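
The dedent rule described in this section can be sketched as follows. `dedent_multiline` is a hypothetical helper that takes the whole token including its `"""` delimiters. It does not process escape sequences, and it assumes the closing delimiter sits on its own line after the opening line break.

```rust
/// Strips the delimiters and the closing-delimiter indent from a multiline string.
fn dedent_multiline(token: &str) -> Result<String, String> {
    let body = token
        .strip_prefix("\"\"\"")
        .and_then(|rest| rest.strip_suffix("\"\"\""))
        .ok_or("missing triple-quote delimiters")?;
    // The opening delimiter must be followed immediately by a line break.
    let after_open = body
        .strip_prefix("\r\n")
        .or_else(|| body.strip_prefix('\n'))
        .ok_or("opening delimiter must be followed by a line break")?;
    // Everything after the last line break is the indent of the closing delimiter.
    let (content, indent) = after_open
        .rsplit_once('\n')
        .ok_or("closing delimiter must be on its own line")?;
    if !indent.chars().all(|c| c == ' ') {
        return Err("only spaces may precede the closing delimiter".to_string());
    }
    let mut lines = Vec::new();
    for line in content.split('\n') {
        // A literal \r before a line break belongs to the line break itself.
        let line = line.strip_suffix('\r').unwrap_or(line);
        let dedented = line
            .strip_prefix(indent)
            .ok_or_else(|| format!("line {:?} is indented less than the closing delimiter", line))?;
        lines.push(dedented);
    }
    Ok(lines.join("\n"))
}
```

Applied to a token whose closing delimiter is indented by two spaces, the function removes exactly two spaces from every content line and keeps any extra indentation.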

"""
A single line
"""

→ "A single line"

"""
    Indentation determined
      by ending delimiter
  """

→

"  Indentation determined\n    by ending delimiter"

"""
  Must escape carriage return at end of line: \r
  Must break up double quote triplets: ""\""
  """

→

"Must escape carriage return at end of line: \r\nMust break up double quote triplets: \"\"\"\""

Tuples

Tuples are encoded as a parenthesized sequence of comma-separated values. Trailing commas are permitted.

tuple<u8, string> → (123, "abc")

Lists

Lists are encoded as a square-bracketed sequence of comma-separated values. Trailing commas are permitted.

list<char> → [], ['a', 'b', 'c']

Records

Records are encoded as a curly-braced set of comma-separated record entries. Trailing commas are permitted. Each record entry consists of a field label, a colon, and a value. Fields may be present in any order. Record entries with the option-typed value none may be omitted; if all of a record's fields are omitted in this way, the "empty" record must be encoded as {:} (to disambiguate from an empty flags value).

record example {
  must-have: u8,
  optional: option<u8>,
}

→ {must-have: 123} = {must-have: 123, optional: none,}

record all-optional {
  optional: option<u8>,
}

→ {:} = {optional: none}
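
A small sketch of how an encoder might apply the omission rule and the `{:}` special case is given below. `encode_record` is a hypothetical helper; it takes field values that are already encoded as WAVE text, with `None` standing for an option-typed field whose value is `none`.

```rust
/// Encodes a record, omitting option-typed fields whose value is `none`.
fn encode_record(fields: &[(&str, Option<&str>)]) -> String {
    let entries: Vec<String> = fields
        .iter()
        .filter_map(|(label, value)| value.map(|v| format!("{}: {}", label, v)))
        .collect();
    if entries.is_empty() {
        // All fields were omitted: `{}` would be ambiguous with empty flags.
        "{:}".to_string()
    } else {
        format!("{{{}}}", entries.join(", "))
    }
}
```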

Note: Field labels may be prefixed with % but this is never required.

Variants

Variants are encoded as a case label. If the case has a payload, the label is followed by the parenthesized case payload value.

If a variant case matches a WAVE keyword it must be prefixed with %.

variant response {
  empty,
  body(list<u8>),
  err(string),
}

→ empty, body([79, 75]), %err("oops")

Enums

Enums are encoded as a case label.

If an enum case matches a WAVE keyword it must be prefixed with %.

enum status { ok, not-found } → %ok, not-found

Options

Options may be encoded in their variant form (e.g. some(...) or none). A some value may also be encoded as the "flat" payload value itself, but only if the payload is not an option or result type.

option<u8> → 123 = some(123)

Results

Results may be encoded in their variant form (e.g. ok(...), err("oops")). An ok value may also be encoded as the "flat" payload value itself, but only if it has a payload which is not an option or result type.

result<u8> → 123 = ok(123)
result<_, string> → ok, err("oops")
result → ok, err
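
The restriction on flat encoding exists because a flattened nested option or result would be ambiguous: for `option<option<u8>>`, a bare `none` could mean either the outer or the inner `none`. The sketch below illustrates the check for options; `Ty`, `can_flatten`, and `encode_some` are hypothetical names, and the payload is passed in already encoded. The same check would apply to `ok` payloads of results.

```rust
/// A tiny, hypothetical subset of Component Model types for this sketch.
#[allow(dead_code)]
enum Ty {
    U8,
    Str,
    Option(Box<Ty>),
    Result(Box<Ty>, Box<Ty>),
}

/// A payload may be flattened only if it is not itself an option or result.
fn can_flatten(payload: &Ty) -> bool {
    !matches!(payload, Ty::Option(_) | Ty::Result(_, _))
}

/// Encodes a `some` value, using the flat form whenever it is permitted.
fn encode_some(payload_ty: &Ty, encoded_payload: &str) -> String {
    if can_flatten(payload_ty) {
        encoded_payload.to_string()
    } else {
        format!("some({})", encoded_payload)
    }
}
```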

Flags

Flags are encoded as a curly-braced set of comma-separated flag labels in any order. Trailing commas are permitted.

flags perms { read, write, exec } → {write, read,}

Note: Flags may be prefixed with % but this is never required.

TBD: Allow record form? {read: true, write: true, exec: false}

Resources

TBD (<named-type>(<idx>)?)

Appendix: Function calls

Some applications may benefit from a standard way to encode function calls and/or results, described here.

Function calls can be encoded as some application-dependent function identifier (such as a kebab-case label) followed by parenthesized function arguments.

If function results need to be encoded along with the call, they can be separated from the call by ->.

my-func("param")
    
with-result() -> ok("result")

Function arguments

Arguments are encoded as a sequence of comma-separated values.

Any number of option none values at the end of the sequence may be omitted.

// f: func(a: option<u8>, b: option<u8>, c: option<u8>)
// all equivalent:
f(some(1))
f(some(1), none)
f(some(1), none, none)
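
Dropping trailing `none` arguments could be implemented along these lines. `encode_call` is a hypothetical helper; each argument is passed already encoded, with `None` standing for an option-typed argument whose value is `none`.

```rust
/// Encodes a function call, omitting any run of `none` values at the end.
fn encode_call(name: &str, args: &[Option<&str>]) -> String {
    // Keep everything up to and including the last argument that is not `none`.
    let keep = args.iter().rposition(|a| a.is_some()).map_or(0, |i| i + 1);
    let encoded: Vec<&str> = args[..keep]
        .iter()
        .map(|a| a.unwrap_or("none"))
        .collect();
    format!("{}({})", name, encoded.join(", "))
}
```

Note that a `none` in the middle of the list is still written out, since only trailing values may be omitted.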

TBD: Named-parameter encoding? e.g. my-func(named-param: 1) Could allow omitting "middle" none params.

Function results

Results are encoded in one of several ways depending on the number of result values:

Any number of result values may be encoded as a parenthesized sequence of comma-separated result entries. Each result entry consists of a label (for named results) or zero-based index number, a colon, and the result value. Result entry ordering must match the function definition.
Zero result values are encoded as () or omitted entirely.
A single result value may be encoded as the "flat" result value itself.

-> ()

-> some("single result")
// or
-> (0: some("single result"))
    
-> (result-a: "abc", result-b: 123)
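
The three result forms can be sketched as below. `encode_results` is a hypothetical helper; each result is a pair of an optional label and an already-encoded value. This sketch always picks the flat form for a single result, which is one permitted choice among those listed above.

```rust
/// Encodes function results: `()` for none, the flat value for one,
/// and a parenthesized list of labeled or indexed entries otherwise.
fn encode_results(results: &[(Option<&str>, &str)]) -> String {
    match results {
        [] => "()".to_string(),
        [(_, value)] => value.to_string(),
        many => {
            let entries: Vec<String> = many
                .iter()
                .enumerate()
                .map(|(i, (label, value))| match label {
                    Some(name) => format!("{}: {}", name, value),
                    // Unnamed results use their zero-based index as the key.
                    None => format!("{}: {}", i, value),
                })
                .collect();
            format!("({})", entries.join(", "))
        }
    }
}
```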

🌊

wave's People

sunfishcode esoterra fibonacci1729

wave's Issues

Another Multi-line/raw string syntax

I think #21 is workable, and have posted some suggestions for it. I also wanted to file this issue to brainstorm in the direction of BigWave's multi-line/raw string syntax.

> here is a raw string

field:
   > here is a
   > multi-line string that is a value
   > of a record field

For a more complete example, here's an example of the syntax in #21:

  {
    build: %"
      node -c "console.log('hello, world!');"
      echo "foo" > some-file.txt
    %"
  }

and the same code with BigWave strings:

  {
    build:
      > node -c "console.log('hello, world!');"
      > echo "foo" > some-file.txt
  }

Advantages of %":

Doesn't need special comma rules when it appears within e.g. a list.
Doesn't need a special end-of-file rule to know whether the file has been truncated.

Advantages of > :

More compact; uses one less line in multi-line cases.
Behavior is insensitive to indentation.
Easily syntax-highlighted with regexes (context).
Can represent any string, including strings where every line starts with whitespace.

I'm not strongly attached to any of the specifics here, I just wanted to brainstorm around this direction.

Keyword ambiguities

WAVE's formal description is currently ambiguous wrt nan and inf which can be interpreted either as float literals (not-a-number and infinity) or as variant case labels.

Additionally, the language description side-steps other similar potential ambiguities with bool (true, false), option (some(...), none) and result (ok(...), err(...)) types by grouping them all under the variant-case rule.

As with #17 the WAVE parser in this repo avoids this ambiguity by being type-driven, but another implementation may want to parse to an unambiguous AST.

Two options:

Remove nan and inf from the number rule and group them under variant-case. This would - annoyingly - leave behind -inf which isn't a valid variant-case.
Reserve some or all of the ambiguous labels as keyword tokens and require %-prefixing for variant labels that use these keywords. Sub-options here:
- Reserve only nan and inf
- Additionally reserve true and false (which could tenuously be argued to be variants)
- Additionally reserve some, none, ok, err (which could less-tenuously be argued to be variants)

Feature idea: sub-tree comments

kdl has a comment syntax /- ("slashdash") which comments out a whole subtree. Translated into Wave, this might look like:

[
   this-is-here,

   // commented-out,

   /- also(
      "commented out!"
   ),
]

These make it easy to comment out a record field or list element or similar with a simple one-line edit.

Add CI

Treat `inf`, `nan`, `true`, `false` as keywords

Currently there's ambiguity with inf/nan between floating-point or variant-case. We could just accept this as a quirk that we resolve using the type, however an advantage of resolving it in the grammar is that type-unaware syntax highlighters would be able to highlight inf, nan, true, and false differently from identifiers.

Consequently, I propose we interpret inf, nan, true, and false as keywords, and require a % when they're intended as identifiers.

Should we do the same for some/none/ok/err? This feels less important, because option/result despecialize to variant anyway. However, they are somewhat special, with special syntax in a few places. I'd be ok going either way on these, but have a slight preference for treating some/none/ok/err as keywords.

Thoughts?

"Empty set" ambiguity

An "empty set" value {} has two possible interpretations: a flags value with zero flags set or a record value with all field values set to none (because "Record entries with the option-typed value none may be omitted").

The WAVE parser in this repo avoids this ambiguity because it is type-driven, but any alternative implementation that wanted to parse to an AST would need to address it.

I see two obvious options:

Disallow {} for records. A record with all field values set to none would need to specify at least one of those fields explicitly, e.g. {optional-field: none}
Give {} its own formal rule (e.g. empty-set := '{' '}') and accept the AST ambiguity. AST ambiguity is already present for other Component Model value representations such as various number types and bools (vs variant case labels).

Add "native" `Value` type

While this crate is primarily intended for runtime interop with wasmtime, it might be convenient (for tests if nothing else) to have easier-to-construct Value and ValueType enums that parallel wasmtime::component::Value/Type.

Improve parser errors on incorrect type

Currently, when e.g. trying to parse a bool but getting a number, the error is:

expected [Name], got Some(Number)

Should at least include the type currently being parsed, e.g.

error parsing Bool: expected [Name], got Some(Number)

Multi-line/raw strings

This doesn't have existing WIT design to draw from so there are some ideas drawn from other languages:

Repeated delimiter chars allow some armoring against inner delimiters, like Rust's r###"..."###
Leading newline and common indent in multi-line strings is stripped, like Python's docstring processing

The rules are intended to make these strings easy for humans to read and write; WAVE encoders generally wouldn't produce them.

Examples

%"raw string; no escape: \n"%

→ "raw string; no escape: \\n"

%"Starting with a non-newline
  preserves indents."%

→ "Starting with a non-newline\n  preserves indents."

%"

A single leading newline is stripped."%

→ "\nA single leading newline is stripped."

%"
Trailing newlines are preserved
"%

→ "Trailing newlines are preserved\n"

%"
  Starting with a newline
  strips common indents.
"%

→ "Starting with a newline\nstrips common indents.\n"

%%"Wider delimiters allows "% <- delimiters inside"%%

→ "Wider delimiters allows \"% <- delimiters inside"

Raw string

One or more %s, then ", then any UTF-8 characters until the closing delimiters: " followed by the same number of %s as the opening delimiter:

/(%+)".*?"(\1)/

%"…"% ≡ %%"…"%% ≡ %%%%%"…"%%%%%
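
A hand-written equivalent of the delimiter matching could look like this sketch. `parse_raw_string` is a hypothetical helper that returns the raw contents and the remaining input; like the non-greedy regex, it stops at the first closing delimiter it finds.

```rust
/// Parses a raw string at the start of `input`, returning (contents, rest).
fn parse_raw_string(input: &str) -> Option<(&str, &str)> {
    // Count the leading `%`s of the opening delimiter.
    let percents = input.len() - input.trim_start_matches('%').len();
    if percents == 0 {
        return None;
    }
    let body = input[percents..].strip_prefix('"')?;
    // The closing delimiter is `"` followed by the same number of `%`s.
    let closer = format!("\"{}", "%".repeat(percents));
    let end = body.find(closer.as_str())?;
    Some((&body[..end], &body[end + closer.len()..]))
}
```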

Multi-line post-processing

Note: While described as a separate post-processing step for clarity this can be implemented as part of normal parsing.

Newline means "\r\n" or "\n". Space means " ".

When a raw string starts with a newline, multi-line post-processing applies to the string contents:

Skip the leading newline
Skip any leading spaces; the number of spaces skipped is the indent of the string
Loop:
- Skip spaces until:
  - indent spaces have been skipped: continue
  - end-of-string is reached: return the output
  - a non-space character is found: return "invalid indent" error
- Copy characters verbatim into the output until:
  - end-of-string is reached: return the output
  - a newline is found: copy the newline to the output and continue the loop
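
The steps above can be sketched as follows. `postprocess_multiline` is a hypothetical name, and the input is the raw string contents without delimiters. The sketch takes the indent from the first line after the leading newline and then applies the loop to each following line; that reading of the step order is an assumption.

```rust
/// Applies multi-line post-processing to raw string contents.
fn postprocess_multiline(contents: &str) -> Result<String, String> {
    // Post-processing only applies when the contents start with a newline.
    let rest = match contents
        .strip_prefix("\r\n")
        .or_else(|| contents.strip_prefix('\n'))
    {
        Some(rest) => rest,
        None => return Ok(contents.to_string()),
    };
    // The leading spaces of the first line set the indent.
    let indent = rest.len() - rest.trim_start_matches(' ').len();
    let mut out = String::new();
    for line in rest.split_inclusive('\n') {
        // Skip at most `indent` spaces at the start of the line.
        let spaces = line
            .bytes()
            .take(indent)
            .take_while(|&b| b == b' ')
            .count();
        let after = &line[spaces..];
        if spaces < indent {
            if after.is_empty() {
                // End-of-string reached while skipping spaces.
                break;
            }
            return Err("invalid indent".to_string());
        }
        // Copy the rest of the line, including its newline if present.
        out.push_str(after);
    }
    Ok(out)
}
```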

It might be better to trigger these rules more explicitly than "when a raw string starts with a newline"...

Constrain char/string literal characters

Currently char and string escapes are only required where lexing would otherwise fail: \\ and either \' (char) or \" (string).

In order to avoid human parsing problems, I think a few other escapes should be mandatory:

control characters
- ~~including newline (see multiline string issues)~~ (done)
- should tab be excluded?
possibly some of the more "problematic" unicode characters:
- bidi chars (source of CVEs)
- unicode deprecations
- see https://github.com/peterhuene/wac/blob/b294ae04cd85f12a619db08165b3117b8f977b0d/crates/wac-parser/src/lexer.rs#L46-L82

Multiline strings Ⅲ

This design is adapted from Swift's multiline string syntax.

Examples

"""
A "multiline" string
"""

→

"A \"multiline\" string"

{
  description: """
    Common indent
    is stripped
    """,
  body: """
      Extra indent allowed
     on any content line
    """,
}

→

{
  description: "Common indent\nis stripped",
  body: "  Extra indent allowed\n on any content line",
}

"""

Leading and trailing lines preserved

"""

→

"\nLeading and trailing lines preserved\n"

Description

A multiline string consists of:

the opening delimiter: """\n
any number of content lines: (?<indent> *)[^\n]*\n (ignoring escapes for brevity)
- normal string escape sequences apply; the only mandatory escapes are \\ and any \" necessary to break up a substring of """
- TBD: special-case unindented blank lines?
the closing delimiter: (?<indent> *)"""

The content lines are post-processed: the number of indent spaces in the closing delimiter is considered the indent level of the entire string and stripped from each content line. Content lines are joined with newlines to form the output.

TBD: strip trailing \r?

Follow-up: Raw Strings

This would be a separate (overlapping) feature, allowing both single-line raw strings %"which may not contain newlines"% and:

%"""
  Multi-line strings which do not interpret \ escapes
  and permit """unescaped multiline string delimiters"""
  """%

→

"Multi-line strings which do not interpret \\ escapes\nand permit \"\"\"unescaped multiline string delimiters\"\"\""

See #21 for previous discussion and #26 for an alternative proposal.
