0xpolygonmiden / miden-vm Goto Github PK

View Code? Open in Web Editor NEW

598.0 21.0 143.0 71.36 MB

STARK-based virtual machine

License: MIT License

Rust 99.97% Makefile 0.03%

cryptography virtual-machine stark zero-knowledge

miden-vm's Issues

Instruction for computing absolute address of a local variable

In some situations it may be useful to get the absolute memory location of a local variable.

For example, let's say we have a procedure for multiplying two 256-bit integers (u256:mul) this procedure expects memory addresses of operands a and b to be on the top of the stack, and after it performs the multiplication, it will leave the result on the top of the stack (we could have passed operands a and b via the stack as well, but for the purpose of this example, we do it via memory).

Thus, to invoke u256:uml the caller needs to put absolute addresses of a and b onto the stack, but if a and b "live" in local memory of the caller, it is not clear how to do that (since we don't have an instruction for accessing the value of fmp directly).

We could add assembly instructions to access fmp and then the caller would be able to compute the absolute address, but i wonder whether having a more specialized instruction is more convenient. I'm thinking of something like this (though open to other name/format suggestions):

push.env.locaddr.i

where i is the index of the local. This instruction would read the fmp, subtract i from it, and push the result onto the stack.

The only downside of this I can think of is that it would enable potential abuse of fmp where a procedure could return absolute address of its local (which would be a logical error).

Hash operation consistency

There are currently some inconsistencies in how we've implemented hashing operations. Specifically:

In the VM, RPPERM to compute permute(A, B, C), expects the stack to be arranged as [A, B, C, ...] where A is at the top of the stack.
In the assembly, rpperm and rphash operations assume that the stack is arranged as [C, B, A, ...] where C is the capacity portion of the sponge and is located at the top of the stack. In case of rphash we assume that once RPPERM is executed, the word A will contain the result of the hash.

One option to address this is to simply fix the implementation of RPPERM in the VM. However, it might be a good idea to have the capacity portion of the sponge deep in the stack (the way it is currently arranged for RPPERM). This way, when we compute a linear hash of many elements, we don't need to touch capacity elements (word C) as we update the outer portion of the sponge (words A and B).

At the same time, the arrangement [A, B, C, ...] is contrary to how we arrange stack for most other operations and might be confusing. For example, as we compute linear hash of many elements, we'd first need to populate the word A and only then the word B with new data - which will be awkward to do since A is at the top of the stack.

The other option is to modify how the Rp64_256 hash function actually works. Specifically, we can change it so that if the state is [A, B, C], it is expected that A is the capacity portion, and B is expected to contain the result of the hash after the permutation.

So, to summarize the options:

Fix RPPERM operation in the VM. This is the simplest option but would make linear hashes of many elements slightly less efficient.
Update assembly instruction rphash to work with the current implementation of RPPERM. This accepts the awkwardness of the current stack arrangement of RPPERM and also would make computing linear hashes of many elements less efficient.
Update Rp64_256 hash function to work as described above, and update RPPERM and related assembly instructions accordingly. The stack would be arranged as: [C, B, A, ...]. This is a more involved option, but perhaps the cleanest and most efficient.

u32 assembly parsers for bitwise ops do not fail for out-of-bounds inputs

As specified by the assembly operations specs, bitwise u32 operations should fail when given inputs >= 2^32.

Currently, the following operations do not fail correctly:

u32and
u32or
u32xor

These errors should be fixed in the in the Bitwise helper in the processor crate.

Memoization of hash execution traces

Currently, Hasher component of the processor always builds traces for hash computations from scratch. This happens even in cases when computing hashes of the same values more than once.

We can improve on this by keeping track of the hashes which have already been computed, and just copying the sections of the trace with minimal modifications. Specifically, the only thing that needs to be updated when computing hash for previously hashed values is the row address column of the trace - everything else would remain the same.

This can also be done at a higher level. For example, we could keep track of sections of a trace used for Merkle path verification and then, if the same Merkle path was verified more than once, we can just copy the relevant sections of the trace (again, with minimal modifications).

Refactor i/o assembly instructions

As discussed in #44 and in #73 we should refactor i/o assembly operations to follow the what.where format and also add new operations for handling memory-based local variables. Overall, the new i/o instruction set should be as follows:

Pushing values onto the stack:

push.xyz - similar to current push but should allow pushing multiple values (up to 8). For example, the following should be valid: push.1, push.1.2, push.1.2.0x3 etc.
pushw.0x... - this should push a single 32-byte value provided in hexadecimal format. Semantically, this should be equivalent to push.a.b.c.d where a is the first 8 bytes of the value, b is the next 8 bytes etc. (although, maybe we should reverse the order?).
push.evn.sdepth
pushw.mem.addr
pushw.local.idx where idx is the index of the local variable. using this instruction outside of a procedure should be an error.
push.adv.n where n can be up to 8.

Removing values from the stack:

popw.mem.addr
popw.local.idx where idx is the index of the local variable. using this instruction outside of a procedure should be an error.

Overwriting values on the stack:

loadw.mem.addr
loadw.local.idx where idx is the index of the local variable. using this instruction outside of a procedure should be an error.
loadw.adv

Save stack values without removing them:

storew.mem.addr
storew.local.idx where idx is the index of the local variable. using this instruction outside of a procedure should be an error.

To enable handling locals, we'll also need to update procedure parsers. The new procedure declaration format should include an optional specifier for the number of memory-based locals the procedure can access. For example:

proc.foo.2   # needs access to 2 memory-based locals
proc.bar     # same as proc.bar.0

If code within a procedure tries to access a local outside of the specified bounds, the assembler should return an error.

Stacking of auxiliary traces

The VM has several components traces of which should be "stacked" so that they all share the same set of execution trace columns. These components are:

Hasher
Bitwise processor
Memory controller

All of these components already implement fill_trace() methods which should help with the process. The overall procedure for building a trace is as follows:

First, we need to determine overall length of the execution trace (this would happen when we convert Process struct into the ExecutionTrace here). This length should be computed as be the max of the lengths of all segments of the trace. For the purposes of this issue, we care only about two segments: stack trace and auxiliary table trace.

Stack trace can be retrieved via Stack::trace_length() (which should probably be renamed into trace_len() for consistency). All components of the auxiliary table also expose trace_len(). So, the length should be: max(stack_len, hasher_len + bitwise_len + memory_len). And then we need to pad this length to the next power of two.

Once we have this length, we should allocate 18 vectors of this length, break it into fragments corresponding to lengths of each auxiliary component, and use fill_trace() to fill corresponding segment. We fill the fragments as follows:

First we fill the Hasher fragment. For this fragment values in column 0 are set to ZERO, and the remaining 17 columns contain the actual hasher trace.
Then we fill the Bitwsie fragment. For this fragment values in column 0 are set to ONE and values in column 1 are set to ZERO. The next 13 columns contain the actual bitwise processor trace. The remaining 3 columns are padded with ZERO
Then we fill the Memory fragment. For this fragment values in columns 0 and 1 are set to ONE, values in column 2 are set to ZERO. The next 14 columns contain the actual memory trace, and the last column is padded with ZERO.
If the above doesn't fill all the table completely, we need to pad the remaining rows. For padded rows values in the first 3 columns are set to ONE, and values in remaining columns are set to ZERO.

Even though the above is described sequentially - we can fill all fragments in parallel.

Also, the Bitwise trace is currently missing two columns (so, currently in the code the number of bitwise columns is 11 rather than 13). These are selector columns which we need to identify the bitwise operation being performed.

Procedure context tracking in the processor

Currently, the processor is unaware of the procedure context of an executing program. That is, the processor doesn't know which procedure is being executed, which procedure called the current procedure etc. This limits usefulness of debug info. Thus, it would be very useful if the processor maintained a "procedure stack".

To make a process aware of which procedures are executing, we could introduce two new decorators. Something like:

ProcStart - which the assembler would add at the start of each procedure.
ProcEnd - which the assembler would add right before a procedure returns.

Then, the processor would push a procedure onto a stack when it sees ProcStart decorator in the instruction stream, and remove a procedure from the stack when it sees ProcEnd decorator. This should be very lightweight - so, I don't think we should worry about performance implications.

Potential rationalization of bitwise operations

As I learned from @dlubarov, it is possible to emulate bitwise AND, OR, XOR operations using just one of these operations and some filed element operations. Specifically, assuming we have access to AND, the other two can be described as:

XOR(x, y) = x + y - 2 * AND(x, y)
OR(x, y) = x + y - AND(x, y)

Thus, we could eliminate 2 out of the following 3 operations: U32AND, U32OR, U32XOR. This wouldn't save us much in terms of trace length or constraint complexity, but it would allow us to remove 2 instructions from the instruction set if needed.

If we do go this route, it probably makes sense to keep XOR operation as the one natively supported by the VM and emulate the other two through it.

Potential instruction to help with u32 safety checking

For most u32 operations the "safe" variants are much more expensive than the "unsafe" variations - frequently at least 5x or even 10x more expensive. This might make the safe variants much less desirable to use. One of the reason for this is that asserting that a value is a u32 value is emulated by a sequence of 3 operations: U32SPLIT EQZ ASSERT. Thus, many instructions which work with 2 operands, asserting that the operands are 32 bit values requires 7 or 8 operations.

We could make safe operations much more efficient by introducing U32ASSERT2 operation in the VM. This operation would check whether the top two items on the stack are u32 values and if either one of them is not, return an error. Thus, we'd be able to do most safe operations at just 2x the cost of the unsafe ones.

From AIR constraint standpoint it shouldn't be too difficult to support this - i.e., this would require 4 16-bit range checks, which we need to support for other operations anyway (e.g. u32 mul).

Incorrect implementation of parse_u32div for u32div and u32div.b operations

The implementation of parse_u32div in the assembly u32 ops parser currently has 2 errors:

The parser accepts a parameter of 0 (u32div.0), which means that the operation fails during execution instead of during compilation.
The parser asserts in handle_arithmetic_operation that the top element after the processor's U32DIV op should be zero, which means that all divisions with non-zero remainder fail.

Single-element memory operations

Current assembly instructions for memory access deal with words (4 field elements). This creates quite a bit of over head for use case when we want to read/write a single field element. The overhead is in both places: the VM itself and the assembly instruction sequences needed to accomplish single-element reads/writes.

Adding dedicated operations to read/write single elements at the VM level is possible, but we should probably do it at a later date, once we have a better picture of how frequently these operations are used. However, we could add instructions at the assembly level to help with this. The instructions I'm thinking of are:

push.mem.addr - pushes the first element of the word at addr onto the stack.
push.local.idx - same as above but for a local.
pop.mem.addr- pops the element off the stack and saves it addr.
pop.local.idx- same as above but for a local.

Additional advice sets

Advice sets (src) are used to provided structured non-deterministic inputs to the VM. Currently, the only concretely implemented advice set is a Merkle tree. But anything that meets the requirements of an advice set interface could also work. The requirements of the interface are roughly as follows:

An advice set must have a root.
It should be possible to retrieve a node from the advice set from a given depth and position.
It should be possible to retrieve a Merkle path from a given node to the root.
It should be possible to update a node of an advice set. Any such update should change the value of the root.

Some other options for advice sets include:

Sparse Merkle trees.
Collections of Merkle paths (all of which resolve to the same root).

Standard library v0.1 content

Now that support for standard library imports has landed in #121 we should start discussing what we want to include into the initial release of stdlib. Here are my preliminary thoughts on the library content:

math
- u64 - arithmetic and bitwise operations with 64-bit unsigned integers.
- u128 - not sure if we need this, maybe skip for v0.1?
- u256 - needed to support EVM. we should discuss if we want to try to support all operations in the original release.
crypto
- hashes
  - blake3 - already being worked on in #77
  - sha256 - already being worked on in #123
  - Keccak256- probably defer to future versions.
- sigs (verification only)
  - horst - an instantiation of HORST signature scheme using default hash function of the VM. Could be a good way to test out mtree.get instruction.
  - schnorr - an instantiation of Schnorr signature using the default curve of the VM. Definitely delay till later.
evm - EVM-specific built-ins not covered above. Probably defer to future releases.

Anything else I'm missing?

Standard library organization

Given that there are already things on the horizon which can go into standard library (e.g., #77), I think it might be a good time to think about how to organize standard library.

I'm inclined to keep things as simple as possible (at least at this point) - so, a few assumptions:

Let's not worry about user-defined libraries at this point. We can come back to these later.
Let's assume that standard library is made up only of procedures as described in assembly specs. We can differentiate between library and local procedures at a later point.
Let's assume that the library will relatively small.

The way I'm currently thinking about is that the assembler will load the full standard library at instantiation time. For this, we'll need to build a procedure map (similar to procedure map for local procedures) in the assembler constructor. Then, when a script is being parsed, the assembler can look up a given procedure in this map and inline it as it does with local procedures.

A few preliminary questions to consider:

Should standard library live in assembly crate or its own crate? I'm inclined to think that in its own crate (e.g., stdlib).
What extensions should we use for assembly files? I'm thinking .ma (Miden assembly) - but open to other suggestions.
Should we support module hierarchies? It seems like we should - but what's the simplest way to do it?
- This is probably by far the most complicated part. If there are modules, can some modules import procedures from other modules? How do we handle module resolution? etc.
Should we have something similar to use statements in Rust? Or should standard library procedures be always referred to by fully qualified names? e.g., exec.crypto:hashes:blake3 vs use std:crypto:hashes:blake3 and then in the code exec.blake3.
How does the assembler load the standard library? Would it build one giant string of all the code in the library and parse it in one go? Or would it do so file by file? And if file by file, what determines the order?

Implement handling of u32 operations

We should add support for u32 operations in v0.2 release according to the specs described here.

This will also require updating VM instruction set to support u32 operations.

Optimize the logic for inlining procedures

The way procedures are parsed and inlined causes extra noop operations to be added and can be optimized.

As noted in the specs, the rules for grouping operations inside of span blocks are:

the maximum number of operations per group is 9 (since a group is represented by a single 64-bit field element and each opcode is encoded with 7 bits)
the first group of a batch can start with any operation
for all other groups in a batch (groups 2-8), an operation carrying an immediate value (e.g. push.2) cannot be the first operation in the group. In this case, the VM must insert a noop in front of it.

Currently, it seems that noop insertion is done during procedure parsing and then again during procedure inlining, causing extra unnecessary noop instructions to be added.

Here are some (very contrived) examples to demonstrate:

Example - Procedure Without Locals

Assembly:

    proc.foo
        push.0 push.0 push.0 push.0
        drop drop drop drop
        push.3 push.7 mul
    end
    begin
        push.2 push.3 add
        exec.foo
    end

Compiled:

    begin
        span
            push(2) push(3) add
            pad pad pad pad
            drop drop drop drop
            push(3) noop push(7) mul
        end
    end

The noop is added during the parsing of the procedure where push(7) would be the 10th instruction and would therefore be the first instruction in a group. However, once the procedure is inlined, this noop is no longer necessary, because at this point the second group starts with drop. The batches in the compiled script seem to regroup the operations correctly such that the push(3) and push(7) from the procedure are no longer in different groups, as seen here:

Op Batches:

op_batches: [
        OpBatch { 
          ops: 
            [Push(BaseElement(2)), Push(BaseElement(3)), Add, Pad, Pad, Pad, Pad, Drop, Drop, Drop, Drop, Push(BaseElement(3)), Noop, Push(BaseElement(7)), Mul], 
          groups: 
            [BaseElement(1016745240677138690), BaseElement(2), BaseElement(3), BaseElement(241055074062), BaseElement(3), BaseElement(7), BaseElement(0), BaseElement(0)] 
        }]

Example - Procedure With Locals

Assembly:

        proc.foo.3 
            push.0 push.1 
            storew.local.2 
            add 
            push.0 
            loadw.local.2 
            mul 
            push.0 push.0 push.0 drop drop 
        end 
        begin 
            push.4 push.3 push.2 push.1 
            exec.foo 
        end

Compiled:

        begin 
            span 
                push(4) push(3) push(2) pad incr 
                push(3) fmpupdate 
                pad pad incr 
                push(2) neg fmpadd storew 
                add 
                pad 
                noop push(2) neg fmpadd loadw 
                mul 
                pad pad pad drop drop 
                noop push(3) neg fmpupdate 
            end 
        end

In this example, the first noop is added on line 9 during procedure parsing where the 1st operation of the second group of the procedure operations is push(2) (the first VM instruction for loadw.local.2). Once the procedure is inlined, the placement of this noop is no longer necessary. However, once all operations are combined, the second noop needs to be added, since, after the inlining, the push(2) on line 12 would be the first operation in a group. Notice that without the first unnecessary noop this second noop would not be required.

Update u32 ops proptests in the processor to only test valid cases

Given that proptest may need different sampling for valid vs. invalid inputs, it makes sense to handle these separately.

As discussed, we've decided to handle error cases in manual tests and only test validity in the randomized tests.

This should be updated in the processor u32 tests here

Maybe invalid cases can be handled in manual tests? And the test validity of operations over valid inputs using randomized tests?

Originally posted by @bobbinth in #75 (comment)

Project Contributions

Hi there,

Is there a way to start contributing to Miden ? Excited about the project would love to get involved in some capacity

Best,

Project Status Question

Hi there,

Is Miden ready to start developing on for developers or would it be advisable to wait for a later release ? Excited about the project but just wanted to get a sense of how stable the APIs are and what release version is targeted at new developers.

Thanks so much :)

Best,

Stack cleanup before program exit

One of the conditions we currently impose on program execution is that after a program completes, there must be no items left in the stack overflow table (i.e., stack depth must be 16). It is probably a good idea to have the assembler take care of this. Specifically, at the end of the main script body the assembler could include a call to a procedure which make sure the extra items are dropped from the stack.

This procedure should probably be located in the standard library (maybe in std::sys module?), and it should ensure the following:

The depth of the stack after the procedure executes must be 16.
The top 16 items which were on the stack before the procedure was called must remain on the stack after the procedure exited.

Better handling of decorator operations

Currently, we have a single Operation type which encodes all operations which can be executed on the VM. These operations fall into two main categories:

Regular operations (e.g., ADD, MUL etc.)
Decorators (e.g., DEBUG, ADVICE). As opposed to regular operations, decorators do not advance the clock cycle of the VM, and do not affect the state of the deterministic VM components (e.g., stack, memory).

The idea behind decorators is that we can use this mechanism to provide "hints" to the VM for debugging and other purposes. However, the way the decorators are implemented currently, leads to a couple of undesired consequences.

First undesired consequence is that decorators are relatively "heavy". For example, the current size of an Operation is 48 bytes (ideally, it would be just one byte, though this ideal might be difficult to achieve).

Second undesired consequence is that decorators can affect program hash. Specifically, decorators are placed into span blocks and as long as there is at least one regular operation in the same block, all is fine. But if there are no regular operations, we basically have a block which consists of noops, and there is no way to eliminate it. This should happen relatively rarely - but this edge case is possible, and thus we need to handle it.

To sum up: we want to support decorators, but we want to maintain the following properties:

They should not affect the size of Operation type.
Presence of decorators should not affect program hash.

One way to achieve this is to introduce a separate Decorator type, and have a separate data structure which specifies which decorator is to be executed at a given clock cycles (this would be somewhat similar to how we handle this in v0.1). There might be performance implications to this. Specifically, on each cycle we'd need to check if there are decorators to be executed before executing a regular operation. My hope is that these performance implications would be close to negligible.

However, this still doesn't address the second issue. Consider the following example:

while.true
  <instructions>
end
debug
while.true
  <instructions>
end

The debug decorator here would have to be located in an empty span block, and this would mean that the hash of the program would be different from exactly the same program but without the debug decorator.

One potential solution to this is to also put decorators into other blocks (not just span blocks). But this may add a considerable amount of complexity.

Panic in u32div operation processor

u32div operation processor currently incorrectly handles division by zero. It panics, but instead should return ExecutionError::DivideByZero.

We should also update relevant u32tests to expect an error instead of a panic.

unexpected stack underflow with u32split

TL;DR:

There seems to be an issue in one of the 2 following places:

a problem with how the stack depth is managed for operations that output more elements than they take as inputs
a bug in the check_depth requirement in op_u32split + a lack of documentation about stack requirements for that op

Description

According to the assembly ops specs here, which specify a stack input for u32cast of [a, ...] and stack output of [b, ...] I expected the following minimal execution test to pass, but it fails with a stack underflow error.

#[test]
fn u32cast() {
    let script = compile("begin u32cast end");

    let inputs = build_inputs(&[1]);
    let trace = execute(&script, &inputs).unwrap();
    let last_state = trace.last_stack_state();
    let expected_state = build_stack_state(&[1]);

    assert_eq!(expected_state, last_state);
}

Essentially what's happening is that when the processor does a u32split & attempts to set stack[1] with the low 32-bit values, it panics, because the current (and required) stack depth is only 1. If I add an extra input to the stack then everything works fine, and that input isn't touched.

Relevant backtrace:

stack backtrace:
   0: rust_begin_unwind
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:100:14
   2: miden_processor::stack::Stack::set
             at ./src/stack/mod.rs:99:9
   3: miden_processor::operations::u32_ops::<impl miden_processor::Process>::op_u32split
             at ./src/operations/u32_ops.rs:19:9

Question

This seems to be a bug either in the way the stack depth is managed for operations that output more elements than they take as inputs or a bug in the check_depth assertion in op_u32split. What gave me pause was that in the u32split unit test in the processor you're initializing the stack with 2 values, which suggests that maybe the current behaviour is intentional but wasn't documented yet (or is in a place I haven't seen).

What is supposed to be happening here?

u64 module for stdlib

As mentioned #125, we should add a module for 64-bit unsigned integer operations to the standard library. Initially, it should probably include the following procedures:

add
sub
mul
div
mod,
eq
eqz
lt
lte
gt
gte
and
or
xor
shl - might be difficult without #126
shr - might be difficult without #126

Also, for most of these (excluding bitwise operations), we probably should do the _unsafe variants.

Debug operation

We need to implement the debug operation according to the specs described here. Implementing this operation would require:

Implementing an options struct for the Debug operation enum here.
Implement parsing of the assembly instruction (will need to be added here).
Implement processing the Debug operation here.

Machine architecture for Miden VM

Miden VM is currently a stack machine, and while we are fairly far down the path of implementing it as such, it's worth listing pros and cons of different architectures, and taking another look at the alternatives.

I have the following architectures in my consideration set:

Stack machine: all instructions work with the top of the stack, and the stack can grow fairly large (unlimited, for all practical purposes).
Pseudo-stack machine: has a fixed set of registers (say 16 or 20) which behave similar to a stack - i.e., instructions still work with the top of the stack, but when items "fall off" the stack they are lost forever.
Register machine: instructions works with a set of general purpose registers - i.e., instructions read from one or more registers and write to one (or more) registers.
Memory machine: instructions read and write directly from memory.

With all of the above, we assume that there is also random access memory and instructions can move data btween stack/registers and memory.

The sections below attempt to list pros and cons of each approach, though, the lists are probably not going to be exhaustive.

Stack machine

Pros:
- Compact instruction format: instructions can be encoded with very few bits. Specifically, we need just 6 or 7 bits to encode a single instruction. This means we can pack 9 or 10 instructions into a single field element, and this greatly improves efficiency of computing program hash (a single permutation of Rescue prime can hash 70 - 80 instructions).
- Variable instruction inputs: it is fairly easy to have many different instruction formats, and have instructions which consume lots of inputs. For example, we have MPVERIFY instruction which consumes 10 field elements as inputs and MRUPDATE instruction which takes 14 field elements as inputs.
- Relatively easy handling of trusted procedures: when executing trusted code, we can be sure that the code will consume items at the top of the stack and items deep in the stack will still be there when the procedure returns. This also makes passing parameters to and reading return values from procedures pretty easy (everything is always at the top of the stack).
Cons:
- Stack manipulation instructions: we need to continuously move data at the top of the stack so that it is arranged just right for a given instruction. This leads to two cons:
  - First, we need to have lots of instructions which manipulate the stack. In the current instruction set, almost half of all instructions are there to move data on the stack.
  - Second, the programs become longer because a big part of each program is taken up by stack manipulation. It doesn't matter too much in terms of program size (given how compact the instructions are, most programs are still probably going to be smaller than with other alternatives), but it does waste VM cycles.
- Overflow table: since we can't keep all of the stack in registers, we need to have an overflow table. This table adds some complexity to AIR constraint and also adds about 8 - 10 extra columns to the execution trace (this is roughly 10% of the total number of columns).
- Checking instruction validity: when executing instructions we need to keep track of the stack depth and make sure instructions fail if the depth is not sufficient for a given instruction. This also creates some overhead (in the naive approach, we need to waste a range-check for every instruction, though, there are better, albeit more complicated approaches to handle this).
- Single-item shifts: to keep the AIR constraints manageable, we can shift the stack by at most one item per cycle.
- Relatively difficult handling of untrusted procedures: with untrusted code, we need to make sure that invoked procedure doesn't overwrite the portion of the stack it is not supposed to touch. This can be done by adding various guards to indicate how far down the stack can an invoked procedure access - but this adds complexity.

Pseudo-stack machine

Pseudo-stack machine inherits a lot of the pros and cons of the stack machine, but there are some differences.

Pros:
- Compact instruction format: we still get the same compact instruction format.
- Variable instruction inputs: we can still have many different instruction formats.
- No overflow table: we don't need to worry about overflow table and this reduces complexity and makes the VM a bit more efficient.
- Multi-item shifts: we can shift the stack by multiple items per cycle. For example, we could have a single DROPW instruction to drop top 4 items of the stack.
- No need to track stack depth: we can assume that all registers are always initialized to some values (e.g., initially hold zeros) - so, no need to check if the right number of operands is present on the stack. This also allows us to use instructions in more contexts. For example, we can always use SWAPW3, even if we don't have all 16 items on the stack.
Cons
- Stack manipulation instructions: we still have to deal with stack manipulations, though, being able to use large SWAP instructions makes this a tiny bit better.
- Relatively difficult handling of procedures: since items may "fall off" the stack never to return, we can't be sure that even a trusted procedure won't result in some data loss. Thus, we either need to save the data we care about into memory before invoking procedures, or have some way to provide guarantees that a given procedure won't shift the stack by more than n items during its execution.
  - However, if we can solve this for trusted code, we should be able to use the same approach for untrusted code as well.

Register machine

Here I'm assuming we have a set of general purpose registers (probably between 8 and 16), and instructions can reference any of these registers directly.

Pros:
- Relatively concise programs: there is no need for stack manipulation instructions, which can save a lot of VM cycles.
Cons:
- Small number of instruction formats: we can probably have a few instruction formats, but most instructions will read values from one or two registers, preform an operation, and put the result into a third register. This will make supporting complex instructions (e.g. MPVERIFY) rather complicated.
- Bigger instructions: we'll need to dedicate 3 - 4 bits per register reference, and thus, instruction size would need to be around 20 bits (or more). Thus, we'll be able to fit at most 3 instructions into a single field element.
- Parameter passing: passing parameters to/from procedures/functions will be more complicated as compared to the stack approach (e.g., there will be a need for managing stack frames either in-memory or in some other ways).

Memory machine

Since memory access is rather cheap in zkVM context, we can do away with registers entirely and work directly with memory. As with register machines, I'm assuming that each instruction can read values from one or two memory locations and write the result of the operation into a third memory location.

Pros:
- Concise programs: there is no need for stack manipulation instructions and no need to move values between memory and stack/registers. This can save even more VM cycles as compared to register machine approach.
Cons:
- Small number of instruction formats: similar to register machine, we can probably have just a handful of instruction formats, and handling complex instructions which work with a lot of values becomes rather complicated.
- Bigger instructions: we'll need to dedicate 16 bits or so per memory reference (these could be offsets relative to some updatable value), and thus, we can have at most 1 instruction per element.
- Parameter passing: passing parameters to/from procedures/functions will be more complicated as compared to the stack approach (e.g., there will be a need for managing stack frames either in-memory or in some other ways).

Power of two instruction

The VM currently does not support dynamic bitwise shifts. One of the reasons for this is that emulating shifts using other operations is relatively easy when we know how many bits to shift by statically. For example, u32shl is actually computed as PUSH(2^x) MUL U32SPLIT DROP where x is the number of bits to shift by and 2^x is computed at compile time.

One way to support dynamic shifts of u32 values (which can be used to support dynamic shifts of u256 values), is to compute 2^x in the VM where x is a value on the stack. The operation could be called U32POW2 (or something similar), and could work as follows:

[x, ... ] -> [2^x, ...]

This, of course, can be computed via repeated multiplication by two, but that would be rather expensive, and we should find a better way of supporting this operation.

One thing to note: even if we do have support for this operation, implementing u256 bitwise shifts is going to be far from trivial. So, other approaches to bitwise shifts may need to be explored.

swapdw assembly instruction

As mentioned in #65, we might introduce swapdw (swap double word) operation at the VM level at some point. It might be a good idea to have this operation supported in the assembly even before then. This operation would be useful for working with u256 values (as each u256 value is represented with 8 32-bit limbs).

Semantics of the operation would be as follows:

[a7, a6, a5, a4, a3, a2, a1, a0, b7, b6, b5, b4, b3, b2, b1, b0, ... ] ->
[b7, b6, b5, b4, b3, b2, b1, b0, a7, a6, a5, a4, a3, a2, a1, a0, ... ]

For now, we can implement this using our existing SWAPW operations.

Consider reformat of pushw assembly instruction

As discussed in the PR implementing parsing of pushw, it may make sense to allow the pushw instruction to take a single 32-byte hex value either instead of or in addition to the current approach of taking 4 inputs of 1 element each.

This is inspired by this contrived but strange example with the current implementation:
let op = Token::new("pushw.1.23.0x1C8.0", 0);

From Bobbin:
"now that I'm seeing how this actually looks, I wonder if we should change the format of pushw instruction to always take a single hexadecimal value of 32 bytes. Or maybe making this as an additional allowed format. So, for example, possible formats could be:

pushw.1.2.3.4
pushw.0x13244... (32 bytes)"

Implement handling of field operations

Currently, the assembler handles only a limited set of field operations. We should expand it to support all operations mentioned in the specs.

This would also require updating VM instruction set to support missing operations.

Implement comment parsing

Currently, the assembler does not support comments in the source code. We should implement parsing of comments according to the spec described here.

Program analysis tool

It would be cool if we had a way to analyze a given script. This would let us make informed judgements about scripts we may have in the standard library. Some interesting statistics that come to mind:

Number of VM cycles it takes to execute the script.
Number of program blocks in the script (for each block type).
Number of cycles per assembly instruction. E.g., a program executed u32add.full operation x times, and this translated to y VM cycles.

Program analysis task list (grjte edit):

Implement cryptographic operations

We should implement cryptographic operations in v0.2 assembler according to the specs described here.

This will also require updating VM instruction set to support missing operations.

Potential optimization for loading values from memory

Currently, at the VM level we have a single loadw operation which is used to load data from memory onto the stack. This operation is used as the basis for loadw.mem and pushw.mem assembly instructions. Specifically, loadw.mem translates directly to loadw operation, while pushw.mem pads the stack with zeros before executing loadw.

Thus, loadw.mem requires 1 VM cycle, while popw.mem required 6 VM cycles. We can make these a little more balanced by changing the semantics of loadw processor operation. Specifically, currently stack transition diagram for loadw looks as follows:

[addr, 0, 0, 0, 0, .... ] -> [a3, a2, a1, a0, ...]

where, addr is the memory address from which to load a word, and a0, a1 etc. are elements of the word located at that address. The net result is that the stack is shifted the left by one.

With the new approach we could do the following:

[addr, 0, 0,  .... ] -> [a3, a2, a1, a0, ...]

The net result would be that the stack is shifted to the right by one.

With this approach both loadw.mem and pushw.mem instruction can be translated to 4 VM operations:

loadw.mem: MOVDN2 DROP DROP LOADW
pushw.mem: PAD PAD MOVUP2 LOADW

The net effect is that we reduce cycle count of pushw.mem by 2 cycles while increasing cycle count for loadw.mem by 3 cycles. Thus, whether this optimization is worth it should be dictated by relative prevalence of pushw.mem vs. loadw.mem in real-world programs.

Management of local variables in memory

As described in the assembly specs, Miden VM has word-addressable read-write memory. There are a few load and store instructions which are responsible for moving data between the memory and the stack. These instructions take memory address from the top of the stack.

From code organization standpoint, Miden assembly supports procedures and (will support) functions. The difference between the two is as follows:

Procedures are in-lined into the program MAST at compilation time. Thus, semantically, procedures don't differ from the surrounding code. This, for example, means that they can access the same memory as surrounding context.
Functions will be used to represent "untrusted" procedures (one use cases for these is cross-contract calls). A called function won't have access to the caller's memory and vice versa (this also implies that all parameters are passed by value between functions via the stack, but that's a separate topic).

In the context of managing local variables, we have the following situations to consider:

Local variables for simple procedures (simple here means that all local variables can live on the stack).
Local variable for complex procedures.
Local variables for functions.

Handling local variables for simple procedures is straight forward - all variables live on the stack and the procedure doesn't need to allocate memory for storing locals. Similarly, handling local variables for functions is straight forward: each function gets its own memory and can place variables w/e it sees fit.

However, for complex procedures which can't fit all locals onto the stack, it is not immediately clear where they should store locals in memory to make sure one procedure doesn't overwrite locals of another procedure. There are several approaches to handle this.

Option 1

We can do something similar to what the EVM does: have a convention for putting the "free memory pointer" (fmp) somewhere in memory (in our case, probably at address 0x0) and then use this free memory pointer to indicate where a procedure could put its locals.

This is the simplest approach in terms of implementation (we don't need to change anything at the VM level), but it would introduce quite a bit of runtime overhead. Basically, each time we need to read or write a local variable to memory, we'd need to read and possibly overwrite fmp value. Thus, in the worst case, each access to a local variable may require 2 - 3 memory accesses and a bunch of instructions in-between. Overall, I'd expect moving local variables between stack and memory to take up dozens of VM cycles with this approach.

Option 2

We can still store fmp at address 0x0, but we would modify the semantics of load and store operations to implicitly read from this address, add the value at the top of the stack to the fmp and that would be the memory address for the operation. Basically, the memory address would be computed as: mem[0] + stack[0], but this would be done implicitly.

This would reduce the number of cycles we'd have to expand, but we would still need to read write fmp from memory which would add rows to the memory trace.

Option 3

We can have a dedicated register for fmp and would modify the semantics of load and store instructions to compute addresses as fmp + stack[0] (basically, top of the stack provides an offset relative to fmp). The value of fmp would be set to 0 at the start of the program.

The advantage of this approach is that we drastically cut down on the number of cycles consumed by load and store operations, and each load/store still translates to a single memory operation. The downsides are twofold:

We need to allocate extra register for fmp which would only be used for the purposes of load/store operations. I don't think this is a big deal.
We need to have a way to set and (possibly) get the value in fmp for this we would need to have 1 or 2 dedicated instructions.

The reason I said possibly for getting the value of fmp is because we might be able to get away with just setting the value using relative offsets. For example, if a procedure needed to allocate 3 words of memory for its locals it could do something like this:

push.3 setfmp

This would be equivalent to fmp ← fmp + 3. And then the procedure could store its locals at addresses fmp, fmp - 1, and fmp - 2.

Once the procedure is done, it could reset the value in fmp as follows:

push.3 neg setfmp

This would be equivalent to fmp ← fmp - 3. We could also change the format of push instruction to accept negative values to make this a bit more ergonomic.

Simple procedures which don't need to store locals in memory won't need to touch values in fmp at all.

One major drawback of options 2 and 3 is that managing absolute memory pointers becomes very tricky (and maybe even impossible). So, probably more thinking is needed here.

Question on Memory Addressing Scheme

Hi,

I was reading Miden Assembly specification's RAM addressing modes

In context of function local memory and absolute index based memory addressing, I was wondering how do I handle following depicted scenario.

# Imagine I've following Miden Assembly code, where I'd like to use local memory ( reserved while
# invoking `fn0` ) in `fn1` procedure, which is invoked from `fn0`
#
# i ) But following program instead accesses memory locations by absolute indices [ not desired ]
# ii ) If I declare `fn2`, instead of `fn1`, then it reserves another 2 VM words; lets me work with that [ not desired ]
#
# Question: How do I refer `fn{1,2}` to use local memory reserved while invoking their parent function `fn0` ?

    proc.fn2.2
        pushw.local.0
        pushw.local.1
    end

    proc.fn1.0
        pushw.mem.0
        pushw.mem.1
    end

    proc.fn0.2
        push.3.2.1.0
        push.7.6.5.4

        popw.local.1
        popw.local.0

        exec.fn1
    end

    begin
        exec.fn0
    end

Incorrect doc comments in the processor

The sys_ops file in the processor is mis-commented - it has i/o and push comments instead.

I'd be happy to fix if you want, but I know there's a lot of active work in the processor right now.

Range checks for u32 operations

As described in u32 operations note, these operations need to rely on a set of helper registers. This is not yet implemented because the helper registers will actually be located in the decoder which is currently not implemented.

Some of the helper registers are needed to perform 16-bit range checks. And while we can't yet populate these registers with needed values, we can make lookups into the RangeChecker for for these value (this can be done via RangeChecker::add_value() method).

At the high level, there are two things we should do:

For u32 operations which require range checks, we should make sure we add the required values to the RangeChecker. It might make sense to have some helper functions to handle this rather than duplicating this lookup code in every operation handler.
Once the above is done, executing u32 operations would result in the RangeChecker trace being populated. We should integrate this trace into the overall execution trace similar to how we did this for co-processor execution traces.

Step-by-step program execution

For debugging purposes, it would be great to have a way to execute a program one cycle at a time, and to be able to read the state of the VM at every cycle. One way to do this is described below:

First, we could create a separate function in the processor crate (where the execute() function is defined) to handle step-by-step execution. This function could be called execute_iter() and it would return an iterator over process states (we'll also need to define a struct which represents a process state).

Internally, execute_iter() could do the same thing as execute() but instead of converting a Process into an ExecutionTrace, it would iterate over the data contained in the process and return it for a given cycle. Basically, we would execute the program to the end (or until there is an error), and then start reading generated states one-by-one as a part of iteration process. This would be relatively "wasteful", but I think that for debug purposes its fine.

As mentioned above, the Process struct already contains most of the historical data that we'd need (e.g., it keeps track of how memory changed with every cycle). There are two things for which it keeps only the latest state (rather than all historical states):

Stack overflow table.
Advice provider.

I think we can tackle advice provider part at a later time, but it would be good to come up with a way to keep historical states of the stack overflow table. This will probably be relatively heavy - so, this functionality should be optional. Meaning, we'd use it only when executing programs from execute_iter() context, but not when using the regular execute() context.

Make order of input arrays consistent across all test modules

Currently, some test modules take input arrays in order while other modules require the inputs to be in the order they would be in after being pushed onto the stack, which is the reverse order.

We should make the order of input arrays consistent across all tests. Specifically, all test input arrays should be provided with data in order (not stack order), and we can update the test helper functions as required.

Changes are required in the following test modules:

All processor integration tests (not unit tests):

processor crypto_ops tests
processor field_ops tests
processor flow_control tests
processor u32 ops tests

Implement stack manipulation operations

We should implement stack manipulation operations in v0.2 assembler according to the specs described here.

The basic idea is that we already have parsers stubbed out for stack manipulation instructions here and we also have the VM instructions for stack manipulation defined here. Now, we need to map each assembly instruction to one or more VM instructions.

For example, on the assembly side, we have movupw.n instructions but there are no such instructions on the VM side. However, we can use SWAPW instructions to simulate movupw.n instructions as follows:

movupw.2 -> SWAPW SWAPW2
movupw.3 -> SWAPW SWAPW2 SWAPW3

For additional references see how parsing of field operations is implemented here.

u32 assembly parsers for shift/rotate ops fail with legal parameter value 0

The shift and and rotation operations in the assembly parsers are enforcing parameter bounds on the bit shift value of 1 <= b <= 31, but according to the assembly specs parameter value b = 0 should be allowed.

This should be fixed in parse_u32shl, parse_u32shr, parse_u32rotl, parse_u32rotr in the Assembly ops parsers.

Establish consistent pattern for testing exact errors

As discussed in PR #40 , we want tests to be checking for exact errors according to Rust best practices.

TL;DR: there are different options and no perfect consensus, so we should make an opinionated decision

Background

It seems like Rust actually doesn't handle error testing as ergonomically as most other things, so there isn't an opinionated "right answer." Most errors in Rust implement the PartialEq trait, which allows using assert_eq! in tests to check exact errors. However, a few kinds of errors don't, most notably std::io::Error. See the reference discussions below for more detailed reasoning, but it can be summarized by "It's an unfortunate ergonomic loss that io::Result instances can't be equated, but indicating that io::Error has an implementation of PartialEq is somewhat of a lie because the only type of error that can be equal is two OS errors"

Possible approaches to testing errors

Option 1: Implement PartialEq trait for custom errors

The info above suggests that either deriving PartialEq (where possible) or manually implementing it for custom error types and then using assert_eq! is an acceptable idiomatic option (as recommended in the referenced Stack Overflow discussion as well).

Option 2: Use `should_panic` or return a `Result<T,E>` from the test

The book presents 2 options:

use should_panic with an expectation string
give the test a Result return type and return the expected Error

This is possibly the "best practice" given that it's in the book, but it's not always fine-grained enough & many people seem dissatisfied with it. Additionally, both options mean that there will be strictly one error checked per unit test, which makes sense with the idea of a "unit test" but could cause a blowup in the number of tests, which may or may not be desirable (needs an opinionated decision).

should_panic can be improved upon by using catch_unwind which "allows you to put the code you expect to panic into a closure, and only panics emitted by that code will be considered expected (i.e. a passing test)." See the second SO reference for a discussion & links to examples of this approach being used in the Rust standard library. This approach cannot catch non-unwinding panics.

Final note: using should_panic can give different output depending on how it's executed. "As you can see, the output of the second should_panic test case contains a backtrace. However, if you run your tests with the simple cargo test <test_name> command, the output of the both test cases are the same. So, be careful when you configure your CI/CD system!" (see the reference from Yury Zhauniarovich)

Option 3: Use `assert_matches` crate

We could use this crate to test pattern matching. It gives more control than the second option without the overhead of implementing PartialEq. This was mentioned in the Rust discussions referenced below, but I haven't seen it mentioned otherwise.

References:

unit testing errors (PartialEq):
https://stackoverflow.com/questions/57234140/how-to-assert-io-errors-in-rust
testing with should_panic (catch_unwind):
https://stackoverflow.com/questions/26469715/how-do-i-write-a-rust-unit-test-that-ensures-that-a-panic-has-occurred
article on ins & outs of testing errors:
https://zhauniarovich.com/post/2021/2021-01-testing-errors-in-rust/

Discussions with the Rust team:

rust-lang/rust#34192
rust-lang/rust#34158

Move integration tests into a separate crate

We currently put all of the integration tests into the processor crate. This may not be the best place for them as the test a bunch of things besides validity of processor execution (e.g., assembly compilation, correctness of stlib procedures). In the future, we'll probably add more integration tests which go beyond just the processor (e.g., testing proof generation/verification process).

It might be a good idea to move integration tests into a separate crate to make the overall project structure cleaner and more intuitive.

u32testw parser does not output correct VM instruction sequence

As mentioned in the comment here current implementation of u32testw assembly instruction parser (parse_u32testw()) produces an incorrect instruction sequence. We could fix it as suggested in the comment, but there might be a better way.

In general, we want u32testw to leave 1 at the top of the stack if all 4 top stack elements are u32 values and 0 otherwise. Thus, we can test if a given element is a u32 (leaving 1 at the top of the stack if it is) and then use AND operation to combine these values. Thus, we don't need to have NOT operations after EQZ and then we can replace OR operations with AND. This should also reduce the cycle count needed for this operation by 4.

So, for example, to test the first element, we'd do:

span_ops.push(Operation::Dup3);
span_ops.push(Operation::U32split);
span_ops.push(Operation::Swap);
span_ops.push(Operation::Drop);
span_ops.push(Operation::Eqz);

Instead of the current:

span_ops.push(Operation::Dup3);
span_ops.push(Operation::U32split);
span_ops.push(Operation::Swap);
span_ops.push(Operation::Drop);
span_ops.push(Operation::Eqz);
span_ops.push(Operation::Not);

And for the second, we could do:

span_ops.push(Operation::Dup3);
span_ops.push(Operation::U32split);
span_ops.push(Operation::Swap);
span_ops.push(Operation::Drop);
span_ops.push(Operation::Eqz);
span_ops.push(Operation::And);

Instead of the current:

span_ops.push(Operation::Dup3);
span_ops.push(Operation::U32split);
span_ops.push(Operation::Swap);
span_ops.push(Operation::Drop);
span_ops.push(Operation::Eqz);
span_ops.push(Operation::Not);
span_ops.push(Operation::Or);

Local procedure memory overwrites memory from outer context

Tests added in #117 demonstrate how local procedure memory currently overwrites memory from the outer context.

    let script = compile(
        "
        proc.bar.2
            pop.local.0
            pop.local.1
        end
        begin
            pop.mem.0
            pop.mem.1
            pop.mem.2
            pop.mem.3
            exec.bar
            push.mem.3
            push.mem.2
            push.mem.1
            push.mem.0
        end",
    );
    let inputs = [1, 2, 3, 4, 5, 6, 7];

    test_script_execution(&script, &inputs, &[7, 6, 5, 4, 1]);

Instead of the expected final stack looking like [7, 6, 5, 4, 1, ...], the resulting stack is [7, 2, 3, 4, 1, ...], indicating that absolute memory at addresses 1 and 2 was overwritten by the locals at 1 and 0 respectively.

It looks like the reason for this is that fmp is only ever updated in the op_fmpupdate function, so local procedure memory for procedures called directly from the main context is always offset from 0.

Inconsistent assembly errors in the assembly parser for malformed operations

In validate_op_len we return AssemblyError::invalid_op when any part of the instruction is missing. Examples of this would be:

calling parse_adv with "adv" which is missing the push.n or loadw portion
calling parse_push with ""

In many of the parsers for the u32 operations (and possibly other places) we return AssemblyError::missing_param(op) in these cases. An example of this is calling parse_u32div with the op ""

Obviously the case with the u32 ops is trivial, but for consistency I think we should pick one error to use in all equivalent cases when the base instruction is malformed. Thoughts?

Better comment format

Currently, comments in Miden assembly must be surround by # characters. Thus, even for single line comments we must write:

# this is a comment #

It is a bit annoying that we need to terminate single line comments with #. The main motivation for doing so was ease of parsing, and in fact parsing this style of comments is very easy (they are stripped out from the source code as implemented here. But maybe there are fairly simple ways to strip out comments which terminate with a newline instead of # as well?

Implement input/output operations

We should implement input/output operations according to the specs described here.