0xpolygonmiden / miden-vm Goto Github PK

View Code? Open in Web Editor NEW

618.0 23.0 155.0 77.5 MB

STARK-based virtual machine

License: MIT License

Rust 99.83% Makefile 0.14% Shell 0.04%

cryptography virtual-machine stark zero-knowledge

miden-vm's Introduction

Miden Virtual Machine

A STARK-based virtual machine.

WARNING: This project is in an alpha stage. It has not been audited and may contain bugs and security flaws. This implementation is NOT ready for production use.

Overview

Miden VM is a zero-knowledge virtual machine written in Rust. For any program executed on Miden VM, a STARK-based proof of execution is automatically generated. This proof can then be used by anyone to verify that the program was executed correctly without the need for re-executing the program or even knowing the contents of the program.

If you'd like to learn more about how Miden VM works, check out the documentation.
If you'd like to start using Miden VM, check out the miden crate.
If you'd like to learn more about STARKs, check out the references section.

Status and features

Miden VM is currently on release v0.10. In this release, most of the core features of the VM have been stabilized, and most of the STARK proof generation has been implemented. While we expect to keep making changes to the VM internals, the external interfaces should remain relatively stable, and we will do our best to minimize the amount of breaking changes going forward.

The next version of the VM is being developed in the next branch. There is also a documentation for the latest features and changes in the next branch documentation next branch.

Feature highlights

Miden VM is a fully-featured virtual machine. Despite being optimized for zero-knowledge proof generation, it provides all the features one would expect from a regular VM. To highlight a few:

Flow control. Miden VM is Turing-complete and supports familiar flow control structures such as conditional statements and counter/condition-controlled loops. There are no restrictions on the maximum number of loop iterations or the depth of control flow logic.
Procedures. Miden assembly programs can be broken into subroutines called procedures. This improves code modularity and helps reduce the size of Miden VM programs.
Execution contexts. Miden VM program execution can span multiple isolated contexts, each with its own dedicated memory space. The contexts are separated into the root context and user contexts. The root context can be accessed from user contexts via customizable kernel calls.
Memory. Miden VM supports read-write random-access memory. Procedures can reserve portions of global memory for easier management of local variables.
u32 operations. Miden VM supports native operations with 32-bit unsigned integers. This includes basic arithmetic, comparison, and bitwise operations.
Cryptographic operations. Miden assembly provides built-in instructions for computing hashes and verifying Merkle paths. These instructions use the Rescue Prime Optimized hash function (which is the native hash function of the VM).
External libraries. Miden VM supports compiling programs against pre-defined libraries. The VM ships with one such library: Miden stdlib which adds support for such things as 64-bit unsigned integers. Developers can build other similar libraries to extend the VM's functionality in ways which fit their use cases.
Nondeterminism. Unlike traditional virtual machines, Miden VM supports nondeterministic programming. This means a prover may do additional work outside of the VM and then provide execution hints to the VM. These hints can be used to dramatically speed up certain types of computations, as well as to supply secret inputs to the VM.
Customizable hosts. Miden VM can be instantiated with user-defined hosts. These hosts are used to supply external data to the VM during execution/proof generation (via nondeterministic inputs) and can connect the VM to arbitrary data sources (e.g., a database or RPC calls).

Planned features

In the coming months we plan to finalize the design of the VM and implement support for the following features:

Recursive proofs. Miden VM will soon be able to verify a proof of its own execution. This will enable infinitely recursive proofs, an extremely useful tool for real-world applications.
Better debugging. Miden VM will provide a better debugging experience including the ability to place breakpoints, better source mapping, and more complete program analysis info.
Faulty execution. Miden VM will support generating proofs for programs with faulty execution (a notoriously complex task in ZK context). That is, it will be possible to prove that execution of some program resulted in an error.

Compilation to WebAssembly.

Miden VM is written in pure Rust and can be compiled to WebAssembly. Rust's std standard library is enabled as feature by default for most crates. For WASM targets, one can compile with default features disabled by using --no-default-features flag.

Concurrent proof generation

When compiled with concurrent feature enabled, the prover will generate STARK proofs using multiple threads. For benefits of concurrent proof generation check out benchmarks below.

Internally, we use rayon for parallel computations. To control the number of threads used to generate a STARK proof, you can use RAYON_NUM_THREADS environment variable.

Project structure

The project is organized into several crates like so:

Crate	Description
core	Contains components defining Miden VM instruction set, program structure, and a set of utility functions used by other crates.
assembly	Contains Miden assembler. The assembler is used to compile Miden assembly source code into Miden VM programs.
processor	Contains Miden VM processor. The processor is used to execute Miden programs and to generate program execution traces. These traces are then used by the Miden prover to generate proofs of correct program execution.
air	Contains algebraic intermediate representation (AIR) of Miden VM processor logic. This AIR is used by the VM during proof generation and verification processes.
prover	Contains Miden VM prover. The prover is used to generate STARK proofs attesting to correct execution of Miden VM programs. Internally, the prover uses Miden processor to execute programs.
verifier	Contains a light-weight verifier which can be used to verify proofs of program execution generated by Miden VM.
miden	Aggregates functionality exposed by Miden VM processor, prover, and verifier in a single place, and also provide a CLI interface for Miden VM.
stdlib	Contains Miden standard library. The goal of Miden standard library is to provide highly-optimized and battle-tested implementations of commonly-used primitives.
test-utils	Contains utilities for testing execution of Miden VM programs.

Performance

The benchmarks below should be viewed only as a rough guide for expected future performance. The reasons for this are twofold:

Not all constraints have been implemented yet, and we expect that there will be some slowdown once constraint evaluation is completed.
Many optimizations have not been applied yet, and we expect that there will be some speedup once we dedicate some time to performance optimizations.

Overall, we don't expect the benchmarks to change significantly, but there will definitely be some deviation from the below numbers in the future.

A few general notes on performance:

Execution time is dominated by proof generation time. In fact, the time needed to run the program is usually under 1% of the time needed to generate the proof.
Proof verification time is really fast. In most cases it is under 1 ms, but sometimes gets as high as 2 ms or 3 ms.
Proof generation process is dynamically adjustable. In general, there is a trade-off between execution time, proof size, and security level (i.e. for a given security level, we can reduce proof size by increasing execution time, up to a point).
Both proof generation and proof verification times are greatly influenced by the hash function used in the STARK protocol. In the benchmarks below, we use BLAKE3, which is a really fast hash function.

Single-core prover performance

When executed on a single CPU core, the current version of Miden VM operates at around 20 - 25 KHz. In the benchmarks below, the VM executes a Fibonacci calculator program on Apple M1 Pro CPU in a single thread. The generated proofs have a target security level of 96 bits.

VM cycles	Execution time	Proving time	RAM consumed	Proof size
2¹⁰	1 ms	60 ms	20 MB	46 KB
2¹²	2 ms	180 ms	52 MB	56 KB
2¹⁴	8 ms	680 ms	240 MB	65 KB
2¹⁶	28 ms	2.7 sec	950 MB	75 KB
2¹⁸	81 ms	11.4 sec	3.7 GB	87 KB
2²⁰	310 ms	47.5 sec	14 GB	100 KB

As can be seen from the above, proving time roughly doubles with every doubling in the number of cycles, but proof size grows much slower.

We can also generate proofs at a higher security level. The cost of doing so is roughly doubling of proving time and roughly 40% increase in proof size. In the benchmarks below, the same Fibonacci calculator program was executed on Apple M1 Pro CPU at 128-bit target security level:

VM cycles	Execution time	Proving time	RAM consumed	Proof size
2¹⁰	1 ms	120 ms	30 MB	61 KB
2¹²	2 ms	460 ms	106 MB	77 KB
2¹⁴	8 ms	1.4 sec	500 MB	90 KB
2¹⁶	27 ms	4.9 sec	2.0 GB	103 KB
2¹⁸	81 ms	20.1 sec	8.0 GB	121 KB
2²⁰	310 ms	90.3 sec	20.0 GB	138 KB

Multi-core prover performance

STARK proof generation is massively parallelizable. Thus, by taking advantage of multiple CPU cores we can dramatically reduce proof generation time. For example, when executed on an 8-core CPU (Apple M1 Pro), the current version of Miden VM operates at around 140 KHz. And when executed on a 64-core CPU (Amazon Graviton 3), the VM operates at around 250 KHz.

In the benchmarks below, the VM executes the same Fibonacci calculator program for 2²⁰ cycles at 96-bit target security level:

Machine	Execution time	Proving time	Execution %	Implied Frequency
Apple M1 Pro (16 threads)	310 ms	7.0 sec	4.2%	140 KHz
Apple M2 Max (16 threads)	280 ms	5.8 sec	4.5%	170 KHz
AMD Ryzen 9 5950X (16 threads)	270 ms	10.0 sec	2.6%	100 KHz
Amazon Graviton 3 (64 threads)	330 ms	3.6 sec	8.5%	265 KHz

Recursive proofs

Proofs in the above benchmarks are generated using BLAKE3 hash function. While this hash function is very fast, it is not very efficient to execute in Miden VM. Thus, proofs generated using BLAKE3 are not well-suited for recursive proof verification. To support efficient recursive proofs, we need to use an arithmetization-friendly hash function. Miden VM natively supports Rescue Prime Optimized (RPO), which is one such hash function. One of the downsides of arithmetization-friendly hash functions is that they are considerably slower than regular hash functions.

In the benchmarks below we execute the same Fibonacci calculator program for 2²⁰ cycles at 96-bit target security level using RPO hash function instead of BLAKE3:

Machine	Execution time	Proving time	Proving time (HW)
Apple M1 Pro (16 threads)	310 ms	94.3 sec	42.0 sec
Apple M2 Max (16 threads)	280 ms	75.1 sec	20.9 sec
AMD Ryzen 9 5950X (16 threads)	270 ms	59.3 sec
Amazon Graviton 3 (64 threads)	330 ms	21.7 sec	14.9 sec

In the above, proof generation on some platforms can be hardware-accelerated. Specifically:

On Apple M1/M2 platforms the built-in GPU is used for a part of proof generation process.
On the Graviton platform, SVE vector extension is used to accelerate RPO computations.

References

Proofs of execution generated by Miden VM are based on STARKs. A STARK is a novel proof-of-computation scheme that allows you to create an efficiently verifiable proof that a computation was executed correctly. The scheme was developed by Eli Ben-Sasson, Michael Riabzev et al. at Technion - Israel Institute of Technology. STARKs do not require an initial trusted setup, and rely on very few cryptographic assumptions.

Here are some resources to learn more about STARKs:

STARKs whitepaper: Scalable, transparent, and post-quantum secure computational integrity
STARKs vs. SNARKs: A Cambrian Explosion of Crypto Proofs

Vitalik Buterin's blog series on zk-STARKs:

Alan Szepieniec's STARK tutorials:

StarkWare's STARK Math blog series:

StarkWare's STARK tutorial:

STARK 101

License

This project is MIT licensed.

miden-vm's People

Contributors

Stargazers

Watchers

Forkers

22388o hustnn block-0x tankofzion mfaulk 1nsane-dev softwarearchaeologist grjte tnachen sojiritesh2003 itzmeanjan fumuran marcusbuffett willyrgf samuelschlesinger crypto-forks junjiequ123 reyadrahman d-dlteo lleyton woodstock-xkx sin7y aurumlabs w2k-star-forks timgestson bchyl bruceherve tohrnii aga7hokakological fpgaq deimosthemisthios nebula-aac zcloak-network frisitano cryptopunkchina diegomrsantos maxgillett isabella232 k-mkrops valpaq 4rgon4ut dominik1999 byteflyfunny amankrx doytsujin chrischiedo pinkdiamond1 oluwamuyiwa greenhat stskeeps al-kindi-0 tumtum23 laplacekorea vlopes11 benjaminbollen domsteil ruuderie njordhov drmatrixbmc aetheriaxai ekmett eltociear hackaugusto f-gate namnc huaweigu starkoracles matthiasgoergens bergkvist thearchiver violetsakura777 cryptohariseldon dmgolembiowski p0ry-g0n shyba rohinjoshi1 muhtasimtanmoy danijel-enoch hongyuanyang-uu wdshin manipulation44 huitseeker markus0x1 tonk-gg frisitano-polygon andrewmilson open-rollup guoshijiang qredek timmyjose-forks eightfilms kseniakoval789 alessandroitalili thrustchain idmanagement kathihub komomoze darknikki312 weikengchen leviathanbeak

miden-vm's Issues

Incorrect implementation of parse_u32div for u32div and u32div.b operations

The implementation of parse_u32div in the assembly u32 ops parser currently has 2 errors:

The parser accepts a parameter of 0 (u32div.0), which means that the operation fails during execution instead of during compilation.
The parser asserts in handle_arithmetic_operation that the top element after the processor's U32DIV op should be zero, which means that all divisions with non-zero remainder fail.

Debug operation

We need to implement the debug operation according to the specs described here. Implementing this operation would require:

Implementing an options struct for the Debug operation enum here.
Implement parsing of the assembly instruction (will need to be added here).
Implement processing the Debug operation here.

u32 assembly parsers for bitwise ops do not fail for out-of-bounds inputs

As specified by the assembly operations specs, bitwise u32 operations should fail when given inputs >= 2^32.

Currently, the following operations do not fail correctly:

u32and
u32or
u32xor

These errors should be fixed in the in the Bitwise helper in the processor crate.

Implement stack manipulation operations

We should implement stack manipulation operations in v0.2 assembler according to the specs described here.

The basic idea is that we already have parsers stubbed out for stack manipulation instructions here and we also have the VM instructions for stack manipulation defined here. Now, we need to map each assembly instruction to one or more VM instructions.

For example, on the assembly side, we have movupw.n instructions but there are no such instructions on the VM side. However, we can use SWAPW instructions to simulate movupw.n instructions as follows:

movupw.2 -> SWAPW SWAPW2
movupw.3 -> SWAPW SWAPW2 SWAPW3

For additional references see how parsing of field operations is implemented here.

Better handling of decorator operations

Currently, we have a single Operation type which encodes all operations which can be executed on the VM. These operations fall into two main categories:

Regular operations (e.g., ADD, MUL etc.)
Decorators (e.g., DEBUG, ADVICE). As opposed to regular operations, decorators do not advance the clock cycle of the VM, and do not affect the state of the deterministic VM components (e.g., stack, memory).

The idea behind decorators is that we can use this mechanism to provide "hints" to the VM for debugging and other purposes. However, the way the decorators are implemented currently, leads to a couple of undesired consequences.

First undesired consequence is that decorators are relatively "heavy". For example, the current size of an Operation is 48 bytes (ideally, it would be just one byte, though this ideal might be difficult to achieve).

Second undesired consequence is that decorators can affect program hash. Specifically, decorators are placed into span blocks and as long as there is at least one regular operation in the same block, all is fine. But if there are no regular operations, we basically have a block which consists of noops, and there is no way to eliminate it. This should happen relatively rarely - but this edge case is possible, and thus we need to handle it.

To sum up: we want to support decorators, but we want to maintain the following properties:

They should not affect the size of Operation type.
Presence of decorators should not affect program hash.

One way to achieve this is to introduce a separate Decorator type, and have a separate data structure which specifies which decorator is to be executed at a given clock cycles (this would be somewhat similar to how we handle this in v0.1). There might be performance implications to this. Specifically, on each cycle we'd need to check if there are decorators to be executed before executing a regular operation. My hope is that these performance implications would be close to negligible.

However, this still doesn't address the second issue. Consider the following example:

while.true
  <instructions>
end
debug
while.true
  <instructions>
end

The debug decorator here would have to be located in an empty span block, and this would mean that the hash of the program would be different from exactly the same program but without the debug decorator.

One potential solution to this is to also put decorators into other blocks (not just span blocks). But this may add a considerable amount of complexity.

Project Contributions

Hi there,

Is there a way to start contributing to Miden ? Excited about the project would love to get involved in some capacity

Best,

u32 assembly parsers for shift/rotate ops fail with legal parameter value 0

The shift and and rotation operations in the assembly parsers are enforcing parameter bounds on the bit shift value of 1 <= b <= 31, but according to the assembly specs parameter value b = 0 should be allowed.

This should be fixed in parse_u32shl, parse_u32shr, parse_u32rotl, parse_u32rotr in the Assembly ops parsers.

Power of two instruction

The VM currently does not support dynamic bitwise shifts. One of the reasons for this is that emulating shifts using other operations is relatively easy when we know how many bits to shift by statically. For example, u32shl is actually computed as PUSH(2^x) MUL U32SPLIT DROP where x is the number of bits to shift by and 2^x is computed at compile time.

One way to support dynamic shifts of u32 values (which can be used to support dynamic shifts of u256 values), is to compute 2^x in the VM where x is a value on the stack. The operation could be called U32POW2 (or something similar), and could work as follows:

[x, ... ] -> [2^x, ...]

This, of course, can be computed via repeated multiplication by two, but that would be rather expensive, and we should find a better way of supporting this operation.

One thing to note: even if we do have support for this operation, implementing u256 bitwise shifts is going to be far from trivial. So, other approaches to bitwise shifts may need to be explored.

unexpected stack underflow with u32split

TL;DR:

There seems to be an issue in one of the 2 following places:

a problem with how the stack depth is managed for operations that output more elements than they take as inputs
a bug in the check_depth requirement in op_u32split + a lack of documentation about stack requirements for that op

Description

According to the assembly ops specs here, which specify a stack input for u32cast of [a, ...] and stack output of [b, ...] I expected the following minimal execution test to pass, but it fails with a stack underflow error.

#[test]
fn u32cast() {
    let script = compile("begin u32cast end");

    let inputs = build_inputs(&[1]);
    let trace = execute(&script, &inputs).unwrap();
    let last_state = trace.last_stack_state();
    let expected_state = build_stack_state(&[1]);

    assert_eq!(expected_state, last_state);
}

Essentially what's happening is that when the processor does a u32split & attempts to set stack[1] with the low 32-bit values, it panics, because the current (and required) stack depth is only 1. If I add an extra input to the stack then everything works fine, and that input isn't touched.

Relevant backtrace:

stack backtrace:
   0: rust_begin_unwind
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:100:14
   2: miden_processor::stack::Stack::set
             at ./src/stack/mod.rs:99:9
   3: miden_processor::operations::u32_ops::<impl miden_processor::Process>::op_u32split
             at ./src/operations/u32_ops.rs:19:9

Question

This seems to be a bug either in the way the stack depth is managed for operations that output more elements than they take as inputs or a bug in the check_depth assertion in op_u32split. What gave me pause was that in the u32split unit test in the processor you're initializing the stack with 2 values, which suggests that maybe the current behaviour is intentional but wasn't documented yet (or is in a place I haven't seen).

What is supposed to be happening here?

Program analysis tool

It would be cool if we had a way to analyze a given script. This would let us make informed judgements about scripts we may have in the standard library. Some interesting statistics that come to mind:

Number of VM cycles it takes to execute the script.
Number of program blocks in the script (for each block type).
Number of cycles per assembly instruction. E.g., a program executed u32add.full operation x times, and this translated to y VM cycles.

Program analysis task list (grjte edit):

Add fmp register the processor with associated operations

As discussed in #73, to better support memory-based locals, we need to add fmp register to the processor. Operations needed to manipulate this register have been stubbed out in #83 - but we also need to implement them.

Additional advice sets

Advice sets (src) are used to provided structured non-deterministic inputs to the VM. Currently, the only concretely implemented advice set is a Merkle tree. But anything that meets the requirements of an advice set interface could also work. The requirements of the interface are roughly as follows:

An advice set must have a root.
It should be possible to retrieve a node from the advice set from a given depth and position.
It should be possible to retrieve a Merkle path from a given node to the root.
It should be possible to update a node of an advice set. Any such update should change the value of the root.

Some other options for advice sets include:

Sparse Merkle trees.
Collections of Merkle paths (all of which resolve to the same root).

Procedure context tracking in the processor

Currently, the processor is unaware of the procedure context of an executing program. That is, the processor doesn't know which procedure is being executed, which procedure called the current procedure etc. This limits usefulness of debug info. Thus, it would be very useful if the processor maintained a "procedure stack".

To make a process aware of which procedures are executing, we could introduce two new decorators. Something like:

ProcStart - which the assembler would add at the start of each procedure.
ProcEnd - which the assembler would add right before a procedure returns.

Then, the processor would push a procedure onto a stack when it sees ProcStart decorator in the instruction stream, and remove a procedure from the stack when it sees ProcEnd decorator. This should be very lightweight - so, I don't think we should worry about performance implications.

Inconsistent assembly errors in the assembly parser for malformed operations

In validate_op_len we return AssemblyError::invalid_op when any part of the instruction is missing. Examples of this would be:

calling parse_adv with "adv" which is missing the push.n or loadw portion
calling parse_push with ""

In many of the parsers for the u32 operations (and possibly other places) we return AssemblyError::missing_param(op) in these cases. An example of this is calling parse_u32div with the op ""

Obviously the case with the u32 ops is trivial, but for consistency I think we should pick one error to use in all equivalent cases when the base instruction is malformed. Thoughts?

Potential optimization for loading values from memory

Currently, at the VM level we have a single loadw operation which is used to load data from memory onto the stack. This operation is used as the basis for loadw.mem and pushw.mem assembly instructions. Specifically, loadw.mem translates directly to loadw operation, while pushw.mem pads the stack with zeros before executing loadw.

Thus, loadw.mem requires 1 VM cycle, while popw.mem required 6 VM cycles. We can make these a little more balanced by changing the semantics of loadw processor operation. Specifically, currently stack transition diagram for loadw looks as follows:

[addr, 0, 0, 0, 0, .... ] -> [a3, a2, a1, a0, ...]

where, addr is the memory address from which to load a word, and a0, a1 etc. are elements of the word located at that address. The net result is that the stack is shifted the left by one.

With the new approach we could do the following:

[addr, 0, 0,  .... ] -> [a3, a2, a1, a0, ...]

The net result would be that the stack is shifted to the right by one.

With this approach both loadw.mem and pushw.mem instruction can be translated to 4 VM operations:

loadw.mem: MOVDN2 DROP DROP LOADW
pushw.mem: PAD PAD MOVUP2 LOADW

The net effect is that we reduce cycle count of pushw.mem by 2 cycles while increasing cycle count for loadw.mem by 3 cycles. Thus, whether this optimization is worth it should be dictated by relative prevalence of pushw.mem vs. loadw.mem in real-world programs.

Implement comment parsing

Currently, the assembler does not support comments in the source code. We should implement parsing of comments according to the spec described here.

Refactor i/o assembly instructions

As discussed in #44 and in #73 we should refactor i/o assembly operations to follow the what.where format and also add new operations for handling memory-based local variables. Overall, the new i/o instruction set should be as follows:

Pushing values onto the stack:

push.xyz - similar to current push but should allow pushing multiple values (up to 8). For example, the following should be valid: push.1, push.1.2, push.1.2.0x3 etc.
pushw.0x... - this should push a single 32-byte value provided in hexadecimal format. Semantically, this should be equivalent to push.a.b.c.d where a is the first 8 bytes of the value, b is the next 8 bytes etc. (although, maybe we should reverse the order?).
push.evn.sdepth
pushw.mem.addr
pushw.local.idx where idx is the index of the local variable. using this instruction outside of a procedure should be an error.
push.adv.n where n can be up to 8.

Removing values from the stack:

popw.mem.addr
popw.local.idx where idx is the index of the local variable. using this instruction outside of a procedure should be an error.

Overwriting values on the stack:

loadw.mem.addr
loadw.local.idx where idx is the index of the local variable. using this instruction outside of a procedure should be an error.
loadw.adv

Save stack values without removing them:

storew.mem.addr
storew.local.idx where idx is the index of the local variable. using this instruction outside of a procedure should be an error.

To enable handling locals, we'll also need to update procedure parsers. The new procedure declaration format should include an optional specifier for the number of memory-based locals the procedure can access. For example:

proc.foo.2   # needs access to 2 memory-based locals
proc.bar     # same as proc.bar.0

If code within a procedure tries to access a local outside of the specified bounds, the assembler should return an error.

Panic in u32div operation processor

u32div operation processor currently incorrectly handles division by zero. It panics, but instead should return ExecutionError::DivideByZero.

We should also update relevant u32tests to expect an error instead of a panic.

Move integration tests into a separate crate

We currently put all of the integration tests into the processor crate. This may not be the best place for them as the test a bunch of things besides validity of processor execution (e.g., assembly compilation, correctness of stlib procedures). In the future, we'll probably add more integration tests which go beyond just the processor (e.g., testing proof generation/verification process).

It might be a good idea to move integration tests into a separate crate to make the overall project structure cleaner and more intuitive.

Range checks for u32 operations

As described in u32 operations note, these operations need to rely on a set of helper registers. This is not yet implemented because the helper registers will actually be located in the decoder which is currently not implemented.

Some of the helper registers are needed to perform 16-bit range checks. And while we can't yet populate these registers with needed values, we can make lookups into the RangeChecker for for these value (this can be done via RangeChecker::add_value() method).

At the high level, there are two things we should do:

For u32 operations which require range checks, we should make sure we add the required values to the RangeChecker. It might make sense to have some helper functions to handle this rather than duplicating this lookup code in every operation handler.
Once the above is done, executing u32 operations would result in the RangeChecker trace being populated. We should integrate this trace into the overall execution trace similar to how we did this for co-processor execution traces.

Establish consistent pattern for testing exact errors

As discussed in PR #40 , we want tests to be checking for exact errors according to Rust best practices.

TL;DR: there are different options and no perfect consensus, so we should make an opinionated decision

Background

It seems like Rust actually doesn't handle error testing as ergonomically as most other things, so there isn't an opinionated "right answer." Most errors in Rust implement the PartialEq trait, which allows using assert_eq! in tests to check exact errors. However, a few kinds of errors don't, most notably std::io::Error. See the reference discussions below for more detailed reasoning, but it can be summarized by "It's an unfortunate ergonomic loss that io::Result instances can't be equated, but indicating that io::Error has an implementation of PartialEq is somewhat of a lie because the only type of error that can be equal is two OS errors"

Possible approaches to testing errors

Option 1: Implement PartialEq trait for custom errors

The info above suggests that either deriving PartialEq (where possible) or manually implementing it for custom error types and then using assert_eq! is an acceptable idiomatic option (as recommended in the referenced Stack Overflow discussion as well).

Option 2: Use `should_panic` or return a `Result<T,E>` from the test

The book presents 2 options:

use should_panic with an expectation string
give the test a Result return type and return the expected Error

This is possibly the "best practice" given that it's in the book, but it's not always fine-grained enough & many people seem dissatisfied with it. Additionally, both options mean that there will be strictly one error checked per unit test, which makes sense with the idea of a "unit test" but could cause a blowup in the number of tests, which may or may not be desirable (needs an opinionated decision).

should_panic can be improved upon by using catch_unwind which "allows you to put the code you expect to panic into a closure, and only panics emitted by that code will be considered expected (i.e. a passing test)." See the second SO reference for a discussion & links to examples of this approach being used in the Rust standard library. This approach cannot catch non-unwinding panics.

Final note: using should_panic can give different output depending on how it's executed. "As you can see, the output of the second should_panic test case contains a backtrace. However, if you run your tests with the simple cargo test <test_name> command, the output of the both test cases are the same. So, be careful when you configure your CI/CD system!" (see the reference from Yury Zhauniarovich)

Option 3: Use `assert_matches` crate

We could use this crate to test pattern matching. It gives more control than the second option without the overhead of implementing PartialEq. This was mentioned in the Rust discussions referenced below, but I haven't seen it mentioned otherwise.

References:

unit testing errors (PartialEq):
https://stackoverflow.com/questions/57234140/how-to-assert-io-errors-in-rust
testing with should_panic (catch_unwind):
https://stackoverflow.com/questions/26469715/how-do-i-write-a-rust-unit-test-that-ensures-that-a-panic-has-occurred
article on ins & outs of testing errors:
https://zhauniarovich.com/post/2021/2021-01-testing-errors-in-rust/

Discussions with the Rust team:

rust-lang/rust#34192
rust-lang/rust#34158

Optimize the logic for inlining procedures

The way procedures are parsed and inlined causes extra noop operations to be added and can be optimized.

As noted in the specs, the rules for grouping operations inside of span blocks are:

the maximum number of operations per group is 9 (since a group is represented by a single 64-bit field element and each opcode is encoded with 7 bits)
the first group of a batch can start with any operation
for all other groups in a batch (groups 2-8), an operation carrying an immediate value (e.g. push.2) cannot be the first operation in the group. In this case, the VM must insert a noop in front of it.

Currently, it seems that noop insertion is done during procedure parsing and then again during procedure inlining, causing extra unnecessary noop instructions to be added.

Here are some (very contrived) examples to demonstrate:

Example - Procedure Without Locals

Assembly:

    proc.foo
        push.0 push.0 push.0 push.0
        drop drop drop drop
        push.3 push.7 mul
    end
    begin
        push.2 push.3 add
        exec.foo
    end

Compiled:

    begin
        span
            push(2) push(3) add
            pad pad pad pad
            drop drop drop drop
            push(3) noop push(7) mul
        end
    end

The noop is added during the parsing of the procedure where push(7) would be the 10th instruction and would therefore be the first instruction in a group. However, once the procedure is inlined, this noop is no longer necessary, because at this point the second group starts with drop. The batches in the compiled script seem to regroup the operations correctly such that the push(3) and push(7) from the procedure are no longer in different groups, as seen here:

Op Batches:

op_batches: [
        OpBatch { 
          ops: 
            [Push(BaseElement(2)), Push(BaseElement(3)), Add, Pad, Pad, Pad, Pad, Drop, Drop, Drop, Drop, Push(BaseElement(3)), Noop, Push(BaseElement(7)), Mul], 
          groups: 
            [BaseElement(1016745240677138690), BaseElement(2), BaseElement(3), BaseElement(241055074062), BaseElement(3), BaseElement(7), BaseElement(0), BaseElement(0)] 
        }]

Example - Procedure With Locals

Assembly:

        proc.foo.3 
            push.0 push.1 
            storew.local.2 
            add 
            push.0 
            loadw.local.2 
            mul 
            push.0 push.0 push.0 drop drop 
        end 
        begin 
            push.4 push.3 push.2 push.1 
            exec.foo 
        end

Compiled:

        begin 
            span 
                push(4) push(3) push(2) pad incr 
                push(3) fmpupdate 
                pad pad incr 
                push(2) neg fmpadd storew 
                add 
                pad 
                noop push(2) neg fmpadd loadw 
                mul 
                pad pad pad drop drop 
                noop push(3) neg fmpupdate 
            end 
        end

In this example, the first noop is added on line 9 during procedure parsing where the 1st operation of the second group of the procedure operations is push(2) (the first VM instruction for loadw.local.2). Once the procedure is inlined, the placement of this noop is no longer necessary. However, once all operations are combined, the second noop needs to be added, since, after the inlining, the push(2) on line 12 would be the first operation in a group. Notice that without the first unnecessary noop this second noop would not be required.

Implement handling of u32 operations

We should add support for u32 operations in v0.2 release according to the specs described here.

This will also require updating VM instruction set to support u32 operations.

Consider reformat of pushw assembly instruction

As discussed in the PR implementing parsing of pushw, it may make sense to allow the pushw instruction to take a single 32-byte hex value either instead of or in addition to the current approach of taking 4 inputs of 1 element each.

This is inspired by this contrived but strange example with the current implementation:
let op = Token::new("pushw.1.23.0x1C8.0", 0);

From Bobbin:
"now that I'm seeing how this actually looks, I wonder if we should change the format of pushw instruction to always take a single hexadecimal value of 32 bytes. Or maybe making this as an additional allowed format. So, for example, possible formats could be:

pushw.1.2.3.4
pushw.0x13244... (32 bytes)"

Stacking of auxiliary traces

The VM has several components traces of which should be "stacked" so that they all share the same set of execution trace columns. These components are:

Hasher
Bitwise processor
Memory controller

All of these components already implement fill_trace() methods which should help with the process. The overall procedure for building a trace is as follows:

First, we need to determine overall length of the execution trace (this would happen when we convert Process struct into the ExecutionTrace here). This length should be computed as be the max of the lengths of all segments of the trace. For the purposes of this issue, we care only about two segments: stack trace and auxiliary table trace.

Stack trace can be retrieved via Stack::trace_length() (which should probably be renamed into trace_len() for consistency). All components of the auxiliary table also expose trace_len(). So, the length should be: max(stack_len, hasher_len + bitwise_len + memory_len). And then we need to pad this length to the next power of two.

Once we have this length, we should allocate 18 vectors of this length, break it into fragments corresponding to lengths of each auxiliary component, and use fill_trace() to fill corresponding segment. We fill the fragments as follows:

First we fill the Hasher fragment. For this fragment values in column 0 are set to ZERO, and the remaining 17 columns contain the actual hasher trace.
Then we fill the Bitwsie fragment. For this fragment values in column 0 are set to ONE and values in column 1 are set to ZERO. The next 13 columns contain the actual bitwise processor trace. The remaining 3 columns are padded with ZERO
Then we fill the Memory fragment. For this fragment values in columns 0 and 1 are set to ONE, values in column 2 are set to ZERO. The next 14 columns contain the actual memory trace, and the last column is padded with ZERO.
If the above doesn't fill all the table completely, we need to pad the remaining rows. For padded rows values in the first 3 columns are set to ONE, and values in remaining columns are set to ZERO.

Even though the above is described sequentially - we can fill all fragments in parallel.

Also, the Bitwise trace is currently missing two columns (so, currently in the code the number of bitwise columns is 11 rather than 13). These are selector columns which we need to identify the bitwise operation being performed.

Standard library v0.1 content

Now that support for standard library imports has landed in #121 we should start discussing what we want to include into the initial release of stdlib. Here are my preliminary thoughts on the library content:

math
- u64 - arithmetic and bitwise operations with 64-bit unsigned integers.
- u128 - not sure if we need this, maybe skip for v0.1?
- u256 - needed to support EVM. we should discuss if we want to try to support all operations in the original release.
crypto
- hashes
  - blake3 - already being worked on in #77
  - sha256 - already being worked on in #123
  - Keccak256- probably defer to future versions.
- sigs (verification only)
  - horst - an instantiation of HORST signature scheme using default hash function of the VM. Could be a good way to test out mtree.get instruction.
  - schnorr - an instantiation of Schnorr signature using the default curve of the VM. Definitely delay till later.
evm - EVM-specific built-ins not covered above. Probably defer to future releases.

Anything else I'm missing?

Make order of input arrays consistent across all test modules

Currently, some test modules take input arrays in order while other modules require the inputs to be in the order they would be in after being pushed onto the stack, which is the reverse order.

We should make the order of input arrays consistent across all tests. Specifically, all test input arrays should be provided with data in order (not stack order), and we can update the test helper functions as required.

Changes are required in the following test modules:

All processor integration tests (not unit tests):

processor crypto_ops tests
processor field_ops tests
processor flow_control tests
processor u32 ops tests

Potential rationalization of bitwise operations

As I learned from @dlubarov, it is possible to emulate bitwise AND, OR, XOR operations using just one of these operations and some filed element operations. Specifically, assuming we have access to AND, the other two can be described as:

XOR(x, y) = x + y - 2 * AND(x, y)
OR(x, y) = x + y - AND(x, y)

Thus, we could eliminate 2 out of the following 3 operations: U32AND, U32OR, U32XOR. This wouldn't save us much in terms of trace length or constraint complexity, but it would allow us to remove 2 instructions from the instruction set if needed.

If we do go this route, it probably makes sense to keep XOR operation as the one natively supported by the VM and emulate the other two through it.

Instruction for computing absolute address of a local variable

In some situations it may be useful to get the absolute memory location of a local variable.

For example, let's say we have a procedure for multiplying two 256-bit integers (u256:mul) this procedure expects memory addresses of operands a and b to be on the top of the stack, and after it performs the multiplication, it will leave the result on the top of the stack (we could have passed operands a and b via the stack as well, but for the purpose of this example, we do it via memory).

Thus, to invoke u256:uml the caller needs to put absolute addresses of a and b onto the stack, but if a and b "live" in local memory of the caller, it is not clear how to do that (since we don't have an instruction for accessing the value of fmp directly).

We could add assembly instructions to access fmp and then the caller would be able to compute the absolute address, but i wonder whether having a more specialized instruction is more convenient. I'm thinking of something like this (though open to other name/format suggestions):

push.env.locaddr.i

where i is the index of the local. This instruction would read the fmp, subtract i from it, and push the result onto the stack.

The only downside of this I can think of is that it would enable potential abuse of fmp where a procedure could return absolute address of its local (which would be a logical error).

Stack cleanup before program exit

One of the conditions we currently impose on program execution is that after a program completes, there must be no items left in the stack overflow table (i.e., stack depth must be 16). It is probably a good idea to have the assembler take care of this. Specifically, at the end of the main script body the assembler could include a call to a procedure which make sure the extra items are dropped from the stack.

This procedure should probably be located in the standard library (maybe in std::sys module?), and it should ensure the following:

The depth of the stack after the procedure executes must be 16.
The top 16 items which were on the stack before the procedure was called must remain on the stack after the procedure exited.

swapdw assembly instruction

As mentioned in #65, we might introduce swapdw (swap double word) operation at the VM level at some point. It might be a good idea to have this operation supported in the assembly even before then. This operation would be useful for working with u256 values (as each u256 value is represented with 8 32-bit limbs).

Semantics of the operation would be as follows:

[a7, a6, a5, a4, a3, a2, a1, a0, b7, b6, b5, b4, b3, b2, b1, b0, ... ] ->
[b7, b6, b5, b4, b3, b2, b1, b0, a7, a6, a5, a4, a3, a2, a1, a0, ... ]

For now, we can implement this using our existing SWAPW operations.

Implement cryptographic operations

We should implement cryptographic operations in v0.2 assembler according to the specs described here.

This will also require updating VM instruction set to support missing operations.

Question on Memory Addressing Scheme

Hi,

I was reading Miden Assembly specification's RAM addressing modes

In context of function local memory and absolute index based memory addressing, I was wondering how do I handle following depicted scenario.

# Imagine I've following Miden Assembly code, where I'd like to use local memory ( reserved while
# invoking `fn0` ) in `fn1` procedure, which is invoked from `fn0`
#
# i ) But following program instead accesses memory locations by absolute indices [ not desired ]
# ii ) If I declare `fn2`, instead of `fn1`, then it reserves another 2 VM words; lets me work with that [ not desired ]
#
# Question: How do I refer `fn{1,2}` to use local memory reserved while invoking their parent function `fn0` ?

    proc.fn2.2
        pushw.local.0
        pushw.local.1
    end

    proc.fn1.0
        pushw.mem.0
        pushw.mem.1
    end

    proc.fn0.2
        push.3.2.1.0
        push.7.6.5.4

        popw.local.1
        popw.local.0

        exec.fn1
    end

    begin
        exec.fn0
    end

Implement handling of field operations

Currently, the assembler handles only a limited set of field operations. We should expand it to support all operations mentioned in the specs.

This would also require updating VM instruction set to support missing operations.

Single-element memory operations

Current assembly instructions for memory access deal with words (4 field elements). This creates quite a bit of over head for use case when we want to read/write a single field element. The overhead is in both places: the VM itself and the assembly instruction sequences needed to accomplish single-element reads/writes.

Adding dedicated operations to read/write single elements at the VM level is possible, but we should probably do it at a later date, once we have a better picture of how frequently these operations are used. However, we could add instructions at the assembly level to help with this. The instructions I'm thinking of are:

push.mem.addr - pushes the first element of the word at addr onto the stack.
push.local.idx - same as above but for a local.
pop.mem.addr- pops the element off the stack and saves it addr.
pop.local.idx- same as above but for a local.

Standard library organization

Given that there are already things on the horizon which can go into standard library (e.g., #77), I think it might be a good time to think about how to organize standard library.

I'm inclined to keep things as simple as possible (at least at this point) - so, a few assumptions:

Let's not worry about user-defined libraries at this point. We can come back to these later.
Let's assume that standard library is made up only of procedures as described in assembly specs. We can differentiate between library and local procedures at a later point.
Let's assume that the library will relatively small.

The way I'm currently thinking about is that the assembler will load the full standard library at instantiation time. For this, we'll need to build a procedure map (similar to procedure map for local procedures) in the assembler constructor. Then, when a script is being parsed, the assembler can look up a given procedure in this map and inline it as it does with local procedures.

A few preliminary questions to consider:

Should standard library live in assembly crate or its own crate? I'm inclined to think that in its own crate (e.g., stdlib).
What extensions should we use for assembly files? I'm thinking .ma (Miden assembly) - but open to other suggestions.
Should we support module hierarchies? It seems like we should - but what's the simplest way to do it?
- This is probably by far the most complicated part. If there are modules, can some modules import procedures from other modules? How do we handle module resolution? etc.
Should we have something similar to use statements in Rust? Or should standard library procedures be always referred to by fully qualified names? e.g., exec.crypto:hashes:blake3 vs use std:crypto:hashes:blake3 and then in the code exec.blake3.
How does the assembler load the standard library? Would it build one giant string of all the code in the library and parse it in one go? Or would it do so file by file? And if file by file, what determines the order?

Potential instruction to help with u32 safety checking

For most u32 operations the "safe" variants are much more expensive than the "unsafe" variations - frequently at least 5x or even 10x more expensive. This might make the safe variants much less desirable to use. One of the reason for this is that asserting that a value is a u32 value is emulated by a sequence of 3 operations: U32SPLIT EQZ ASSERT. Thus, many instructions which work with 2 operands, asserting that the operands are 32 bit values requires 7 or 8 operations.

We could make safe operations much more efficient by introducing U32ASSERT2 operation in the VM. This operation would check whether the top two items on the stack are u32 values and if either one of them is not, return an error. Thus, we'd be able to do most safe operations at just 2x the cost of the unsafe ones.

From AIR constraint standpoint it shouldn't be too difficult to support this - i.e., this would require 4 16-bit range checks, which we need to support for other operations anyway (e.g. u32 mul).

Step-by-step program execution

For debugging purposes, it would be great to have a way to execute a program one cycle at a time, and to be able to read the state of the VM at every cycle. One way to do this is described below:

First, we could create a separate function in the processor crate (where the execute() function is defined) to handle step-by-step execution. This function could be called execute_iter() and it would return an iterator over process states (we'll also need to define a struct which represents a process state).

Internally, execute_iter() could do the same thing as execute() but instead of converting a Process into an ExecutionTrace, it would iterate over the data contained in the process and return it for a given cycle. Basically, we would execute the program to the end (or until there is an error), and then start reading generated states one-by-one as a part of iteration process. This would be relatively "wasteful", but I think that for debug purposes its fine.

As mentioned above, the Process struct already contains most of the historical data that we'd need (e.g., it keeps track of how memory changed with every cycle). There are two things for which it keeps only the latest state (rather than all historical states):

Stack overflow table.
Advice provider.

I think we can tackle advice provider part at a later time, but it would be good to come up with a way to keep historical states of the stack overflow table. This will probably be relatively heavy - so, this functionality should be optional. Meaning, we'd use it only when executing programs from execute_iter() context, but not when using the regular execute() context.

Hash operation consistency

There are currently some inconsistencies in how we've implemented hashing operations. Specifically:

In the VM, RPPERM to compute permute(A, B, C), expects the stack to be arranged as [A, B, C, ...] where A is at the top of the stack.
In the assembly, rpperm and rphash operations assume that the stack is arranged as [C, B, A, ...] where C is the capacity portion of the sponge and is located at the top of the stack. In case of rphash we assume that once RPPERM is executed, the word A will contain the result of the hash.

One option to address this is to simply fix the implementation of RPPERM in the VM. However, it might be a good idea to have the capacity portion of the sponge deep in the stack (the way it is currently arranged for RPPERM). This way, when we compute a linear hash of many elements, we don't need to touch capacity elements (word C) as we update the outer portion of the sponge (words A and B).

At the same time, the arrangement [A, B, C, ...] is contrary to how we arrange stack for most other operations and might be confusing. For example, as we compute linear hash of many elements, we'd first need to populate the word A and only then the word B with new data - which will be awkward to do since A is at the top of the stack.

The other option is to modify how the Rp64_256 hash function actually works. Specifically, we can change it so that if the state is [A, B, C], it is expected that A is the capacity portion, and B is expected to contain the result of the hash after the permutation.

So, to summarize the options:

Fix RPPERM operation in the VM. This is the simplest option but would make linear hashes of many elements slightly less efficient.
Update assembly instruction rphash to work with the current implementation of RPPERM. This accepts the awkwardness of the current stack arrangement of RPPERM and also would make computing linear hashes of many elements less efficient.
Update Rp64_256 hash function to work as described above, and update RPPERM and related assembly instructions accordingly. The stack would be arranged as: [C, B, A, ...]. This is a more involved option, but perhaps the cleanest and most efficient.

u32testw parser does not output correct VM instruction sequence

As mentioned in the comment here current implementation of u32testw assembly instruction parser (parse_u32testw()) produces an incorrect instruction sequence. We could fix it as suggested in the comment, but there might be a better way.

In general, we want u32testw to leave 1 at the top of the stack if all 4 top stack elements are u32 values and 0 otherwise. Thus, we can test if a given element is a u32 (leaving 1 at the top of the stack if it is) and then use AND operation to combine these values. Thus, we don't need to have NOT operations after EQZ and then we can replace OR operations with AND. This should also reduce the cycle count needed for this operation by 4.

So, for example, to test the first element, we'd do:

span_ops.push(Operation::Dup3);
span_ops.push(Operation::U32split);
span_ops.push(Operation::Swap);
span_ops.push(Operation::Drop);
span_ops.push(Operation::Eqz);

Instead of the current:

span_ops.push(Operation::Dup3);
span_ops.push(Operation::U32split);
span_ops.push(Operation::Swap);
span_ops.push(Operation::Drop);
span_ops.push(Operation::Eqz);
span_ops.push(Operation::Not);

And for the second, we could do:

span_ops.push(Operation::Dup3);
span_ops.push(Operation::U32split);
span_ops.push(Operation::Swap);
span_ops.push(Operation::Drop);
span_ops.push(Operation::Eqz);
span_ops.push(Operation::And);

Instead of the current:

span_ops.push(Operation::Dup3);
span_ops.push(Operation::U32split);
span_ops.push(Operation::Swap);
span_ops.push(Operation::Drop);
span_ops.push(Operation::Eqz);
span_ops.push(Operation::Not);
span_ops.push(Operation::Or);

Incorrect doc comments in the processor

The sys_ops file in the processor is mis-commented - it has i/o and push comments instead.

I'd be happy to fix if you want, but I know there's a lot of active work in the processor right now.

Management of local variables in memory

As described in the assembly specs, Miden VM has word-addressable read-write memory. There are a few load and store instructions which are responsible for moving data between the memory and the stack. These instructions take memory address from the top of the stack.

From code organization standpoint, Miden assembly supports procedures and (will support) functions. The difference between the two is as follows:

Procedures are in-lined into the program MAST at compilation time. Thus, semantically, procedures don't differ from the surrounding code. This, for example, means that they can access the same memory as surrounding context.
Functions will be used to represent "untrusted" procedures (one use cases for these is cross-contract calls). A called function won't have access to the caller's memory and vice versa (this also implies that all parameters are passed by value between functions via the stack, but that's a separate topic).

In the context of managing local variables, we have the following situations to consider:

Local variables for simple procedures (simple here means that all local variables can live on the stack).
Local variable for complex procedures.
Local variables for functions.

Handling local variables for simple procedures is straight forward - all variables live on the stack and the procedure doesn't need to allocate memory for storing locals. Similarly, handling local variables for functions is straight forward: each function gets its own memory and can place variables w/e it sees fit.

However, for complex procedures which can't fit all locals onto the stack, it is not immediately clear where they should store locals in memory to make sure one procedure doesn't overwrite locals of another procedure. There are several approaches to handle this.

Option 1

We can do something similar to what the EVM does: have a convention for putting the "free memory pointer" (fmp) somewhere in memory (in our case, probably at address 0x0) and then use this free memory pointer to indicate where a procedure could put its locals.

This is the simplest approach in terms of implementation (we don't need to change anything at the VM level), but it would introduce quite a bit of runtime overhead. Basically, each time we need to read or write a local variable to memory, we'd need to read and possibly overwrite fmp value. Thus, in the worst case, each access to a local variable may require 2 - 3 memory accesses and a bunch of instructions in-between. Overall, I'd expect moving local variables between stack and memory to take up dozens of VM cycles with this approach.

Option 2

We can still store fmp at address 0x0, but we would modify the semantics of load and store operations to implicitly read from this address, add the value at the top of the stack to the fmp and that would be the memory address for the operation. Basically, the memory address would be computed as: mem[0] + stack[0], but this would be done implicitly.

This would reduce the number of cycles we'd have to expand, but we would still need to read write fmp from memory which would add rows to the memory trace.

Option 3

We can have a dedicated register for fmp and would modify the semantics of load and store instructions to compute addresses as fmp + stack[0] (basically, top of the stack provides an offset relative to fmp). The value of fmp would be set to 0 at the start of the program.

The advantage of this approach is that we drastically cut down on the number of cycles consumed by load and store operations, and each load/store still translates to a single memory operation. The downsides are twofold:

We need to allocate extra register for fmp which would only be used for the purposes of load/store operations. I don't think this is a big deal.
We need to have a way to set and (possibly) get the value in fmp for this we would need to have 1 or 2 dedicated instructions.

The reason I said possibly for getting the value of fmp is because we might be able to get away with just setting the value using relative offsets. For example, if a procedure needed to allocate 3 words of memory for its locals it could do something like this:

push.3 setfmp

This would be equivalent to fmp ← fmp + 3. And then the procedure could store its locals at addresses fmp, fmp - 1, and fmp - 2.

Once the procedure is done, it could reset the value in fmp as follows:

push.3 neg setfmp

This would be equivalent to fmp ← fmp - 3. We could also change the format of push instruction to accept negative values to make this a bit more ergonomic.

Simple procedures which don't need to store locals in memory won't need to touch values in fmp at all.

One major drawback of options 2 and 3 is that managing absolute memory pointers becomes very tricky (and maybe even impossible). So, probably more thinking is needed here.

Better comment format

Currently, comments in Miden assembly must be surround by # characters. Thus, even for single line comments we must write:

# this is a comment #

It is a bit annoying that we need to terminate single line comments with #. The main motivation for doing so was ease of parsing, and in fact parsing this style of comments is very easy (they are stripped out from the source code as implemented here. But maybe there are fairly simple ways to strip out comments which terminate with a newline instead of # as well?

u64 module for stdlib

As mentioned #125, we should add a module for 64-bit unsigned integer operations to the standard library. Initially, it should probably include the following procedures:

add
sub
mul
div
mod,
eq
eqz
lt
lte
gt
gte
and
or
xor
shl - might be difficult without #126
shr - might be difficult without #126

Also, for most of these (excluding bitwise operations), we probably should do the _unsafe variants.

Local procedure memory overwrites memory from outer context

Tests added in #117 demonstrate how local procedure memory currently overwrites memory from the outer context.

    let script = compile(
        "
        proc.bar.2
            pop.local.0
            pop.local.1
        end
        begin
            pop.mem.0
            pop.mem.1
            pop.mem.2
            pop.mem.3
            exec.bar
            push.mem.3
            push.mem.2
            push.mem.1
            push.mem.0
        end",
    );
    let inputs = [1, 2, 3, 4, 5, 6, 7];

    test_script_execution(&script, &inputs, &[7, 6, 5, 4, 1]);

Instead of the expected final stack looking like [7, 6, 5, 4, 1, ...], the resulting stack is [7, 2, 3, 4, 1, ...], indicating that absolute memory at addresses 1 and 2 was overwritten by the locals at 1 and 0 respectively.

It looks like the reason for this is that fmp is only ever updated in the op_fmpupdate function, so local procedure memory for procedures called directly from the main context is always offset from 0.

Implement input/output operations

We should implement input/output operations according to the specs described here.

This will also require updating VM instruction set to support missing operations.

Project Status Question

Hi there,

Is Miden ready to start developing on for developers or would it be advisable to wait for a later release ? Excited about the project but just wanted to get a sense of how stable the APIs are and what release version is targeted at new developers.

Thanks so much :)

Best,

Memoization of hash execution traces

Currently, Hasher component of the processor always builds traces for hash computations from scratch. This happens even in cases when computing hashes of the same values more than once.

We can improve on this by keeping track of the hashes which have already been computed, and just copying the sections of the trace with minimal modifications. Specifically, the only thing that needs to be updated when computing hash for previously hashed values is the row address column of the trace - everything else would remain the same.

This can also be done at a higher level. For example, we could keep track of sections of a trace used for Merkle path verification and then, if the same Merkle path was verified more than once, we can just copy the relevant sections of the trace (again, with minimal modifications).

Machine architecture for Miden VM

Miden VM is currently a stack machine, and while we are fairly far down the path of implementing it as such, it's worth listing pros and cons of different architectures, and taking another look at the alternatives.

I have the following architectures in my consideration set:

Stack machine: all instructions work with the top of the stack, and the stack can grow fairly large (unlimited, for all practical purposes).
Pseudo-stack machine: has a fixed set of registers (say 16 or 20) which behave similar to a stack - i.e., instructions still work with the top of the stack, but when items "fall off" the stack they are lost forever.
Register machine: instructions works with a set of general purpose registers - i.e., instructions read from one or more registers and write to one (or more) registers.
Memory machine: instructions read and write directly from memory.

With all of the above, we assume that there is also random access memory and instructions can move data btween stack/registers and memory.

The sections below attempt to list pros and cons of each approach, though, the lists are probably not going to be exhaustive.

Stack machine

Pros:
- Compact instruction format: instructions can be encoded with very few bits. Specifically, we need just 6 or 7 bits to encode a single instruction. This means we can pack 9 or 10 instructions into a single field element, and this greatly improves efficiency of computing program hash (a single permutation of Rescue prime can hash 70 - 80 instructions).
- Variable instruction inputs: it is fairly easy to have many different instruction formats, and have instructions which consume lots of inputs. For example, we have MPVERIFY instruction which consumes 10 field elements as inputs and MRUPDATE instruction which takes 14 field elements as inputs.
- Relatively easy handling of trusted procedures: when executing trusted code, we can be sure that the code will consume items at the top of the stack and items deep in the stack will still be there when the procedure returns. This also makes passing parameters to and reading return values from procedures pretty easy (everything is always at the top of the stack).
Cons:
- Stack manipulation instructions: we need to continuously move data at the top of the stack so that it is arranged just right for a given instruction. This leads to two cons:
  - First, we need to have lots of instructions which manipulate the stack. In the current instruction set, almost half of all instructions are there to move data on the stack.
  - Second, the programs become longer because a big part of each program is taken up by stack manipulation. It doesn't matter too much in terms of program size (given how compact the instructions are, most programs are still probably going to be smaller than with other alternatives), but it does waste VM cycles.
- Overflow table: since we can't keep all of the stack in registers, we need to have an overflow table. This table adds some complexity to AIR constraint and also adds about 8 - 10 extra columns to the execution trace (this is roughly 10% of the total number of columns).
- Checking instruction validity: when executing instructions we need to keep track of the stack depth and make sure instructions fail if the depth is not sufficient for a given instruction. This also creates some overhead (in the naive approach, we need to waste a range-check for every instruction, though, there are better, albeit more complicated approaches to handle this).
- Single-item shifts: to keep the AIR constraints manageable, we can shift the stack by at most one item per cycle.
- Relatively difficult handling of untrusted procedures: with untrusted code, we need to make sure that invoked procedure doesn't overwrite the portion of the stack it is not supposed to touch. This can be done by adding various guards to indicate how far down the stack can an invoked procedure access - but this adds complexity.

Pseudo-stack machine

Pseudo-stack machine inherits a lot of the pros and cons of the stack machine, but there are some differences.

Pros:
- Compact instruction format: we still get the same compact instruction format.
- Variable instruction inputs: we can still have many different instruction formats.
- No overflow table: we don't need to worry about overflow table and this reduces complexity and makes the VM a bit more efficient.
- Multi-item shifts: we can shift the stack by multiple items per cycle. For example, we could have a single DROPW instruction to drop top 4 items of the stack.
- No need to track stack depth: we can assume that all registers are always initialized to some values (e.g., initially hold zeros) - so, no need to check if the right number of operands is present on the stack. This also allows us to use instructions in more contexts. For example, we can always use SWAPW3, even if we don't have all 16 items on the stack.
Cons
- Stack manipulation instructions: we still have to deal with stack manipulations, though, being able to use large SWAP instructions makes this a tiny bit better.
- Relatively difficult handling of procedures: since items may "fall off" the stack never to return, we can't be sure that even a trusted procedure won't result in some data loss. Thus, we either need to save the data we care about into memory before invoking procedures, or have some way to provide guarantees that a given procedure won't shift the stack by more than n items during its execution.
  - However, if we can solve this for trusted code, we should be able to use the same approach for untrusted code as well.

Register machine

Here I'm assuming we have a set of general purpose registers (probably between 8 and 16), and instructions can reference any of these registers directly.

Pros:
- Relatively concise programs: there is no need for stack manipulation instructions, which can save a lot of VM cycles.
Cons:
- Small number of instruction formats: we can probably have a few instruction formats, but most instructions will read values from one or two registers, preform an operation, and put the result into a third register. This will make supporting complex instructions (e.g. MPVERIFY) rather complicated.
- Bigger instructions: we'll need to dedicate 3 - 4 bits per register reference, and thus, instruction size would need to be around 20 bits (or more). Thus, we'll be able to fit at most 3 instructions into a single field element.
- Parameter passing: passing parameters to/from procedures/functions will be more complicated as compared to the stack approach (e.g., there will be a need for managing stack frames either in-memory or in some other ways).

Memory machine

Since memory access is rather cheap in zkVM context, we can do away with registers entirely and work directly with memory. As with register machines, I'm assuming that each instruction can read values from one or two memory locations and write the result of the operation into a third memory location.

Pros:
- Concise programs: there is no need for stack manipulation instructions and no need to move values between memory and stack/registers. This can save even more VM cycles as compared to register machine approach.
Cons:
- Small number of instruction formats: similar to register machine, we can probably have just a handful of instruction formats, and handling complex instructions which work with a lot of values becomes rather complicated.
- Bigger instructions: we'll need to dedicate 16 bits or so per memory reference (these could be offsets relative to some updatable value), and thus, we can have at most 1 instruction per element.
- Parameter passing: passing parameters to/from procedures/functions will be more complicated as compared to the stack approach (e.g., there will be a need for managing stack frames either in-memory or in some other ways).

Update u32 ops proptests in the processor to only test valid cases

Given that proptest may need different sampling for valid vs. invalid inputs, it makes sense to handle these separately.

As discussed, we've decided to handle error cases in manual tests and only test validity in the randomized tests.

This should be updated in the processor u32 tests here

Maybe invalid cases can be handled in manual tests? And the test validity of operations over valid inputs using randomized tests?

Originally posted by @bobbinth in #75 (comment)