Comments (6)

bobbinth commented on June 15, 2024

That makes sense, though I guess what I was wondering is if that changes at all with element-addressable memory vs word-addressable memory. In other words, would that change just make memory look element-addressable, but preserve how memory loads/stores work in the VM today (i.e. word-oriented, with some changes to the constraints as you described)? Or would it also involve reworking the implementation of memory in the VM so that it is element-oriented rather than word-oriented?

The approach I described above would still keep the internal mechanics of memory word-oriented. So, basically, it would just look element-addressable.

Obviously I'm looking at this through a very narrow lens, so the importance of doing things like memory accesses in word-sized units might have far reaching implications I'm not thinking about, but in an element-addressable world, it seems out of place to implement memory accesses in terms of words, particularly if mem_load/mem_store would require multiple helper registers that go essentially unused (beyond their role in the constraint system). It would obviously still be important to enforce word-alignment for mem_loadw/mem_storew, but I'm surprised it would be necessary to have the memory itself be word-oriented, if the word-alignment restriction ensures that there is always a full word at a given address.

The word-oriented design is done primarily for efficiency reasons. For the use cases we are targeting, it will be very common to read/write memory as full words rather than individual elements. For example, nodes in a Merkle tree are words, the output of the RPO hash function is a word (and inputs are 3 words), assets in the Miden rollup are words, etc. So, we frequently have a need to read or write a full word (or even two consecutive words), and I wanted to make this as efficient as possible.

Consider mem_stream for example. This instruction reads two consecutive words and overwrites the top 8 elements of the stack with the result in a single cycle. Implementing this instruction with element-addressable memory would not be possible in our current design (or we'd need to add extra columns, which would slow down the rest of the VM). And emulating this instruction with individual loads would probably require ~30 cycles.
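
To make the word-oriented access pattern concrete, here is a minimal Rust sketch of the idea (my own illustration, not the actual miden-vm implementation; Felt, Word, and Memory are stand-in names):

use std::collections::HashMap;

// Stand-ins for the VM's field element and word types (assumptions for
// this sketch, not the actual miden-vm definitions).
type Felt = u64;
type Word = [Felt; 4];

struct Memory {
    words: HashMap<u64, Word>, // word-addressable: one full word per address
}

impl Memory {
    // mem_stream-style access: fetch the words at addr and addr + 1 with
    // two map lookups and no element-level bookkeeping.
    fn read_double_word(&self, addr: u64) -> [Felt; 8] {
        let lo = self.words.get(&addr).copied().unwrap_or_default();
        let hi = self.words.get(&(addr + 1)).copied().unwrap_or_default();
        let mut out = [0; 8];
        out[..4].copy_from_slice(&lo);
        out[4..].copy_from_slice(&hi);
        out
    }
}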

bobbinth commented on June 15, 2024

Thank you for such a detailed proposal! Overall, I think this would add a very nice additional capability to the VM (i.e., having fully element-addressable memory could be useful in many circumstances). A few preliminary thoughts:

Implementation within the VM

I think implementing this within the VM will be quite tricky - but I also haven't spent a lot of time on this - so maybe there are good solutions which I haven't considered yet.

The main issue is that we actually always have to read a full word from a memory address; it is just that for mem_load we put the other 3 elements into helper registers (see here). So, to select a specific element from the word, we'd need to do the following:

  1. Put all 4 elements into helper registers.
  2. Split the address into 2 parts - this would require putting the lower limb into another helper register.
  3. Represent the upper limb with 2 binary registers (e.g., [0, 0] means 0, [0, 1] means 1 etc.). Then we'd be able to use these registers as selectors for selecting a specific value.

Summarizing the above: we need 7 helper registers, but currently we have only 6. I have some thoughts on how we can increase the number of helper registers to 8, but this would require a fairly significant refactoring of the decoder.
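
For concreteness, a rough Rust sketch of this decomposition (my own illustration under the proposal's assumption that the element index lives in the upper 32 bits of the address; not actual miden-vm code):

// Split an element address into the pieces that would occupy helper
// registers: the lower 32-bit limb (the word address) and the upper
// limb re-encoded as two binary selectors. Together with the 4 word
// elements from step 1, this accounts for all 7 helper registers.
fn decompose(addr: u64) -> (u32, u64, u64) {
    let lower = (addr & 0xffff_ffff) as u32; // lower limb -> helper register 5
    let upper = (addr >> 32) & 0b11;         // element index 0..=3
    let s0 = upper & 1;                      // selector bit -> helper register 6
    let s1 = (upper >> 1) & 1;               // selector bit -> helper register 7
    (lower, s0, s1)                          // [s1, s0]: [0, 0] means 0, [0, 1] means 1, etc.
}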

Another potential complication is that the selectors would be of degree 2, and thus the constraint for populating the top stack slot would be of degree 3. This means that MLOAD and MSTORE operation flags cannot have degree greater than 6 (so that flag degree plus constraint degree stays within the VM's overall degree bound). This shouldn't be a problem as we currently have some empty slots for degree 5 operations. But this is still something to consider for the future.

Instruction semantics

If I understood everything correctly, the semantics you propose would be useful when addresses are static (or known at compile time). But for dynamic addresses, we'd still run into problems. For example, if we wanted to iterate over a sequence of elements in a loop, we wouldn't be able to easily increment memory address. So, it seems to me the usefulness is relatively limited.

One other idea is to treat memory as if it was element-addressable, and then impose alignment requirements on mem_loadw and mem_storew operations. So, for example:

  • mem_load.0 would load the first element of the first word; mem_load.1 would load the second element of the first word; mem_load.4 would load the first element of the second word, etc.
  • mem_loadw.0 would load the first word, mem_loadw.4 would load the second word, but mem_loadw.1 would be illegal (see the sketch below).
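
A small sketch of this addressing scheme (my own illustration of the mapping implied by the examples above):

// Element-addressable view: every address names one element; a word
// occupies 4 consecutive addresses.
fn split(addr: u64) -> (u64, usize) {
    (addr / 4, (addr % 4) as usize) // (word address, element index within word)
}

// mem_loadw/mem_storew would additionally require word alignment.
fn word_addr(addr: u64) -> Result<u64, &'static str> {
    if addr % 4 == 0 {
        Ok(addr / 4)
    } else {
        Err("mem_loadw/mem_storew require a word-aligned address")
    }
}

For example, split(1) = (0, 1) corresponds to mem_load.1 loading the second element of the first word, while word_addr(1) is rejected just as mem_loadw.1 would be.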

I think this would be much cleaner, but there are two main downsides to this approach:

  • We still need to solve the implementation problem I described above.
  • This would probably break most programs which use memory operations.

bitwalker commented on June 15, 2024

If I understood everything correctly, the semantics you propose would be useful when addresses are static (or known at compile time). But for dynamic addresses, we'd still run into problems. For example, if we wanted to iterate over a sequence of elements in a loop, we wouldn't be able to easily increment memory address. So, it seems to me the usefulness is relatively limited.

I don't think that's necessarily true, since the element offsets for all 32-bit addresses are constant, i.e. adding 0 << 32, 1 << 32, 2 << 32, or 3 << 32 to any address gives you the "paired" address for the given element of that word.

If you wanted to traverse every element of n sequential words, completely dynamically, you'd do something like the following (using pseudocode):

let base_addr: *const Word = <the address in memory of the first word, doesn't have to be constant>;
for i in 0..n {
    let word_offset: u64 = i / 4;                // which word the i-th element lives in
    let element_offset: u64 = (i % 4) << 32;     // element index, encoded in the upper 32 bits
    let word_addr: u64 = base_addr as u64 + word_offset;
    let element_addr: u64 = word_addr | element_offset;
    let element = *(element_addr as *const Element);
    // ...do stuff with element...
}

That's useful in quite a few scenarios, certainly for those in which I was intending to make use of the proposed extension.

One other idea is to treat memory as if it was element-addressable, and then impose alignment requirements on mem_loadw and mem_storew operations

I'm definitely a fan of this, as it is much cleaner right from the get-go, and is similar to how, on say x86-64, certain instructions impose additional alignment requirements on their operands (e.g., SSE2 instructions), though as you mentioned it breaks backwards compatibility. That said, I'm not sure to what extent the memory ops are used currently, so perhaps it isn't too catastrophic. I would think memory gets used pretty frequently, for precomputed constant tables and things like that, but I have no idea.

We still need to solve the implementation problem I described above.

Is that true if addresses now refer to elements not words? I would think in that case, there is no longer a need for an offset, leaving things unchanged at the instruction level. Is it because the alignment requirements that would be imposed on mem_loadw and mem_storew require additional helper registers, much in the same way my proposal would?

frisitano commented on June 15, 2024

Having native element-addressable memory support at the VM level seems the superior option to me. There are a number of instances in the transaction kernel which would benefit from this functionality.

bobbinth commented on June 15, 2024

If you wanted to traverse every element of n sequential words, completely dynamically, you'd do something like the following (using pseudocode):

I didn't write out the actual assembly code for this, but doing a rough estimation I think we'd incur an extra 6-7 cycles per element read in this code. This is better than the 9-10 cycles per read using padw, push.ADDR, mem_loadw, drop, drop, swap, drop - but it is not dramatically better. So, there is still a question in my mind whether the extra effort is worth a ~30% improvement.

I'm not sure to what extent the memory ops are used currently, so perhaps it isn't too catastrophic.

Quite a few modules currently use memory operations, but almost all of them are written (and maintained) by us, and most of them are pretty small. There are two exceptions:

  • Transaction kernel, which is fairly complex and relies heavily on memory. Memory accesses there are fairly well encapsulated, and @frisitano believes that it shouldn't be too difficult to move to the new memory model.
  • Recursive verifier, which is also fairly complex and also relies on memory quite a bit. @Al-Kindi-0 also believes that we could migrate it to the new memory model (though the effort may be more significant than with the transaction kernel).

Is that true if addresses now refer to elements not words?

No, addresses right now refer to words and we can read/write from/to memory only full words. The way we get around this for mem_load and mem_store instructions is by putting the things which don't end up on the stack into helper registers. So, we still read (or write) a full word, but part of the word goes into (or comes from) the helper registers.
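
In other words, something like the following (a minimal sketch of the behavior just described, assuming - as with the current mem_load - that the first element of the word is the one returned; not the actual implementation):

type Felt = u64; // stand-in for the field element type

// Word-oriented mem_load: the full word is always read; one element
// lands on the stack, the other three go into helper registers purely
// to satisfy the constraint system.
fn mem_load(word: [Felt; 4]) -> (Felt, [Felt; 3]) {
    (word[0], [word[1], word[2], word[3]])
}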

As a side note, now that I look at the constraints of the mem_store operation, it seems to me that there might be a soundness bug there (i.e., a malicious prover could inject incorrect values into the temporary registers, and overwrite the other 3 elements at a given address with arbitrary values). @Al-Kindi-0 - what do you think? If so, we might have to move back to the approach we used previously, when each memory row in the memory chiplet contained both old and new values.

Coming back to the design for the new mem_load and mem_store operations. Their transition diagrams could look like so:

[transition diagram for MLOAD]

and

[transition diagram for MSTORE]

In the above:

  • $addr$ is the element address.
  • $waddr$ is the word address computed as $waddr = \lfloor addr / 4 \rfloor$.
  • $s_0$ and $s_1$ are a binary encoding of the element index within a word. Thus, we can impose a constraint for valid address decomposition as follows: $waddr \cdot 4 + 2 \cdot s_1 + s_0 = addr$. In addition to this, we'll need constraints to enforce that $s_0$ and $s_1$ are binary, and that $addr$ is a u32 value (not sure about the best way to do this yet). A quick worked example follows this list.
  • $val_0, ..., val_3$ are the values currently stored at $waddr$.
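
As a quick sanity check of this decomposition (my own example): for $addr = 6$ we get $waddr = 1$, $s_1 = 1$, $s_0 = 0$, and indeed $waddr \cdot 4 + 2 \cdot s_1 + s_0 = 4 + 2 + 0 = 6 = addr$.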

In the case of MLOAD, the constraint for $val$ would be:

$$ val = (1 - s_0) \cdot (1 - s_1) \cdot val_0 + s_0 \cdot (1 - s_1) \cdot val_1 + (1 - s_0) \cdot s_1 \cdot val_2 + s_0 \cdot s_1 \cdot val_3 $$
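
As a quick check (my own sketch, not from the thread) that this selector polynomial acts as a 4-way multiplexer - for binary $s_0$, $s_1$ exactly one term survives, and it picks $val_{s_0 + 2 s_1}$:

// Over the field, 1 - s is field subtraction; with binary s0, s1 and
// small values, plain u64 arithmetic is enough for this check.
fn mux(vals: [u64; 4], s0: u64, s1: u64) -> u64 {
    (1 - s0) * (1 - s1) * vals[0]
        + s0 * (1 - s1) * vals[1]
        + (1 - s0) * s1 * vals[2]
        + s0 * s1 * vals[3]
}

fn main() {
    let vals = [10, 20, 30, 40];
    for idx in 0..4u64 {
        let (s0, s1) = (idx & 1, idx >> 1);
        assert_eq!(mux(vals, s0, s1), vals[idx as usize]);
    }
}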

The constraints for MSTORE seem to be somewhat more complicated - so, I'll need to think more about them. But in either case, we need to have at least 1 extra helper register.

bitwalker commented on June 15, 2024

...I think we'd incur an extra 6-7 cycles per element read in this code. This is better than the 9-10 cycles per read...but it is not dramatically better. So, there is still a question in my mind whether the extra effort is worth a ~30% improvement.

I agree that the savings are relatively minimal, though I do think it's a matter of frequency - if such instructions occur frequently, the cycles saved start to add up fast for a program with any significant runtime. That said, the design here is a compromise for the sake of backwards compatibility, and is really more about ergonomics/uniformity (whether the assembly is hand-written or compiler-generated).

Overall though, I much prefer the approach of making memory element-addressable in general, since that seems like an option we can actually consider. The difference in cycle count for equivalent memory accesses between the two models would be night and day, I think, since it removes a lot of complexity around accessing individual elements. It does impose a bit more overhead when accessing words, but only in situations where the alignment of the address is not known and must be checked at runtime. In practice, accessing values smaller than a word in memory is going to be far more common than accessing word-sized (or larger) values - at least for code compiled through Wasm - so favoring those operations is going to have a bigger performance impact.

No, addresses right now refer to words and we can read/write from/to memory only full words. The way we get around this for mem_load and mem_store instructions is by putting the things which don't end up on the stack into helper registers. So, we still read (or write) a full word, but part of the word goes into (or comes from) the helper registers.

That makes sense, though I guess what I was wondering is if that changes at all with element-addressable memory vs word-addressable memory. In other words, would that change just make memory look element-addressable, but preserve how memory loads/stores work in the VM today (i.e. word-oriented, with some changes to the constraints as you described)? Or would it also involve reworking the implementation of memory in the VM so that it is element-oriented rather than word-oriented?

Obviously I'm looking at this through a very narrow lens, so the importance of doing things like memory accesses in word-sized units might have far reaching implications I'm not thinking about, but in an element-addressable world, it seems out of place to implement memory accesses in terms of words, particularly if mem_load/mem_store would require multiple helper registers that go essentially unused (beyond their role in the constraint system). It would obviously still be important to enforce word-alignment for mem_loadw/mem_storew, but I'm surprised it would be necessary to have the memory itself be word-oriented, if the word-alignment restriction ensures that there is always a full word at a given address.

Anyway, I want to avoid derailing the conversation here just because I don't have a solid grasp of how some of these pieces fit together in the VM internals, so if I'm missing obvious stuff here (or, obvious if you know the internals well), feel free to ignore the above. I'm certainly curious about the relationship between the memory instructions and their implementation in the VM (and how that affects the constraints), but I imagine it's probably been discussed/documented somewhere already, and I just need to go do some reading.
