aswaterman commented on August 16, 2024

Okay, I'll try to run that experiment soon. I have a flow for measuring the impact on static code size for SPEC 2006, which isn't perfect, but at least it's a lot of code.

cliffordwolf commented on August 16, 2024

I've been thinking a lot about this over the last two days. Here are some thoughts.

A *WU instruction would get rid of a single PACK whenever we convert (in C, typecast) a 32-bit unsigned int to a 64-bit int (signed or unsigned). In this case we could use *WU instead of *W for a 32-bit DAG, and could replace the final ALU+PACK with just one final ALU op.
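
(For reference: PACK with a zero operand, as it is used throughout this thread, acts as a plain 32-bit zero-extend. A minimal C model of just that idiom, assuming RV64:)

#include <stdint.h>

/* model of the PACK-with-zero idiom: keep the low 32 bits, clear the upper 32 */
uint64_t zext_w(uint64_t rs)
{
        return (uint32_t)rs;
}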

You might ask why I'm talking about DAGs here instead of just the final operation before the cast to a 64 bit datatype. It's because of this:

uint32_t a = some_32_bit_expression + 3;
uint32_t b = another_32_bit_expression * a;
uint32_t c = a | b;
uint64_t x = c;

right now that would be:

ADDIW a0, a0, 3
MULW a1, a1, a0
OR a0, a0, a1
PACK a0, zero, a0

and with *WU:

ADDIWU a0, a0, 3
MULWU a1, a1, a0
OR a0, a0, a1

So what I'm trying to illustrate here is that this is really all-or-nothing. There are certain instructions, such as OR, that simply preserve the W-ness or WU-ness of their arguments, and therefore didn't get their own *W variants. Because of this we can't just implement *WU for the most common ops that appear directly before a typecast. We have to provide *WU instructions for essentially all OP opcodes that do not preserve WU-ness. That means *WU would cost a lot of encoding space. But of course it also makes the HW decoder implementation simpler.
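
To make that concrete, a minimal self-contained C check (my illustration; "WU-ness" meaning the upper 32 bits are zero):

#include <assert.h>
#include <stdint.h>

int main(void)
{
        /* both inputs have the WU property: upper 32 bits are zero */
        uint64_t a = 0xFFFFFFFFu, b = 1;
        assert(((a | b) >> 32) == 0);  /* OR preserves the property */
        assert(((a ^ b) >> 32) == 0);  /* so do XOR and AND */
        assert(((a + b) >> 32) != 0);  /* ADD can carry into bit 32 and break it */
        return 0;
}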

So far so good.

But we can't mess with the RISC-V calling convention, of course. So whenever we pass a uint32_t to another function we still need to make sure it's properly sign extended.

So let's add a function call:

uint32_t f32(uint32_t);
uint32_t a = some_32_bit_expression + 3;
uint32_t b = another_32_bit_expression * a;
uint32_t c = (a | b) ^ f32(a);
uint64_t x = c;

right now that would be:

ADDIW a0, a0, 3
MULW a1, a1, a0
OR s1, a0, a1
CALL f32
XOR a0, s1, a0
PACK a0, zero, a0

And maybe I'm just not trying hard enough, but I can't see any way *WU could help make that sequence shorter (assuming f32() is an external function that the compiler can't see into).

Same of course if we are processing function arguments:

uint64_t foo(uint32_t a, uint32_t b) {
  uint64_t x = a, y = b;
  return x + y;
}

Right now:

foo:
PACK a0, zero, a0
PACK a1, zero, a1
ADD a0, a0, a1
RET

And there's nothing *WU can do for us here, afaict.

Similarly, using the same 32-bit value signed in one expression and unsigned in another can mean we can't actually reduce the code size with *WU.

Idk. Am I focusing on the wrong cases? To me it looks like a compressed PACK-zero instruction would be more helpful than the entire set of *WU instructions (but idk if this is frequent enough to warrant spending a compressed instruction on it).

@asb Can you add something from the compiler perspective to this discussion?

@aswaterman Can you maybe post a few lines on what added value *WU instructions would need to deliver in order to warrant the encoding space, in your opinion?

aswaterman commented on August 16, 2024

Hey @cliffordwolf, I've been meaning to get engaged in this discussion but haven't made time for it yet.

As @jhauser-us alluded to on the other thread, it's not the case that I'm strongly advocating for these instructions. It's just that their potential utility has crossed my mind, and if they are indeed worthwhile, it seems pragmatic to package them with the B extension. They deserve the same quantitative evaluation as any other compiler-targetable candidate instruction in the B extension.

My intuition is that ADD[I]WU and SUBWU are the most interesting candidates, but I agree with @jhauser-us that it's worth going through the exercise of laying out the opcode map for the entire set of candidate instructions.

I agree with you that PACK can do most of the heavy lifting.

My main motivation is to improve address-calculation sequences, because they are ubiquitous and can have a material impact on dynamic instruction count in loop-based code. For this reason, I don't concur with the "all-or-nothing" assessment: these sequences are almost never terminated in a bitwise operation, for example.

Here's a somewhat contrived example where ADDIWU helps but, as far as I can see, PACK does not:

uint64_t *array = foo();
uint32_t index = bar();
return array[index + 1];

The address-calculation part of that code compiles to

addiw a0, a0, 1
slli a0, a0, 32
srli a0, a0, 29
add a0, s0, a0

whereas instead it could compile to

addiwu a0, a0, 1
slli a0, a0, 3
add a0, s0, a0

Without the "+ 1", PACK would work equally well, of course.
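
The equivalence is easy to check in C, with the assumed ADDIWU semantics (32-bit add, result zero-extended):

#include <assert.h>
#include <stdint.h>

/* assumed ADDIWU semantics: add in 32 bits, zero-extend the result to 64 bits */
static uint64_t addiwu(uint64_t rs1, int32_t imm)
{
        return (uint32_t)(rs1 + imm);
}

int main(void)
{
        uint64_t s0 = 0x12340000u;  /* array base */
        uint64_t a0 = 0xFFFFFFFFu;  /* index; worst case, the 32-bit sum wraps */

        /* current sequence: addiw; slli 32; srli 29; add */
        int64_t w = (int32_t)(uint32_t)a0 + 1;  /* addiw sign-extends its sum */
        uint64_t cur = s0 + (((uint64_t)w << 32) >> 29);

        /* proposed sequence: addiwu; slli 3; add */
        uint64_t prop = s0 + (addiwu(a0, 1) << 3);

        assert(cur == prop);
        return 0;
}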

Later today, I'll try to look over some real code sequences where I believe there is a benefit.

Two more notes in the meantime:

  • I don't think SLL[I]WU helps in address-calculation sequences, because the cast occurs prior to the shift. More generally, it could be beneficial to have instructions that zero-extend their inputs instead of their outputs (see the sketch after this list). Of course, such instructions might add datapath cost, whereas the *WU instructions are essentially costless in that respect.

  • I am happy to volunteer time to conduct GCC-based compiler experiments for these instructions. I can also help with evaluating other aspects of the B extension (but only for the operations that the GCC middle-end already understands).
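
To make the input-vs-output distinction from the first note concrete, a hedged C sketch of the two shift flavors (names assumed):

#include <stdint.h>

/* output zero-extend (SLLIWU): shift first, then zero-extend the low 32 bits */
uint64_t slliwu(uint64_t rs1, int shamt)
{
        return (uint32_t)(rs1 << shamt);
}

/* input zero-extend (SLLI_RS1WU): zero-extend rs1 first, then shift at full width */
uint64_t slli_rs1wu(uint64_t rs1, int shamt)
{
        return (uint64_t)(uint32_t)rs1 << shamt;
}

Only the second form matches the address-calculation pattern, because the cast comes before the shift.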

cliffordwolf commented on August 16, 2024

Hey @aswaterman

My main motivation is to improve address-calculation sequences, because they are ubiquitous and can have a material impact on dynamic instruction count in loop-based code. For this reason, I don't concur with the "all-or-nothing" assessment: these sequences are almost never terminated in a bitwise operation, for example.

Here's a somewhat contrived example where ADDIWU helps but, as far as I can see, PACK does not:

That's very helpful. Thanks.

My intuition is that ADD[I]WU and SUBWU are the most interesting candidates

I've been playing around in Compiler Explorer a bit and I can see what you mean.

To me it looks like ADD[I]WU, SUBWU, and SLLI/ADD variants with one "WU input" would be the most useful.

int8_t f(int8_t array[][3], uint32_t index) {
   return array[index][0];
}

This currently compiles to:

  pack a1,zero,a1
  slli a2,a1,1
  add a1,a1,a2
  add a0,a0,a1

and could be:

  slli_rs1wu a2,a1,1
  add_rs1wu a1,a1,a2
  add a0,a0,a1

I agree with @jhauser-us that it's worth going through the exercise of laying out the opcode map for the entire set of candidate instructions.

In this context I would propose that the "entire set of candidate instructions" isn't necessarily a complete WU-mirror of W, but instead a richer set of variations (WU-inputs and/or WU-outputs) over a smaller set of underlying instructions. Most of the instructions in the current B spec I would not consider candidates. I think MINUWU/MAXUWU is the only one that would make sense to consider.

cliffordwolf commented on August 16, 2024

More instructions worth debating imo: MULI MULIW MULIWU MULWU

Edit: Wait... that list is wrong. We need the _RS1WU versions, I think. You know what I mean.

aswaterman commented on August 16, 2024

The add[i]/sll[i]/sub variants that zero-extend their rs1 inputs are worth considering. I still want the add[i]/sub variants that zero-extend their outputs instead, but probably not the corresponding sll[i] ones.

For completeness I'll also mention the following option, though it is less attractive for datapath design reasons. We could instead provide shift-by-small-constant-then-add instructions, i.e., rs1 + (rs2 << {0, 1, 2, 3}). Alpha has this feature in the form of S4ADDQ and S8ADDQ, and x86 has it in the form of LEA. We could consider adding three such instructions (zero-extend rs1, sign-extend rs1, and pass rs1 through unmodified).
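
As a C sketch of that option, with the shift amount fixed to 2 and hypothetical names for the three rs1 flavors:

#include <stdint.h>

uint64_t sh2add(uint64_t rs1, uint64_t rs2)       /* rs1 passed through */
{
        return rs1 + (rs2 << 2);
}

uint64_t sh2add_zext(uint64_t rs1, uint64_t rs2)  /* rs1 zero-extended */
{
        return (uint64_t)(uint32_t)rs1 + (rs2 << 2);
}

uint64_t sh2add_sext(uint64_t rs1, uint64_t rs2)  /* rs1 sign-extended from bit 31 */
{
        return (uint64_t)(int32_t)(uint32_t)rs1 + (rs2 << 2);
}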

MULI is harder to swallow, given that it will perform worse than short shift-and-add sequences in some implementations.

cliffordwolf commented on August 16, 2024

MULI is harder to swallow, given that it will perform worse than short shift-and-add sequences in some implementations.

I was thinking about MULI* in addition to the add and shift instructions, for the cases where the MUL is unavoidable.

int8_t f(int8_t array[][52], uint32_t index) {
   return array[index][0];
}

Right now:

  pack a1, zero, a1
  li a2,52
  mul a1,a1,a2
  add a1,a0,a1

Could be:

  muli_rs1wu a1,a1,52
  add a1,a0,a1
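
(The assumed semantics of muli_rs1wu, as a C sketch: zero-extend rs1[31:0], then multiply by the immediate.)

#include <stdint.h>

uint64_t muli_rs1wu(uint64_t rs1, int64_t imm)
{
        return (uint64_t)(uint32_t)rs1 * (uint64_t)imm;
}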

So I guess I'm proposing to consider something like the following instructions:

ADDWU
SUBWU
ADDIWU
ADD_RS1WU
SLLI_RS1WU
MUL_RS1WU
MULI
MULIW
MULI_RS1WU

The encoding space requirement for all the MULI* scares me a bit. But I would guess that all three of them would be used quite a lot in address-calculation code.

aswaterman commented on August 16, 2024

My intuition is that MULI etc. won't have sufficient benefit to justify their encoding space, especially since the constant load can be hoisted out of a loop. But, as usual, we should favor the quantitative approach over the intuitive one. I can help run experiments on these instructions, too.

cliffordwolf commented on August 16, 2024

I agree, both with your intuition that it's probably not worth it, and that we should favor the quantitative approach over the intuitive one.

I can help run experiments on these instructions, too.

A quick initial experiment, just patching the compiler a bit to get a rough estimate of how much use we'd get out of those instructions, would help a lot.

If that initial experiment already shows negative results we probably don't need to bother with MULI any further. Otherwise I'd be happy to include them in the B draft so we can collect more data in a more systematic way.

Anything to add to the remaining list of candidates?

ADDWU
SUBWU
ADDIWU
ADD_RS1WU
SLLI_RS1WU
MUL_RS1WU

I think those three *_RS1WU should cover every case where we want to do something after the unsigned conversion to 64-bit and before the final LB/LW/SW/etc.

I'd guess ADDWU, SUBWU, and ADDIWU cover most of the cases where we would want to merge the conversion into the output node of the expression within [...], but we should probably consider more instructions. Obvious additional candidates imo are MULWU and SLL[I]WU, possibly even MINWU/MAXWU, but realistically I wouldn't expect any of them to be worth it.

aswaterman commented on August 16, 2024

Here are some initial results. The static instruction count savings is probably understated, because there are more tricks GCC could be taught. The static code size savings should be taken with a grain of salt because GCC doesn't do a wonderful job at RVC-aware instruction selection.

  • A ZEXT.W instruction (e.g. PACK) saves 0.27% static instructions and 0.11% static code size. (The code-size savings is smaller because PACK only reduces code size when one of the shifts isn't compressible.)
  • An SLLI_RS1WU instruction saves an additional 0.21% static instructions and 0.13% static code size (total 0.48% / 0.24%)
  • ADDIWU, ADDWU, and SUBWU collectively save an additional 0.10% static instructions and 0.17% static code size, with ADDIWU accounting for a significant majority. (Presumably the code-size savings exceeds the instruction-count savings because these instructions changed some other code-generation decision.)
  • An SLLIWU instruction has almost no discernible effect.
  • ADDI_RS1WU and ADD_RS1WU are not much better--less than 0.01% in both metrics.
  • I did not look at any of the MUL instructions, because they account for under 0.2% of static instructions in SPEC. Given that only a few percent could be optimized, they wouldn't make a measurable difference in these static metrics.

One way of thinking about this is that ZEXT.W provides only about half of the benefit that SLLI_RS1WU, ADDIWU, ADDWU, and SUBWU collectively provide.

cliffordwolf commented on August 16, 2024

Thanks for performing the test. I think that helps a lot.

Right now in the B draft spec, C.NOT operates on rd, not rd'. I've been thinking about changing this for RV64/RV128 so that it operates on rd' there, freeing encoding space to add a compressed ZEXTW (RV64/RV128) and ZEXTD (RV128). Your thoughts?

They say never do math in public. Especially when using dodgy approximating assumptions. I'll ignore that advice for now. Large likelihood of facepalm-worthy math mistakes ahead...

(Those percentages are all very close to 0%, so I'll just act as if percentages add, i.e. I'll assume a 0.5% reduction on top of a 0.5% reduction is a 1% change, when in fact it's only a 0.9975% change.)

ZEXT.W instruction (e.g. PACK) saves 0.27% static instructions

Since one ZEXTW replaces two shifts, I take this to mean that 0.27% of the static instructions in the compiler output are ZEXTW instructions.

Assuming 50% of all instructions are compressed, the average instruction is 3 bytes, so 0.27% static instructions that are all uncompressed (4 bytes) account for approx 0.36% of static code size (0.27 × 4/3 = 0.36). Compressing them halves each ZEXTW from 4 bytes to 2, so adding C.ZEXTW should further reduce code size by 0.18%.

So combined with the gain from an uncompressed ZEXTW, adding ZEXTW+C.ZEXTW should get us a static code size improvement of 0.29%.

SLLI_RS1WU instruction saves an additional 0.21% static instructions and 0.13% static code size

If we compress ZEXTW, then combining ZEXTW+SLLI is 50% less effective. So I guess this would then only be a 0.065% improvement in static code size.

ADDIWU, ADDWU, and SUBWU collectively save an additional 0.10% static instructions and 0.17% static code size

The same argument as above gives us a 0.085% improvement for ADDIWU, ADDWU, SUBWU.

So in summary, only looking at code size reduction:

Extension      Uncompressed ZEXTW   Compressed ZEXTW
ZEXT           0.110%               0.290%
SLLI_RS1WU     0.130%               0.065%
ADDWU          0.170%               0.085%

And cumulative:

Extension               Uncompressed ZEXTW   Compressed ZEXTW
ZEXT                    0.110%               0.290%
ZEXT+SLLI_RS1WU         0.240%               0.355%
ZEXT+SLLI_RS1WU+ADDWU   0.410%               0.440%

Or, assuming SLLI_RS1WU and ADDWU are independent:

Extension               Uncompressed ZEXTW   Compressed ZEXTW
ZEXT                    0.110%               0.290%
ZEXT+ADDWU              0.280%               0.375%
ZEXT+ADDWU+SLLI_RS1WU   0.410%               0.440%

with ADDIWU accounting for a significant majority.

I think C.ZEXTW+ADDIWU (maybe even without ADDWU/SUBWU) sounds like an interesting option. It's fairly simple, easy to implement, and gets us pretty close to the results of the "everything" option.

I'd prefer only-ADDIWU over only-SLLI_RS1WU because it's the simpler instruction, easier to implement, and possibly slightly more effective. ADDIWU requires more encoding space than SLLI_RS1WU, but on the other hand it's a very simple encoding.

PS: To be honest I'm kinda disappointed that SLLI_RS1WU seems to perform so well in your test. It's such an ugly instruction. So I might be a bit biased against poor SLLI_RS1WU. :)

jnk0le commented on August 16, 2024

PS: To be honest I'm kinda disappointed that SLLI_RS1WU seems to perform so well in your test. It's such an ugly instruction. So I might be a bit biased against poor SLLI_RS1WU. :)

I think that might be mostly related to this issue; it is equivalent to about 2-3 instructions in an indirect-addressing scenario.

I'm curious how those WU and RS1WU instructions would compare to:

a) just bfxp48, but with a final rs2 addition instead of insertion
b) an R3-type instruction that somehow fits a 32-bit encoding and behaves similarly to ARMv7 UXTA/SXTA, but with an extra small-constant left shift before adding to rs2

aswaterman commented on August 16, 2024

Add-with-small-shift will increase latency in some implementations, so it's a harder sell.

SLLI_RS1WU is ungainly for sure. Should look at dynamic instruction counts before ruling it out, though.

C.NOT targeting rd instead of rd’ seems like overkill. Having a ZEXT there will probably help more.

cliffordwolf commented on August 16, 2024

ADDI_RS1WU and ADD_RS1WU are not much better--less than 0.01% in both metrics.

That result kind of puzzles me. Is it possible that this code base contains no byte arrays? (char arrays? strings?) I would have expected ADD_RS1WU to be used for those.

void example(char *dst, char *src) {
  unsigned i = 0;
  /* copy the string, XOR-ing each byte with 6 */
  while (src[i]) dst[i] = src[i] ^ 6, i++;
  dst[i] = 0;
}

(ADDI_RS1WU I would have expected to be worthless.)

aswaterman commented on August 16, 2024

Yeah, I'm not sure what the explanation is. The scarcity of byte arrays surely has something to do with it. In your particular example, GCC makes a strange choice, but if I simplify the example to use only one array, it does use ADD_RS1WU. So it's not that the compiler is broken.

cliffordwolf commented on August 16, 2024

(I'm now calling the instructions that zero-extend at the input *UW and the ones that zero-extend at the output *WU. Not perfect, but prettier than *_RS1WU.)

My current thinking is to include the following instructions in the draft spec and then see what a more thorough evaluation will show:

ADDIWU
ADDWU, SUBWU  (?)
ADDUW, RSUBUW
SLLIUW
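
For reference, a C sketch of the semantics I have in mind for these (the exact operand convention for RSUBUW is still up for debate, see below):

#include <stdint.h>

/* *WU: operate, then zero-extend the 32-bit result */
uint64_t addiwu(uint64_t rs1, int32_t imm)  { return (uint32_t)(rs1 + imm); }
uint64_t addwu(uint64_t rs1, uint64_t rs2)  { return (uint32_t)(rs1 + rs2); }
uint64_t subwu(uint64_t rs1, uint64_t rs2)  { return (uint32_t)(rs1 - rs2); }

/* *UW: zero-extend rs1 first, then operate at full width */
uint64_t adduw(uint64_t rs1, uint64_t rs2)  { return (uint64_t)(uint32_t)rs1 + rs2; }
uint64_t rsubuw(uint64_t rs1, uint64_t rs2) { return rs2 - (uint64_t)(uint32_t)rs1; }
uint64_t slliuw(uint64_t rs1, int shamt)    { return (uint64_t)(uint32_t)rs1 << shamt; }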

Afaict ADDUW, RSUBUW and SLLIUW should actually be sufficient for all the address-calculation cases that don't need MUL. Am I missing something? It would be interesting to see what results you get with just those three instructions.

Even though we don't "subtract" in address calculations after the cast, we need RSUBUW because we might want to subtract the un-shifted value as part of the shift-add/sub pattern for simple multiplies (e.g. 7*x = (x << 3) - x).

And it needs to be RSUB, otherwise we'd zero-extend the wrong argument. Or we move the zero-extend to rs2 for ADD/SUB (but of course leave it on rs1 for SLL).

I'm ambivalent about ADDWU and SUBWU.

ADDIWU, ADDWU, and SUBWU collectively save ..., with ADDIWU accounting for a significant majority.

How significant? I'm currently leaning a bit towards including ADDIWU, but not bothering with ADDWU and SUBWU.

aswaterman commented on August 16, 2024

Like 70%, IIRC. ADDWU and SUBWU are both very cheap to provide in terms of opcode space, and including them has no effect on the datapath and will likely reduce control cost. So I think they should be included if ADDIWU is included.

RSUBUW is a lot more costly than the other instructions, as it requires conditionally inverting an ALU input that doesn't currently require it. Given that addition is vastly more common than subtraction in address calculations, I don't think it can be justified. I think RSUBUW also has a special case for when rs1[31:0] == -1; need to suppress the carry into bit 32.

ADDUW is quite a bit easier to swallow than RSUBUW, as the zero-extension can be folded into the operand muxing at only control cost.

cliffordwolf commented on August 16, 2024

RSUBUW is a lot more costly than the other instructions, as it requires conditionally inverting an ALU input that doesn't currently require it.

That's why I suggested performing the zero-extension on rs2. XLEN/2 ANDs are significantly cheaper than XLEN XORs.

Given that addition is vastly more common than subtraction in address calculations, I don't think it can be justified.

One address calculation has at most one RSUBUW, but can have multiple ADDUWs. But if you need an explicit zero-extend for the SUB anyway, then there is no gain from having an ADD with a built-in zero-extend.

I think RSUBUW also has a special case for when rs1[31:0] == -1; need to suppress the carry into bit 32.

I don't think so. The following seems to do the expected thing when compiled and run, and passes formal verification for all x values with cbmc --function demo demo.c:

#include <stdio.h>
#include <stdint.h>
#include <assert.h>

int64_t slliuw(int64_t rs1, int shamt)
{
        /* SLLIUW: zero-extend rs1[31:0], then shift left */
        int64_t rs1_zext = (uint32_t)rs1;
        return rs1_zext << shamt;
}

int64_t rsubuw(int64_t rs1, int64_t rs2)
{
        /* RSUBUW: zero-extend rs1[31:0], then subtract it from rs2 */
        int64_t rs1_zext = (uint32_t)rs1;
        return rs2 - rs1_zext;
}

void demo(int32_t x)
{
        int64_t sx = x;
        int64_t ux = (uint32_t)x;

        printf("X:   0x%016llx\n", (long long)x);
        printf("SX:  0x%016llx\n", (long long)sx);
        printf("UX:  0x%016llx\n", (long long)ux);

        int64_t ref = ux*7;
        int64_t tst = rsubuw(sx, slliuw(sx, 3));  /* 8*ux - ux == 7*ux */

        printf("REF: 0x%016llx\n", (long long)ref);
        printf("TST: 0x%016llx\n", (long long)tst);

        assert(ref == tst);
}

int main() {
        demo(-1);
        return 0;
}

Like 70%, IIRC. ADDWU and SUBWU are both very cheap to provide in terms of opcode space, and including them has no effect on the datapath and will likely reduce control cost. So I think they should be included if ADDIWU is included.

I was hoping for more like 90%. I'm not worried about opcode space or datapath complexity, more about finding nice encodings for all those ADD/SUB instructions without trampling over as-yet-untouched funct7 space. But I think I have a good plan for how to add ADDWU and SUBWU. I will write it up later today.

cliffordwolf commented on August 16, 2024

The latest draft now contains ADDWU, SUBWU, ADDIWU, ADDU.W, SUBU.W, and SLLIU.W instructions that address the above issues with 64-bit address calculation using 32-bit indices. Looking forward to a more detailed analysis when there's proper compiler support.
