riscv / riscv-bitmanip

Working draft of the proposed RISC-V Bitmanipulation extension

Home Page: https://jira.riscv.org/browse/RVG-122

License: Creative Commons Attribution 4.0 International



riscv-bitmanip's Issues

Build riscv64-unknown-linux-gnu- tools

Is there any way to build the Linux toolchain with B support?

I took a look at the scripts, and it looks like you could add a build step for the Linux tools(?).

minor bitmanip patch required on next gcc rebase

I just committed a patch upstream to optimize a zero-extend followed by an array indexing left shift. This was sometimes three instructions, now it is two. This added a pattern zero_extendsidi2_shifted that is identical to the bitmanip branch slliuw pattern, except that it splits into two shifts instead of emitting an slliuw instruction. In order for the slliuw pattern to continue working, this upstream pattern will need a ! TARGET_BITMANIP check added to its condition. This issue is just to document the problem for when we eventually rebase later.

Make instruction names consistent for B and V extensions

In a few instances, the current drafts for the B and V extensions have instructions that give different names to the same operation.

The B extension has PCNT, while the V extension has an instruction called VMPOPC, where "POPC" also stands for population count.

The B extension has ANDC, while the V extension has VMANDNOT and VMORNOT.

For the examples above, one or the other extension (or both) must be changed to avoid gratuitous inconsistencies.

Furthermore, the B extension has CMOV, while the V extension has VMERGE that performs the same function element-wise for vectors. (My preference would be to have SELECT and VSELECT, or SEL and VSEL, but if those are impossible, I propose renaming CMOV as MERGE.)

Finally, the B extension has BEXT which operates on bits, while the V extension has VCOMPRESS that performs the same function on vector elements.

Missing library file "vsupport.h"

I'm trying to compile examples with the compiled toolchain from the riscv repository. However, I could not find said file in any repository here. Where does it come from?

sh1addu.w doesn't work

In the rvb_simple EU:
When executing sh1addu.w, shadd_active == 1 and wuw_active == 1 at the same time,
so rd = shadd_out | wuw_dout.

wuw_dout == 1 because din_insn14 == 0. In sh2addu.w and sh3addu.w, din_insn14 == 1, so they work.

Exceptions

@cliffordwolf
Hi Clifford
For an XLEN=32 or XLEN=64 implementation, should the following instruction raise an exception?

sbseti x1, x2, 127

Thx
Lee

Discussion on *WU instructions

Continuation of off-topic discussion in #10.

Quick summary:

@jhauser-us:

[T]here's a whole category of instructions that would have more impact but aren't currently included, and that is the unsigned equivalents of the existing RV64I *W instructions: ADDWU, SUBWU, SLLWU, etc. These would be just like the existing *W instructions but instead zeroing the upper 32 bits, as appropriate for an unsigned int or uint32_t result type rather than int or int32_t. We hardly need to run any experiments to know that such *WU instructions would be used far more frequently than the *W instructions proposed for the B extension.

@cliffordwolf:

The idea behind *W instructions is simply that they operate on the lower 32 bits. It makes sense to have a consistent scheme for how to fill the upper bits, but it doesn't matter much what this scheme is exactly, if that scheme is sign-extend or zero-extend.

@brucehoult:

If you provide *W operations you have to decide what to do with the upper bits. You can leave them alone (x86), zero extend them (Aarch64), or sign extend them (RISC-V). If you leave them alone then casting a 32 bit value to a 64 bit value requires a sext or zext every time. If you sign extend them then only unsigned values require a zext, signed ones are already correct. If you zero extend them then only signed values require a sext, unsigned ones are already correct. It's hard to say which is better. Most normal application code uses signed more than unsigned, favouring sign extension.

@jhauser-us:

Andrew has told me he wants to keep open the option for *WU instructions for now, which implies to me we should devise a system to reserve the encoding space now, even if the idea eventually gets dropped. I haven't run this particular system by him yet, but I intend to do so soon.

Naturally, we would want any system that's adopted to be fully compatible with the B extension, by tweaking either or both as necessary. I'll be looking into this question soon. And now you or anyone else can do so too, if you're so inclined.

But this is getting off-topic for this GitHub issue, so we should move the discussion elsewhere if you'd like to continue.

@brucehoult:

I agree that there is merit in systematically adding *WU versions of R-type instructions that have *W versions. If you do it at all then it should be for BOTH the base instruction set and for any new *W instructions in the BitManip extension.

This is easily affordable in terms of opcode space by, as has been pointed out, using something in the hi bits of the instruction, keeping the identical opcode and func3 to the *W version. It's also very cheap to implement.

Probably the only OP-IMM-32 instruction that can be justified is ADDIWU. That does need a new opcode.

@jhauser-us:

Interestingly, in a message to me yesterday where we talked about possible *WU instructions, Andrew literally wrote:

I’m not opposed to putting them in B.

To be sure, that expresses ambivalence more than advocacy. But as far as "could be persuaded" goes, I believe yes.

After that I conceded that there is a reasonable way to add *WU instructions and described the encoding that I would prefer.

I hope everyone feels like this summary treats them fairly. Please post your corrections below.
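
As a concrete illustration of the trade-off being discussed (my own hedged example, not from the thread): on RV64, a 32-bit unsigned intermediate that feeds a 64-bit computation must have its upper 32 bits cleared, which is exactly what a *WU instruction would do in one operation.

#include <stdint.h>

/* Hypothetical example: unsigned 32-bit arithmetic whose result is consumed as
   a 64-bit value.  With only sign-extending *W instructions, the compiler has
   to insert an explicit zero-extension of i before the address calculation;
   an ADDWU-style instruction would fold that in. */
uint64_t index_sum(const uint64_t *table, uint32_t a, uint32_t b)
{
    uint32_t i = a + b;   /* 32-bit wrap-around semantics */
    return table[i];      /* i must be zero-extended to 64 bits here */
}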

Need better support for signed bytes and halfwords?

With the current draft of the B extension, RISC-V has single instructions that can be used to zero-extend a byte (ANDI), halfword (PACKW), or word (PACK or ADDIWU) to full 64-bit register width. A word can also be sign-extended in one instruction (ADDIW). But, unless I'm mistaken, there is still no single instruction that can sign-extend a byte or halfword to 64 bits. If there are constituencies out there that make frequent use of signed char and short types (embedded applications with limited memory, perhaps?), such instructions might get more use overall than others that are being included.

Similarly, for reading big-endian data, we currently have the ability in only two instructions to load and byte-swap an unsigned halfword (LHU + BSWAP.H), an unsigned word (LWU + BSWAP.W), or a signed word (LWU + GREVIW), but not a signed halfword, which takes three instructions.

bitmanip opcode encoding table: FSRI overlap SBEXTI, GORCI, GREVI, RORI, SROI, SRAI, SRLI?

Hi,
In the opcode encodings table, FSRI shows that bit 26 needs to be 1. Can we add to the table that for SBEXTI, GORCI, GREVI, RORI, SROI, SRAI, and SRLI, bit 26 needs to be 0? Otherwise it looks like there is overlap in the encodings. I see the text above the table mentions that op[26]=1 selects funnel shifts, but it might be helpful to show this in the table as well.

Thanks,
Dan

Suggest renaming GREVI pseudo-instructions to mirror ZIP/UNZIP

The pseudo-instructions defined for GREVI (BREV.P, PSWAP.N, etc.) are like the ZIP and UNZIP pseudo-instructions in that they move "units" of a power-of-two size within "components" of a larger power-of-two size. For the ZIP/UNZIP pseudo-instructions, there is a simple pattern of

ZIP<unit-size><component-suffix>
UNZIP<unit-size><component-suffix>

In this system, <unit-size> is either empty, meaning 1 bit, or is a decimal number of bits ("2", "4", "8", or "16"); and <component-suffix> is either empty, meaning the full register size (XLEN), or is one of the suffixes '.N', '.B', '.H', or '.W'.

It would aid comprehension if the GREVI pseudo-instructions followed the same system. I propose

REV<unit-size><component-suffix>

This would rename all of the GREVI pseudo-instructions as follows:

BREV.P  -> REV.P
PSWAP.N -> REV2.N
BREV.N  -> REV.N
NSWAP.B -> REV4.B
PSWAP.B -> REV2.B
BREV.B  -> REV.B
BSWAP.H -> REV8.H
NSWAP.H -> REV4.H
PSWAP.H -> REV2.H
BREV.H  -> REV.H
HSWAP.W -> REV16.W
BSWAP.W -> REV8.W
NSWAP.W -> REV4.W
PSWAP.W -> REV2.W
BREV.W  -> REV.W
WSWAP   -> REV32
HSWAP   -> REV16
BSWAP   -> REV8
NSWAP   -> REV4
PSWAP   -> REV2
BREV    -> REV

If the name "BSWAP" is so entrenched that we feel we must have this mnemonic, then BSWAP can be made another pseudo-instruction alias for REV8.

preprocessor macro for bitmanip

The gcc patch should define a preprocessor macro so end users can check to see if bitmanip support is enabled for the target. I would suggest __riscv_bitmanip since the rest of the code seems to be using bitmanip consistently.

See also the riscv/riscv-c-api-doc repo where we are documenting C API issues like preprocessor macros. I'm filing a pull request there to suggest __riscv_bitmanip which can be changed if someone has a better suggestion.
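
A minimal usage sketch, assuming the macro ends up being named __riscv_bitmanip (the fallback path is purely illustrative):

#include <stdint.h>

uint32_t count_bits(uint32_t x)
{
#if defined(__riscv_bitmanip)
    /* B extension enabled for the target: let the compiler emit pcnt */
    return (uint32_t)__builtin_popcount(x);
#else
    /* portable fallback when the extension is not available */
    uint32_t n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
#endif
}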

Bring sanity to source operand order for ternary instructions

Please, please, please, in the assembly language for the ternary instructions, do not place the control operand of CMIX between the other two source operands, do not place the condition operand of CMOV between the other two source operands, and do not place the shift amount for funnel shifts (FSL and FSR) between the other two source operands.

I understand the hardware motivation for having the control and shift amounts be in rs2, which forces the other two operands to be rs1 and rs3. But it would be better to define the assembly language for these instructions as

CMIX rd,rs2,rs1,rs3
CMOV rd,rs2,rs1,rs3
FSL  rd,rs1,rs3,rs2
FSR  rd,rs1,rs3,rs2

Whatever extra trouble a nonlinear operand order might cause for tools authors, it is nothing compared to the multiplicative effect of foisting an illogical order on programmers. Let's not forget, there are literally thousands of programmers for every tools author, and we'd prefer as often as possible to help those programmers write bug-free code.

I note that assembly language pseudo-instructions already provide some precedent for breaking a definite connection between operand order and source register numbers. Store instructions are another existing exceptional case, being written as

SW rs2,offset(rs1)

and not

SW rs1,rs2,offset

On some INSTW or INST.W instruction definitions

There are some INSTW or INST.W instructions, such as:

packw rd, rs1, rs2

and, quoting the spec source:

{\tt addu.w} and {\tt subu.w} are identical to {\tt add} and {\tt sub}, except ...

W instructions are meant to keep 32-bit computations on an RV64 machine. Are the following properties valid?

  1. Property A: INST.W instructions on RV64 generate the same result as INST ones on RV32.
  2. Property B: INST.W instructions always sign-extend the lower 32 bits of the result.

In the current bitmanip specs, these two properties seem not to be valid; for example, shnaddu.w, adduw, and subuw return 64-bit results, not the sign-extended lower 32 bits.

spike outputs "args unknown" on bitmanip instruction

Hi,

I got the output below when using spike to run a B extension test.

core 0: 0xffffffff80001742 (0x60191b93) ctz (args unknown)
core 0: 0xffffffff80001746 (0x61a01a33) rol (args unknown)
core 0: 0xffffffff8000174a (0x003199a3) sh gp, 19(gp)
core 0: 0xffffffff8000174e (0x406eeeb3) orn (args unknown)
core 0: 0xffffffff80001752 (0x41f6fbb3) andn (args unknown)
core 0: 0xffffffff80001756 (0x0aeef133) maxu (args unknown)

Here is my command.
spike --isa=rv32imcb -l test.o

Any suggestion? Thanks

About register content and operand order in pack/packu/packh

This is the ARM pack instruction:
PKHBT Rd, Rn, Rm ## Rd = Rm[31:16]|Rn[15:0], Bottom of Rn, Top of Rm
PKHTB Rd, Rn, Rm ## Rd = Rn[31:16]|Rm[15:0], Top of Rn, Bottom of Rm
I can easily tell which part is taken from Rn/Rm.

In pack instruction format:
pack rd, rs1, rs2

rd = rs2[15:0]|rs1[15:0]

which is kind of the reverse of the order in the assembly expression.

Also in funnel shift
fsr rd, rs1, rs3, rs2

tmp[63:0] = [rs3, rs1] >> rs2

rd = tmp[31:0]

Is it a little-endian concept?
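
For reference, the pack semantics in the draft's pseudocode style (my paraphrase; uint_xlen_t and XLEN as used elsewhere in the spec): rs1 supplies the lower half of rd and rs2 the upper half, so the assembly order rd, rs1, rs2 lists the parts from least to most significant.

uint_xlen_t pack(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t lower = rs1 << XLEN/2 >> XLEN/2;  /* low half of rs1           */
    uint_xlen_t upper = rs2 << XLEN/2;            /* low half of rs2, moved up */
    return lower | upper;
}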

Add sext.h and sext.b (Zbb)

Moving this discussion here from
https://groups.google.com/a/groups.riscv.org/forum/?utm_medium=email&utm_source=footer#!msg/isa-dev/0emw3Y8ZNxY/eUT5_IzaAwAJ.

The proposal is to add dedicated sext.h and sext.b instructions.

uint_xlen_t sext_h(uint_xlen_t rs)
{
    int shamt = XLEN - 16;
    return sra(sll(rs, shamt), shamt);
}

uint_xlen_t sext_b(uint_xlen_t rs)
{
    int shamt = XLEN - 8;
    return sra(sll(rs, shamt), shamt);
}

The encoding cost would be minimal because these are unary instructions.

The hardware cost would be acceptable, if the instruction is being used.

The main argument for this instruction is that the RISC-V calling convention requires arguments < 32 bit to be sign/zero extended according to their type. For example:

extern "C" int foo(short);

int bar(int a, int b) {
    return foo(a+b);
}

This is compiled to the following without the B extension:

bar(int, int):
        addw    a0,a0,a1
        slliw   a0,a0,16
        sraiw   a0,a0,16
        tail    foo

And it could be compiled to the following with sext.h:

bar(int, int):
        addw    a0,a0,a1
        sext.h  a0,a0
        tail    foo

The expectation is that function arguments < 32 bit may be common in code that is ported to RISC-V from smaller 8-bit or 16-bit micro controllers.

With those instructions added we would be able to zero-extend or sign-extend any 8-, 16-, or 32-bit value in a single instruction:

Width   sign extend          zero extend
8       sext.b rd,rs         andi rd,rs,255
16      sext.h rd,rs         pack[w] rd,rs,zero
32      addw rd,rs,zero      pack rd,rs,zero

Most *W instructions don't have sufficient justification

One of the frequently asked questions listed in the document is:

Do we really need all the *W opcodes for 32 bit ops on RV64?

In my opinion, the only *W instructions in the B extension that might have significant value are the rotate instructions, RORW, ROLW, RORIW, and maybe PACKW. To help decide, the document proposes running "proper experiments with compilers that support those instructions". That would be ideal of course, but I'm skeptical the community will wait on the B extension long enough for that to happen. (Plus it's not exactly easy to set up unbiased experiments for these kinds of specialized features.)

In the meantime, I'd like to point out that most of the proposed *W instructions can be substituted by a sequence of only 2 or 3 other instructions. The following are believed to be equivalent sequences (not always unique):

CLZW rd,rs

SLOI rd,rs,32
CLZ rd,rd

CTZW rd,rs

LI temp,-1
PACK rd,temp,rs
CTZ rd,rd

PCNTW rd,rs

PACK rd,zero,rs
PCNT rd,rd

SLOIW rd,rs1,i

SLOI rd,rs1,i
SEXT.W rd,rd

SLOW rd,rs1,rs2

ANDI temp,rs2,31
SLO rd,rs1,temp
SEXT.W rd,rd

SROIW rd,rs1,i

If i > 0:

SLLI rd,rs1,32
SROI rd,rd,(i+32)

If i = 0:

SEXT.W rd,rs1

SROW rd,rs1,rs2

NOT temp,rs1
SRLW rd,temp,rs2
NOT rd,rd

GREVIW rd,rs1,i

GREVI rd,rs1,i
SEXT.W rd,rd

GREVW rd,rs1,rs2

ANDI temp,rs2,31
GREV rd,rs1,temp
SEXT.W rd,rd

SHFLIW rd,rs1,i

SHFLI rd,rs1,i
SEXT.W rd,rd

SHFLW rd,rs1,rs2

ANDI temp,rs2,15
SHFL rd,rs1,temp
SEXT.W rd,rd

UNSHFLIW rd,rs1,i

UNSHFLI rd,rs1,i
SEXT.W rd,rd

UNSHFLW rd,rs1,rs2

ANDI temp,rs2,15
UNSHFL rd,rs1,temp
SEXT.W rd,rd

BEXTW rd,rs1,rs2

If rs2 is a known constant, rs2[63:32] = 0, and at least one bit in rs2[31:0] is a zero (very likely):

BEXT rd,rs1,rs2

If rs2 is a known constant, rs2[63:32] != 0 (unlikely), and at least one bit in rs2[31:0] is a zero:

PACK temp,zero,rs2
BEXT rd,rs1,temp

If rs2 is a known constant and rs2[31:0] = 0xFFFFFFFF (unlikely):

SEXT.W rd,rs1

If rs2 is not a known constant:

PACK temp,zero,rs2
BEXT rd,rs1,temp
SEXT.W rd,rd

BDEPW rd,rs1,rs2

If rs2 is a known constant and rs2[63:31] = 0:

BDEP rd,rs1,rs2

Else:

BDEP rd,rs1,rs2
SEXT.W rd,rd

CLMULW rd,rs1,rs2

CLMUL rd,rs1,rs2
SEXT.W rd,rd

FSLW rd,rs1,rs2,rs3

PACK rd,rs3,rs1
FSL rd,rd,rs2,rd
SEXT.W rd,rd

FSRW rd,rs1,rs2,rs3

PACK rd,rs1,rs3
FSR rd,rd,rs2,rd
SEXT.W rd,rd

Note that a final SEXT.W rd,rd can be eliminated if the rd result is known to be used only in subsequent 32-bit operations (such as SW or other *W instructions). Other optimizing tweaks are also possible, depending on the circumstance.

Unless the proposed *W instructions can be shown to be much more prevalent than I expect, the combination of rare utility plus relatively easy synthesis from other instructions argues strongly for dropping them.

The document also says:

But they add very little complexity to the core. So the only question is if it is worth the encoding space.

While "very little complexity" may be true, I disagree that it should be dismissed and only encoding space considered. "Very little complexity" certainly lowers the threshold of utility an instruction must demonstrate to be acceptable, but it doesn't make the instruction free to add. There are many other possible instructions of very little complexity that we so far choose to exclude, and these *W instructions perhaps should be among them.

For instance, there's a whole category of instructions that would have more impact but aren't currently included, and that is the unsigned equivalents of the existing RV64I *W instructions: ADDWU, SUBWU, SLLWU, etc. These would be just like the existing *W instructions but instead zeroing the upper 32 bits, as appropriate for an unsigned int or uint32_t result type rather than int or int32_t. We hardly need to run any experiments to know that such *WU instructions would be used far more frequently than the *W instructions proposed for the B extension.

Rename ADDUW, SUBUW, SLLIUW

The new "prefix zero-extend" instructions currently have names that don't follow the existing convention for *W instructions. In particular, these new instructions do not act at all like instructions DIVUW and REMUW, which perform their operation on two unsigned 32-bit values and then sign-extend the 32-bit result.

To avoid confusion, the new instructions need different names. I propose

ADDZX
SUBZX
SLLZXI

where "ZX" stands for "zero-extend". For example, instruction

SLLZXI rd,rs1,i

acts the same as the sequence

ZEXT.W rd,rs1
SLLI rd,rd,i

grev/grevi issue with rs2 vs imm

It starts off by listing both grev ... rs2 and grevi ... imm.

Then the first paragraph says "It takes in a single register value and an immediate ..." which conflicts with the above and suggests the last operand must be an immediate.

Then the second paragraph says "This operation iteratively checks each bit i in rs2 ..." which suggests that the last operand must be a register.
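
For context, both forms compute the same function; only the source of the control value differs. A 32-bit sketch along the lines of the draft's reference code, where the immediate of grevi simply takes the place of rs2:

uint32_t grev32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 31;   /* for grevi, the immediate plays this role */
    if (shamt &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (shamt &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (shamt &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (shamt &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (shamt & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}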

SLLI, SRLI, ... ROR: Confusion about encoding

Hi all,

I don't understand the encoding for the SLLI, SRLI, ... ROR family of instructions.
Bitmanip v0.9 spec, page 35:

[figure: encoding table from the Bitmanip v0.9 spec, page 35]

The aforementioned SLLI* instruction is not defined this way in the spec. There
is a 6-bit immediate field, not 7, as shown below for rv64i (page 30 of ISA spec v2.2):

[figure: RV64 SLLI/SRLI/SRAI encoding from the ISA spec]

Moreover, the assembler code treats the first two fields like a funct6, followed by a 6-bit immediate:

... opcode/riscv.h:
#define OP_MASK_SHAMT		0x3f
#define OP_SH_SHAMT		20
... riscv-opc.c:
#define USE_BITS(mask,shift)    (used_bits |= ((insn_t)(mask) << (shift)))
...
      case '>': USE_BITS (OP_MASK_SHAMT,        OP_SH_SHAMT);   break;

Could someone please explain what's going on here? And if this is correct, should it not have its own encoding format?

Incorrect operands type for bswapdi2 in gcc/config/riscv/bitmanip.md

Hello,

Currently bswapdi2 is defined in bitmanip.md as follows:

(define_insn "bswapdi2"
  [(set (match_operand:SI 0 "register_operand" "=r")
	(bswap:SI (match_operand:SI 1 "register_operand" "r")))]
  "TARGET_64BIT && TARGET_BITMANIP"
  "grevi\t%0,%1,0x38"
  [(set_attr "type" "bitmanip")])

Looks like SI should be replaced with DI in this definition.
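
With that substitution, the pattern would read as follows (a sketch of the suggested fix, not a committed patch):

(define_insn "bswapdi2"
  [(set (match_operand:DI 0 "register_operand" "=r")
	(bswap:DI (match_operand:DI 1 "register_operand" "r")))]
  "TARGET_64BIT && TARGET_BITMANIP"
  "grevi\t%0,%1,0x38"
  [(set_attr "type" "bitmanip")])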

My test was a function listed below (originally defined in libgcc/libgcc2.c):

typedef long DItype;

DItype
__bswapdi2 (DItype u)
{
  return ((((u) & 0xff00000000000000ull) >> 56)
          | (((u) & 0x00ff000000000000ull) >> 40)
          | (((u) & 0x0000ff0000000000ull) >> 24)
          | (((u) & 0x000000ff00000000ull) >>  8)
          | (((u) & 0x00000000ff000000ull) <<  8)
          | (((u) & 0x0000000000ff0000ull) << 24)
          | (((u) & 0x000000000000ff00ull) << 40)
          | (((u) & 0x00000000000000ffull) << 56));
}

Before replacement of SI with DI the function assembly was broken:

__bswapdi2:
        addi    sp,sp,-16
        sd      ra,8(sp)
        call    __bswapdi2    /// <-- infinite recursion
        ld      ra,8(sp)
        addi    sp,sp,16
        jr      ra

After replacement it looks good:

__bswapdi2:
        grevi   a0,a0,0x38
        ret

Please make a fix.

gcc experimenting

I tried adding gcc optimization support for the b extension. This is one day of work, so I only added the easy ones, didn't verify the results with execution, and haven't tried to handle every case. The assembler is missing support for the rev and zext aliases but I can emit the pack and grevi instructions for those. The assembler is missing support for the addwu, subwu, addu.w, subu.w, and slliu.w instructions, so those are disabled though I am able to generate them.

This patch doesn't affect dhrystone, but for coremark I see a 280 byte reduction in size, which is about 1.5%, with 99 pack instructions and 3 max instructions. Then I realized I had the signed ee_u32 hack in my tree, so I tried undoing that. Now I see a 384 byte reduction in size, which is about 2%, with 190 pack instructions, 3 max instructions, and 1 maxu instruction. We can perhaps get better results with support for the missing addwu etc instructions.

gcc-b-support.patch.txt
tmp.c.txt

maybe use pcnt(rs1 ^ rs2) ?

In https://groups.google.com/d/msg/comp.arch/8MR8_O-wCeE/8pyiGYz8AQAJ Pedro Pereira suggests:

In the latest bitmanip extension document, the popcount opcode is defined as:

rd = pcnt(rs)

a more useful primitive would be:

rd = pcnt(rs1 ^ rs2)

Since the RISC-V has a zero register (x0), the suggested version could
encode the first one as "rd = pcnt(rs ^ x0)".

I don't imagine that reading one extra register and
performing a xor would make the instruction need more cycles.
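
A hedged sketch of the main use case, Hamming distance, in the spec's pseudocode style (pcnt here stands for the proposed instruction semantics):

/* Today this is an XOR followed by a PCNT (two instructions); with the
   proposed form it is a single instruction, and plain population count
   is recovered as pcnt(rs ^ x0). */
int hamming_distance(uint_xlen_t a, uint_xlen_t b)
{
    return pcnt(a ^ b);
}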

rvintrin.h file name

This is a very general name. Other extensions will also want intrinsics files. Perhaps B extension intrinsics should be in a file with a name like rvb-intrin.h to make it clear that these are RISC-V B extension intrinsics.

If we are putting all intrinsics in one file, then we may need to conditionalize them based on whether that particular extension is enabled. That requires a macro as per issue #28.

Should shift immediate for SLLIU.W be 6 bits?

The document says

slliu.w is identical to slli, except that bits XLEN-1:32 of the rs1 argument are cleared before the shift.

However, the proposed encoding for SLLIU.W shows another difference: unlike the RV64 SLLI, the shift immediate for SLLIU.W is only 5 bits, supporting a maximum shift of 31 bits. If this was intentional, the limitation should be explained. Otherwise, it looks like the encoding will need to be changed to accommodate a 6-bit shift for RV64 and, I presume, a 7-bit shift eventually for RV128.
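
For reference, the described semantics in the draft's pseudocode style (my paraphrase), with the shift-amount width being exactly the open question:

uint_xlen_t slliu_w(uint_xlen_t rs1, int imm)
{
    uint_xlen_t u = rs1 << (XLEN - 32) >> (XLEN - 32);  /* clear bits XLEN-1:32 */
    return u << imm;  /* imm needs 6 bits on RV64 (and presumably 7 on RV128) for the full range */
}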

REPACK[I] (Zbf) and PACKB (Zbb)

Reflecting on https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/0emw3Y8ZNxY I'd like to propose two new instructions for packing structs of bitfields and bytes.

Note that the RISC-V calling convention requires structs that fit in a register to be passed in a register when passed by value. This means that in the worst case we need to pack that register on each function call.

REPACK

First, a REPACK[I] instruction (in Zbf) with the following semantic would help improve the performance of bitfield packing.

uint_xlen_t repack(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN-1);
    uint_xlen_t lower = (rs1 << XLEN/2) >> XLEN/2;  /* low half of rs1 */
    uint_xlen_t upper = (rs1 >> XLEN/2) << shamt;   /* high half of rs1, moved to the offset */
    uint_xlen_t mask = ~(uint_xlen_t)0 << shamt;
    return (upper & mask ) | (lower & ~mask);
}

That is, take the upper half of rs1 and place it over the lower half of rs1 at the offset specified by rs2. This could likely re-use most of the circuitry for BFP, the other instruction in Zbf.

Packing N data registers D0,D1,..,D(N-1) into a bit field, using the lengths L0,L1,..,L(N-1):

PACK a0,D0,D1
REPACKI a0,a0,L0
PACK a0,a0,D2
REPACKI a0,a0,L0+L1
...
PACK a0,a0,D(N-1)
REPACKI a0,a0,L0+L1+...+L(N-2)

A word with N bitfields can be packed in 2*(N-1) instructions this way, when L0+L1+...+L(N-2) < XLEN/2 and L(i) <= XLEN/2 for all i in 0..(N-1). Only a few extra instructions are needed to stitch together the larger pieces in the remaining cases.

The main difference in use-case between REPACK and BFP is that the former is primarily useful for constructing a new struct of bitfields from its members, whereas the latter is primarily useful when overwriting one particular bitfield in such an existing struct, usually as part of a read-modify-write pattern.

PACKB

A Pack Bytes (PACKB) instruction in Zbb would help to pack structs of bytes.

uint_xlen_t packb(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return (rs1&255) | ((rs2&255)<<8);
}

This would allow packing of 4 bytes into a 32-bit word in 3 instructions instead of 5 and would only require "Zbb" (that is, it would not require SHFL, unlike the 5-instruction solution):

PACKB a0, a0, a1
PACKB a1, a2, a3
PACK[W] a0, a0, a1

Clarification on cmix documentation

I need a small clarification. The description for the cmix instruction says that -

It is equivalent to the following sequence.
and rd, rs1, rs2
andn t0, rs3, rs2
or rd, rd, t0

Is it implied that the register t0 will be modified as a result of the execution of this instruction?
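
A single-expression sketch of the same semantics (my paraphrase of the description; no scratch register is involved in the result):

uint_xlen_t cmix(uint_xlen_t rs2, uint_xlen_t rs1, uint_xlen_t rs3)
{
    /* bits of rs1 where rs2 is 1, bits of rs3 where rs2 is 0 */
    return (rs1 & rs2) | (rs3 & ~rs2);
}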

Pull request outstanding

@cliffordwolf
Hi Clifford, could you please take a look at the pull request I made a week ago and review whether it can be merged or whether I need to make some changes?
Many Thx
Lee

Enhanced functionality for GREV / GREVI

I propose a compatible, useful, and low cost enhancement to the GREV/GREVI Generalized Reverse instructions.

At present each stage of GREV can either swap each pair of bits or else propagate them unchanged, as determined by the SHAMT bit for that stage.

I propose to perform some other function on each pair that would normally be swapped, with the same function being substituted for "swap" at each stage. The function to be performed is specified by one or more currently unused bits in rs2 or imm e.g. bit 6, or bits 6-7, or perhaps higher numbered bits to allow for 128 bit CPUs.

Supposing two bits are used, the encoding might be:

00: swap the two bits
01: both outputs are the OR of the two input bits
10: both outputs are the AND of the two input bits
11: I don't have a candidate. XOR isn't useful.

Due to DeMorgan's laws it is not necessary to provide both AND and OR, so if a useful 4th function can't be thought of then perhaps only one bit should be used.

EFFECT

I anticipate that OR or AND processing would normally be added to grev.w, grev.h, grev.b, grev.n, or grev.p. I have not evaluated whether use with other mask values is useful.

When used with one of the above masks the effect of OR instead of SWAP is to set the entire field to 1s if any bit in the field is 1. Fields consisting entirely of 0s remain as 0s. The effect of AND is to set the entire field to 0s if any bit in the field is 0. Fields consisting entirely of 1s remain as 1s.

For example, with an input of cbf20097147200ac the output of GREV.B.OR is ffff00ffffff00ff.

APPLICATIONS

If the output of the above GREV.B.OR is inverted to 0000ff000000ff00 then CLZ or CTZ can be used to determine the position of the first or last zero byte in the input value.

Alternatively, the input value could be inverted and then GREV.B.AND will produce the necessary input for CLZ or CTZ.

This is very valuable in efficient implementation of C string processing functions such as strlen(), strcpy(), strcmp().

Along with general benefits, this will provide a large boost to RISC-V scores in Dhrystone.

Using GREV.H.OR provides the same functionality for UTF-16 or UCS-2, or GREV.W.OR for UTF-32 or UCS-4.
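
A hedged C model of the proposed GREV.B.OR effect and the zero-byte scan built on it (the helper names are illustrative, not part of the draft):

#include <stdint.h>

/* Model of the proposed GREV.B.OR on a 64-bit value: each byte of the result
   becomes 0xFF if the corresponding input byte is nonzero, 0x00 otherwise. */
static uint64_t grev_b_or(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 8)
        if ((x >> i) & 0xFF)
            r |= 0xFFull << i;
    return r;
}

/* Index of the first zero byte (little-endian), or 8 if there is none:
   invert the OR-combined value and count trailing zeros, as CTZ would. */
static int first_zero_byte(uint64_t x)
{
    uint64_t m = ~grev_b_or(x);
    return m ? __builtin_ctzll(m) / 8 : 8;
}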

COST

GREV is dominated by wire cost. The logic at each node is extremely small and increasing its size will not meaningfully impact the cost of GREV in either SoC or FPGA.

In particular, in FPGAs with splittable 6-LUTs we have five inputs (the two input bits, the SHAMT swap/fn enable for the stage, and my proposed two function select bits) and these inputs determine two independent bit outputs -- a perfect fit.

The attached program provides a reference C implementation of the proposed modification, checks that it produces the same output as the existing reference implementation when the function select bits are 0, and demonstrates finding all-zero fields of widths 4 to 32.

grev.txt

Consider changing BFP not to wrap around new field in result

As instruction BFP is currently defined, the bit field it overlays in the rs1 value may wrap around to span both the high (most-significant) and low (least-significant) ends of the result. For example, this sequence,

li t0,12<<24|26<<16|0xABCD
bfp t1,zero,t0

for RV32, leaves t1 with the value 0x340002AF, because the 16-bit value 0xABCD shifted left 26 bits (without clipping) is 0x2AF34000000, and this value wraps around from high bits to low bits in the result.

Are there expected advantages to this wrapping? My analysis indicates that the hardware for BFP (and for the B extension generally) can be reduced a little by not defining BFP to wrap around this way. The basic reason is that BFP requires the hardware to separately create a mask in addition to shifting (or rotating) rs2[15:0], and forcing this mask to wrap around adds a little extra circuitry.

I can imagine some applications might benefit from wrapping around, while others benefit more from not wrapping around. If there are good reasons to prefer wrap-around, I suggest adding an explanation of that choice to the document. If not, I propose modifying the specified behavior to use shifts instead of rotations, like so:

uint_xlen_t bfp(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int len = (rs2 >> 24) & 15;
    int off = (rs2 >> 16) & (XLEN-1);
    len = len ? len : 16;
    uint_xlen_t mask = slo(0, len) << off;
    uint_xlen_t data = rs2 << off;
    return (data & mask) | (rs1 & ~mask);
}

For my example above, the value left in t1 would then be 0x34000000.

(Note that the hardware could still quietly substitute

uint_xlen_t data = rol(rs2, off);

for computing data without changing the behavior, if that's more convenient. The issue is just with the rotation of the mask.)

Should be ANDN instead of ANDC.

The long-established name of the logic operation that ANDs two Boolean inputs and complements the result is NAND, not CAND. RISC-V assembly language has a pseudo-instruction for a bitwise complement called NOT, not COMPL or whatever. This draft extension includes instructions called NAND, NOR, and C.NOT. For consistency, shouldn't the instruction that computes a bitwise AND with the complement of the second operand be called ANDN instead of ANDC?

Consider renaming CLMUL to XMUL

I request that instruction CLMUL, standing for "carry-less multiply", be renamed to XMUL, for "XOR multiply", meaning a multiplication where the partial products are summed by bitwise XORs instead of the usual additions. My reason is simply that I find the name "carry-less multiply" to be awkward, and I'm probably not alone. The name "carry-less multiply" appears not to be entrenched except in connection to x86 processors.

I attempted a search, and as far as I can tell, the "CLMUL" name has been adopted as part of a standard ISA only for the x86 (with SIMD instruction PCLMULQDQ). The B extension draft notes that the equivalent SPARC instruction is XMULX, officially documented as "XOR multiply". Most Web references to "carry-less multiply", "carry-less multiplication", and "carry-less product" seem to point back one way or another to Intel's CLMUL instruction. Also, there appears as yet to be no __builtin_clmul in GCC. The path should therefore be clear for us to choose the name XMUL if others agree with me it would be preferable.
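
The renaming argument is easiest to see from the operation itself; a sketch in the spec's pseudocode style, where the partial products are combined with XOR rather than addition:

uint_xlen_t clmul(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN; i++)
        if ((rs2 >> i) & 1)
            x ^= rs1 << i;   /* XOR-accumulate the shifted partial products */
    return x;
}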

internal compiler error: in decompose, at rtl.h:2279

Hello,

The compilation of the source file foo.c (see below) fails with an internal error when the compiler is run as follows:

$ riscv64-unknown-elf-gcc -O2 -march=rv64ib -mabi=lp64 -S -o foo.S foo.c
during RTL pass: combine
foo.c: In function ‘foo’:
foo.c:4:1: internal compiler error: in decompose, at rtl.h:2279

Source code foo.c:

int foo(int n)
{
    return n + 0x7fffffff;
}

Additional notes:

  • Replacing -O2 with -O{1,0,s} does the same. When -O is omitted, the program compiles.
  • When -march=rv64ib is replaced with -march=rv64i, the program compiles.
  • When -march=rv64ib -mabi=lp64 is replaced with -march=rv32ib -mabi=ilp32, the program compiles.

gcc -v log:

[user@s01 bug]$ ../riscv64b/bin/riscv64-unknown-elf-gcc -Os -march=rv64ib -mabi=lp64 -v -S -o foo.S foo.c
Using built-in specs.
COLLECT_GCC=../riscv64b/bin/riscv64-unknown-elf-gcc
Target: riscv64-unknown-elf
Configured with: ../riscv-gcc/configure --prefix=/home/user/riscv-bitmanip/riscv64b --target=riscv64-unknown-elf --enable-languages=c --disable-libssp
Thread model: single
Supported LTO compression algorithms: zlib
gcc version 10.0.0 20190929 (experimental) (GCC) 
COLLECT_GCC_OPTIONS='-Os' '-march=rv64ib' '-mabi=lp64' '-v' '-S' '-o' 'foo.S'
 /home/user/riscv-bitmanip/riscv64b/libexec/gcc/riscv64-unknown-elf/10.0.0/cc1 -quiet -v foo.c -quiet -dumpbase foo.c -march=rv64ib -mabi=lp64 -auxbase-strip foo.S -Os -version -o foo.S
GNU C17 (GCC) version 10.0.0 20190929 (experimental) (riscv64-unknown-elf)
        compiled by GNU C version 4.8.5 20150623 (Red Hat 4.8.5-39), GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version none
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
ignoring nonexistent directory "/home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/../../../../riscv64-unknown-elf/sys-include"
#include "..." search starts here:
#include <...> search starts here:
 /home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/include
 /home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/include-fixed
 /home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/../../../../riscv64-unknown-elf/include
End of search list.
GNU C17 (GCC) version 10.0.0 20190929 (experimental) (riscv64-unknown-elf)
        compiled by GNU C version 4.8.5 20150623 (Red Hat 4.8.5-39), GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version none
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
Compiler executable checksum: bff2308ac495c30be0d25ad6caff4627
during RTL pass: combine
foo.c: In function ‘foo’:
foo.c:4:1: internal compiler error: in decompose, at rtl.h:2279
    4 | }
      | ^
0x556a3e wi::int_traits<std::pair<rtx_def*, machine_mode> >::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode> const&)
        ../../riscv-gcc/gcc/rtl.h:2277
0xbd4961 wi::int_traits<std::pair<rtx_def*, machine_mode> >::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode> const&)
        ../../riscv-gcc/gcc/wide-int.h:3102
0xbd4961 wide_int_ref_storage<std::pair<rtx_def*, machine_mode> >
        ../../riscv-gcc/gcc/wide-int.h:1032
0xbd4961 generic_wide_int<std::pair<rtx_def*, machine_mode> >
        ../../riscv-gcc/gcc/wide-int.h:790
0xbd4961 add<std::pair<rtx_def*, machine_mode>, std::pair<rtx_def*, machine_mode> >
        ../../riscv-gcc/gcc/wide-int.h:2422
0xbd4961 simplify_const_binary_operation(rtx_code, machine_mode, rtx_def*, rtx_def*)
        ../../riscv-gcc/gcc/simplify-rtx.c:4318
0xbd9cde simplify_binary_operation(rtx_code, machine_mode, rtx_def*, rtx_def*)
        ../../riscv-gcc/gcc/simplify-rtx.c:2156
0x1227a71 combine_simplify_rtx
        ../../riscv-gcc/gcc/combine.c:5804
0x122a492 subst
        ../../riscv-gcc/gcc/combine.c:5726
0x122a108 subst
        ../../riscv-gcc/gcc/combine.c:5667
0x122c4ed try_combine
        ../../riscv-gcc/gcc/combine.c:3422
0x12323d8 combine_instructions
        ../../riscv-gcc/gcc/combine.c:1305
0x12323d8 rest_of_handle_combine
        ../../riscv-gcc/gcc/combine.c:15066
0x12323d8 execute
        ../../riscv-gcc/gcc/combine.c:15111
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

System info: CentOS 7, uname -a: Linux XXXXXXXXX 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

documentation issue

@cliffordwolf
Hi Clifford,
I have a query regarding the documented behavior for fsl. The text says that

The fsl rd, rs1, rs2, rs3 instruction creates a 2 x XLEN word by concatenating rs1 and rs3 (with rs1 in the MSB half)

but from the pseudo code, it looks as though rs1 is in the lower half.
Can you clarify which is correct?
Thx
Lee

ANDC Bit Encoding

So I was looking at the encodings; a lot are not fully described, but this one seems obvious to me:
| ??????? | rs2 | rs1 | ??? | rd | 0110011 | ANDC

Here is the current AND:
| 0000000 | rs2 | rs1 | 111 | rd | 0110011 | AND

Why not define ANDC much like the shift right / arithmetic shift right pair? ADD to SUB is very similar as well. So using this bit to denote negation seems pretty intuitive:
| 0100000 | rs2 | rs1 | 111 | rd | 0110011 | ANDC

While probably not as useful as ANDC, it is also logically easy to extend complemented inputs to the other bitwise instructions like this:

| 0100000 | rs2 | rs1 | 100 | rd | 0110011 | XORC
| 0100000 | rs2 | rs1 | 110 | rd | 0110011 | ORC

Also, when it comes to hardware implementation, negation is used when converting a positive number to a negative number (2's complement). Therefore, it would be easy to reuse the same negation hardware if the same bit were used to denote negation. And keeping func3 the same means an ALU only has to worry about choosing whether to negate an input.
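
For clarity, a minimal sketch of what these encodings would compute, assuming the extra bit negates the second operand (uint_xlen_t as in the draft's pseudocode):

uint_xlen_t andc(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 & ~rs2; }
uint_xlen_t orc (uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 | ~rs2; }
uint_xlen_t xorc(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 ^ ~rs2; }  /* same result as XNOR */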

Implement "-march=rv32ib_Zbb_Zbc_...ZbX" ISA subset selection

Hi all,

I have put together a small patchset which implements command-line bitmanip ISA subset selection. I've tried to follow the current bitmanip draft spec as closely as possible. In particular, the 'B' bitmanip subset is taken as the one indicated by the extended dotted line (everything excluding Zbt, Zbf.)

Please also read the "Problems" section at the bottom.

The bitmanip spec currently defines 9 subgroups of instructions. I defined them as target flags residing in a new target variable called "x_riscv_bitmanip_flags", each called accordingly:

OPTION_MASK_BITMANIP_ZBB
OPTION_MASK_BITMANIP_ZBC
OPTION_MASK_BITMANIP_ZBE
OPTION_MASK_BITMANIP_ZBF
OPTION_MASK_BITMANIP_ZBM
OPTION_MASK_BITMANIP_ZBP
OPTION_MASK_BITMANIP_ZBR
OPTION_MASK_BITMANIP_ZBS
OPTION_MASK_BITMANIP_ZBT

When invoking gcc as follows:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c

the flag states become

ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1    1    1    0    1    1    1    1    0    1

If the user provides at least one sub-ISA specifier, then only the sub-ISA flags are honoured, e.g.:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib_Zbb -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c

will set

ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1    0    0    0    0    0    0    0    0    1

and all following sub-ISA specifiers are simply additive, e.g.:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib_Zbb_Zbf_Zbt -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c

will set

ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1    0    0    1    0    0    0    0    1    1

and so forth. The user must provide the 'b' keyword before adding any "ZbX" directives, and the "ZbX" directives must always follow directly after the 'b' directive. The "ZbX" directives must always have at least one set of underscores surrounding them. If there are multiple "ZbX" directives, they must come one after the other.

The "MASK_BITMANIP" target macro is still there, but it is not used in the bitmanip.md conditions.

There are 3 patches. Each can be applied without breaking the build, but they must come in the right order.
Patch 1: add riscv.opt "riscv_bitmanip_flags" variable, and associated masks.
Patch 2: modify bitmanip.md insns to only get generated if the associated subset mask is set.
Patch 3: modify riscv-common.c to accept "ZbX" form of subset ISA flags.

I've also provided a patch that applies all of them at once.

There is almost certainly some fiddling to be done with the riscv.c file as well, and I suspect there are a few corner cases in the parser, but this is useful as it is, and I would like to see what people think of the current implementation.

Output assembly arch attributes look like this:
.attribute arch, "rv32i2p0_b2p0_Zbb2p0_Zbc2p0_Zbt2p0_Zbp2p0"

Patches:
0001-add-bmi-subisa-march-opts.patch.txt
0002-add-bmi-subisa-march-bitmanip.patch.txt
0003-add-bmi-subisa-march-common.patch.txt
add-bmi-subisa-march-all.patch.txt

Problems:

  1. Canonical flags order.

No order of flags is currently enforced. What is the canonical order of the ZbX flags? Alphabetical? Or something else?

  2. Redundant flag names.

Unfortunately, GCC refuses to create target masks relative to a specified variable if a flag name is not provided. I am referring to the riscv.opt file.

Take for example the ZBB directive:

...
mbmi-zbb
Target Mask(BITMANIP_ZBB) Var(riscv_bitmanip_flags)
Support the base subset of the Bitmanip extension.
...

This causes the following code to be generated in build/gcc/options.h:

#define OPTION_MASK_BITMANIP_ZBB (HOST_WIDE_INT_1U << 0) // <<<<<<<<<<<<<<<<
#define OPTION_MASK_BITMANIP_ZBC (HOST_WIDE_INT_1U << 1)
#define OPTION_MASK_BITMANIP_ZBE (HOST_WIDE_INT_1U << 2)
#define OPTION_MASK_BITMANIP_ZBF (HOST_WIDE_INT_1U << 3)
#define OPTION_MASK_BITMANIP_ZBM (HOST_WIDE_INT_1U << 4)
#define OPTION_MASK_BITMANIP_ZBP (HOST_WIDE_INT_1U << 5)
#define OPTION_MASK_BITMANIP_ZBR (HOST_WIDE_INT_1U << 6)
#define OPTION_MASK_BITMANIP_ZBS (HOST_WIDE_INT_1U << 7)
#define OPTION_MASK_BITMANIP_ZBT (HOST_WIDE_INT_1U << 8)
#define MASK_DIV (1U << 0)
#define MASK_EXPLICIT_RELOCS (1U << 1)
#define MASK_FDIV (1U << 2)
#define MASK_SAVE_RESTORE (1U << 3)
#define MASK_STRICT_ALIGN (1U << 4)
#define MASK_64BIT (1U << 5)
#define MASK_ATOMIC (1U << 6)
#define MASK_BITMANIP (1U << 7)
#define MASK_DOUBLE_FLOAT (1U << 8)
#define MASK_HARD_FLOAT (1U << 9)
#define MASK_MUL (1U << 10)
#define MASK_RVC (1U << 11)
#define MASK_RVE (1U << 12)

We probably don't want to expose this dual method of specifying subisas, so let's try removing the -mbmi-zbb name from riscv.opt:

...
Target Mask(BITMANIP_ZBB) Var(riscv_bitmanip_flags)
...

This causes the following output in build/gcc/options.h:

#define OPTION_MASK_BITMANIP_ZBC (HOST_WIDE_INT_1U << 0)
#define OPTION_MASK_BITMANIP_ZBE (HOST_WIDE_INT_1U << 1)
#define OPTION_MASK_BITMANIP_ZBF (HOST_WIDE_INT_1U << 2)
#define OPTION_MASK_BITMANIP_ZBM (HOST_WIDE_INT_1U << 3)
#define OPTION_MASK_BITMANIP_ZBP (HOST_WIDE_INT_1U << 4)
#define OPTION_MASK_BITMANIP_ZBR (HOST_WIDE_INT_1U << 5)
#define OPTION_MASK_BITMANIP_ZBS (HOST_WIDE_INT_1U << 6)
#define OPTION_MASK_BITMANIP_ZBT (HOST_WIDE_INT_1U << 7)
#define MASK_DIV (1U << 0)
#define MASK_EXPLICIT_RELOCS (1U << 1)
#define MASK_FDIV (1U << 2)
#define MASK_SAVE_RESTORE (1U << 3)
#define MASK_STRICT_ALIGN (1U << 4)
#define MASK_64BIT (1U << 5)
#define MASK_ATOMIC (1U << 6)
#define MASK_BITMANIP (1U << 7)
#define MASK_DOUBLE_FLOAT (1U << 8)
#define MASK_HARD_FLOAT (1U << 9)
#define MASK_MUL (1U << 10)
#define MASK_RVC (1U << 11)
#define MASK_RVE (1U << 12)
#define MASK_BITMANIP_ZBB (1U << 13) // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

It's now relative to the general-purpose target_flags variable, rather than the riscv_bitmanip_flags. Is there some magic combination of GCC option properties to get around this?

SROW/SLOW shift masking

@cliffordwolf
Hi Clifford
In the spec, the pseudo code for SROW is as follows:

int shamt = rs2 & (XLEN - 1);
return ~(~rs1 >> shamt);

In the case of SROW, is XLEN the length of the machine (64 bit) or that of the target of the operation (32 bit)?

In other words, if
rs2 = 0xFFFF_FFFF_FFFF_FFFF
is the shift amount 31 or 63?
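
The two readings side by side (illustrative only; which one the spec intends is exactly the question):

#include <stdint.h>

/* Possible shift amounts for SROW on RV64 when rs2 = 0xFFFF_FFFF_FFFF_FFFF. */
void srow_shamt_readings(uint64_t rs2, int *shamt_if_32, int *shamt_if_64)
{
    *shamt_if_32 = (int)(rs2 & (32 - 1));  /* XLEN read as the 32-bit operation width -> 31 */
    *shamt_if_64 = (int)(rs2 & (64 - 1));  /* XLEN read as the 64-bit machine width   -> 63 */
}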

Benchmark - Variable Length Unary Integer Coding (clz/ctz)

This might be useful:

VLU(...) is a little-endian variable-length integer coding that prefixes data bits with unary code length bits. The length is recovered by counting the least significant set bits, which encode a count of n-bit basic units. The data bits compactly trail the unary code prefix.

  • encode uses clz
  • decode uses ctz

With an 8 bit basic unit, the encoded size is similar to LEB128; 7-bits can be stored in 1 byte, 56-bits in 8 bytes and 112-bits in 16 bytes. Decoding, however, is significantly faster than LEB128, as it is not necessary to check for continuation bits every byte, instead the length can be decoded in a single count bits operation.

While VLU is not in major use, it could be substituted where LEB128 is used with reasonably significant benefits depending on the frequency of variable length fields. LEB128 probably performs similarly to VLU on a machine without bit scan forward and reverse. There are also potential SIMD or vector optimisations. For example, a decoder could have a predictor, and switch from a "per field" mode to a set of optimized modes. e.g. 128-bit SIMD code for parallel decoding of 16 x 7-bit fields.

There is symmetry in the encode and decode, with clz for figuring out the size of a word, and ctz to read the /prefix/ from the little-end. The code is a pretty good example of why little-endian makes more sense. The benchmarks currently perform decoding of 8-bit through to 56-bit and there is an optimized decoder for x86-64 BMI. I am investigating x86 SIMD and want to add support for big numbers. 112-bits and >= 128-bits.
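
A hedged sketch of the decode step with an 8-bit basic unit (my own illustration of the scheme described above, not the benchmark's code):

#include <stdint.h>

/* The trailing run of 1 bits gives the number of bytes beyond the first; the
   data bits sit immediately above the unary prefix.  Covers payloads up to
   56 bits and assumes the prefix contains at least one 0 bit. */
static uint64_t vlu8_decode(uint64_t word, int *encoded_bytes)
{
    int t = __builtin_ctzll(~word);                  /* count of trailing 1 bits (ctz) */
    *encoded_bytes = t + 1;
    uint64_t data_bits = 7ull * (uint64_t)(t + 1);   /* 7 payload bits per byte        */
    return (word >> (t + 1)) & ((1ull << data_bits) - 1);
}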

Behavior of *w instructions

The base ISA defines ADDW and SUBW like so:

ADDW and SUBW are RV64I-only instructions that are defined analogously to ADD and SUB but operate on 32-bit values and produce signed 32-bit results. Overflows are ignored, and the low 32-bits of the result is sign-extended to 64-bits and written to the destination register.

The Bitmanip extension refers to clzw, ctzw, pcntw, etc. but doesn't actually define how they work. The pseudocode is only defined for the instructions that operate on data of size XLEN.

Sophisticated readers understand that the *w instructions in Bitmanip are almost certainly supposed to behave like the *W instructions in the base architecture (operating on 32-bit data and then sign-extending the result to XLEN). However, this should be explicitly addressed with some verbiage and pseudocode so that everything is fully specified and unambiguous.
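
A sketch of the presumed semantics for clzw, following the base ISA's *W convention (my reading, which is precisely what the issue asks to have spelled out):

uint_xlen_t clzw(uint_xlen_t rs1)
{
    uint32_t x = (uint32_t)rs1;   /* operate on the low 32 bits only */
    int count = 0;
    for (int i = 31; i >= 0; i--) {
        if ((x >> i) & 1)
            break;
        count++;
    }
    /* the result 0..32 is non-negative, so sign-extending it to XLEN is trivial */
    return (uint_xlen_t)count;
}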

CRC Polynomials and Encoding

I am not sure that hard coding the polynomials for CRC is a good idea. It seems pretty constrained to add a CRC instruction but only support 2 polynomials. That said, the fact that it does have 0xedb88320 (what I call the IEEE polynomial) and the Castagnoli polynomial 0x82f63b78 does cover a large number of uses.

However, here is, for instance, a list of polynomials from Philip Koopman; people may use different ones depending on the application.
https://users.ece.cmu.edu/~koopman/crc/index.html

Honestly, I think using funct7 | rs2 | rs1 | f3 | rd | opcode | R-type would be better than the unary format. This would allow one to load polynomials from rs2. It would also only need 2 bits in funct7 for B, H, W, and D. Also a single bit in funct7 could select between 0xedb88320 and 0x82f63b78. There are two ways I could see handling the predefined polynomials.

For instance, if we wanted to default to the IEEE polynomial, we could take advantage of the zero register by changing the pseudo code to the following. This would also allow anyone to load any other polynomial from a register, but it must be XOR'ed with 0xedb88320 beforehand. For instance, if I wanted to use the polynomial 0xeb31d82e, XOR'ing gives me 0x06895b0e, which I can then use as the constant I load into the rs2 register.

uint_xlen_t crc32(uint_xlen_t rs1, uint_xlen_t rs2, int nbits) {
    for (int i = 0; i < nbits; i++) 
        rs1 = (rs1 >> 1) ^ ( ( 0xEDB88320 ^ rs2 ) & ~((rs1 & 1) - 1)); 
    return rs1; 
}

The other option would be to use a predefined polynomial when rs2 is the zero register. This avoids XOR'ing the desired polynomial. Although, I am not sure XOR'ing is a problem, since it can easily be done once, outside of run time, for a given polynomial.

Possible decode error for FSRI ?

@cliffordwolf
Hi Clifford,
I wonder if you could clarify something.
In my model containing the B extensions, I am getting a decode conflict reported for FSRI during our static analysis phase; could you clarify this for me?
Here are the decode definitions


    DECODE_ENTRY(0, RORI,     "|01100.......|.....|101|.....|0010011|");
    DECODE_ENTRY(0, SBEXTI,   "|01001.......|.....|101|.....|0010011|");
    DECODE_ENTRY(0, SROI,     "|00100.......|.....|101|.....|0010011|");
    DECODE_ENTRY(0, FSRI,     "|.....1......|.....|101|.....|0010011|");

A '.' indicates a wildcard, so as you can see, FSRI overlaps with RORI, SBEXTI and SROI:
FSRI bit [26] is part of the immediate value in the RORI, SBEXTI and SROI instructions, and
FSRI[31:27] is rs3, but is part of the decode in RORI, SBEXTI and SROI.

What are your thoughts?
Could this be a documentation error, and FSRI should be
DECODE_ENTRY(0, FSRI, "|.....1......|.....|101|.....|0110011|");
not
DECODE_ENTRY(0, FSRI, "|.....1......|.....|101|.....|0010011|");

Thx
Lee

sbclri: invalid format specifier

Hi all,

In the proposed patch for bitmanip Zbs family of instructions found here, the format specifier for sbclri doesn't work right because we're expecting a const_int with one bit low.

So, in order for `ctz_hwi' to work right, we need to invert the operand first, and for that we need a separate format specifier (or amend the const_int opcode in-place in the .md file, but I chose to do the former.)

The patch:
gcc-b-support-fix-sbclri.patch.txt

Testcase:

unsigned int
f(unsigned int a)
{
        return a & ~(1 << 29);
}

Before:

f:
        sbclri  a0,a0,0
        ret

After:

f:
        sbclri  a0,a0,29
        ret

Problem with installing the bit manipulation gcc compiler

Hello guys,
Recently I struggled with building the bit manipulation patch for the gcc compiler. When running the command bash gcc-build.sh (after, of course, configuring the shell file), I get this error about a missing configuration:
[screenshot of the missing-configuration error]

Proposal: func7 "Quadrants" for OP[32/128][+/-IM] opcode family

RISC V context for this proposal

This "OP Quadrant" proposal below has global implications for RISC V instruction encoding, but I propose it here in BitManip, as this is the first extension (other than 'M') to need such organisation within func7. (func7 & func3 have the usual meaning for the R-type instruction format).

Note: the RISC V user ISA spec explicitly states that RV128 may introduce new 128 bit instructions into an OP128 major opcode, which is the reverse of what happened for RV64. I assume this will be the case, in discussion below.

Why Quadrants are needed

"Contiguous" reserved opcode space is a precious resource. RISC V has only three reserved major opcodes left for future standard extensions.

Up to now, the only values of func7 for instructions within the OP-INT family of major opcodes are 0b0000000, 0b0000001 (MUL/DIV), and 0b0100000 (SUB/SRA). Bitmanip will substantially expand the usage of func7 values. It is important this is done in a rational way, as func7 values chosen within OP will also have major side effects on OP-IM, OP32[IM], and OP128[IM].

Within OP32 and OP128, up to 50% of these major opcodes are available as contiguous reserved space (for func3 values = 0bX1X, where X = 0 or 1, i.e. values that do not correspond to any "Q" or "W" instruction). Care needs to be taken not to punch "holes" into this space. (Unfortunately, two "M" instructions break this rule in OP32, reducing OP32 contiguous free opcode space slightly.)

Problems with proposed v0.90 BitManip encoding

The current BitManip v0.90 encoding proposal is a bit problematic in this regard, as it punches "holes" into "non-W" sections of OP32. These non-W sections otherwise form part of an unused 50% of OP32/OP128, and scattered holes within them will limit their long-term usefulness for other future extensions. (An example of a "hole" created in OP32 is BDEPW, which has a func3 value of 0b010.)

BitManip v0.90 also unnecessarily introduces a new two source R-type format specifically for one instruction, FSRI, which moves the rs2 register field to a new position. This will complicate implementation of superscalar out-of-order microarchitectures, and breaks the existing RISC V approach of keeping rs1 and rs2 in the same positions for every relevant instruction.

Why a Quadrant division is intrinsically imposed onto func7 organisation

The choice of 4x32 value Quadrants is not an arbitrary choice. It is in fact fundamental to the organisation of RV32, RV64 and RV128.

There is a 32 x value func7 constraint for I-type shift instructions with a 7 bit (RV128) immediate field. For RV64, I-type shift instructions have a 6 bit immediate field and can encode 64 values in their remaining instruction bits, hence translating into a 64 x value func7 constraint. (Hence Quadrants A & B need to be created for these distinct 2x32 value subsets of func7.)

Also, dividing up func7 into Quadrants is natural for ternary instructions, as blocks of 32 x func7 values are needed to introduce an "rs3" instruction format (hence Quadrant "D" needs to be created for such rs3-type instructions).

Quadrants in detail

Below is an outline of how func7 should be structured into Quadrants A-D, based on the last two bit values of func7 (shown below as ' | 00' to ' | 11' ):

Quadrant A1 (n=1): instructions with func7 = 0b00000 | 00

  • have matching I-type instruction in OP-IM/OP-32IM/OP-128IM
  • have matching "W" or "Q" instruction in OP-32/OP-128 if func3 = 0bX0X
  • does NOT have matching "W" or "Q" instruction in OP-32/OP-128 if func3 = 0bX1X

Quadrant A2 (n=29): instructions with func7 in range 0b00001 | 00 to 0b11101 | 00

  • (for func3=0bX01 only) have matching I-type instruction in OP-IM/OP-32IM/OP-128IM
  • have matching "W" or "Q" instruction in OP-32/OP-128 if func3 = 0bX0X
  • does NOT have matching "W" or "Q" instruction in OP-32/OP-128 if func3 = 0bX1X

Quadrant A3 (n=2, but could grow if needed): instructions with func7 = 0b1111X | 00

  • (if func3=0bX01) unary instructions within OP[32/64/128]IM, ie: the lower 5 bits of imm12 field is replaced by a func5 operand, which specifies a unary function operating on rs1 and storing result in rd. The func5 operand can specify 32 unary functions, of which func5 values 0b00000 & 0b00001 are reserved for functions which are derived from taking the corresponding two input Quadrant A3 OP function, and applying the value of zero or one to one of the two inputs (to yield a unary function). The remaining 30 unary functions can be arbitrary unary functions.
  • otherwise the rules for Quadrant A2 also apply to Quadrant A3

Quadrant B (n=33): func7 value in range 0b00000 | 10 to 0b11111 | 10

  • same as Quadrant A, except does not have corresponding I-type instructions for OP128IM.

Quadrant C (n=32): func7 value in range 0b00000 | 01 to 0b11111 | 01

  • currently used only by MUL/DIV, which (unfortunately) punches a hole in unused OP32 opcode space, by putting W version instructions into func3=0bX1X. (Maybe OP128 can avoid doing this in future, to reserve a fully contiguous half of the OP128 major for non-"Q" instruction uses).
  • does not have matching I-type instruction
  • can have matching "W" or "Q" instruction in OP-32/OP-128 for any func3 value

Quadrant D (n=32): func7 value in range 0b00000 | 11 to 0b11111 | 11

  • reserved for ternary functions (ie: instructions with an additional rs3 field, or with an additional 5 bit immediate operand). In this case, the FSRI instruction (with 64 shift range) can be replaced with FSLI and FSRI instructions, each with a 32 shift range.
  • rs3 field exists if func3=0bXX1 otherwise imm5 exists if func3=0bXX0 (note the instruction is still placed within OP, and not OP-IM despite the existence of imm5 as there are two source register inputs)
  • can have matching "W" or "Q" instruction in OP-32/OP-128 for func3 values = 0bX0X

Example BitManip encoding using Quadrants

Below is an example of how the above quadrants can be used to organise the BitManip proposed instructions:

(fields: func7 | rs2 | rs1 | rd | opcode; instruction columns ordered by func3: 000, 100, 001, 101, 010, 011, 110, 111)

Group A1  00000.00     rs2 rs1 rd 0110011   ADD XOR SLL SRL SLT SLTU OR AND
Group A2  01000.00     rs2 rs1 rd 0110011   SUB XNOR SBINV SRA ORN ANDN
          00001.00     rs2 rs1 rd 0110011   ADDU.W PACK SBSET GREV MIN MINU
          01001.00     rs2 rs1 rd 0110011   SUBU.W SBCLR SBEXT MAX MAXU
          00010.00     rs2 rs1 rd 0110011   ROL ROR SLO SRO
          01010.00     rs2 rs1 rd 0110011   BDEP BEXT SHFL UNSHFL
Group A3  11111.00     rs2 rs1 rd 0110011   CLMUL CLMULR CLMULH
Group B   xxxxx.10
Group C   00000.01     rs2 rs1 rd 0110011   MUL DIV MULH DIVU MULHSU MULHU REM REMU
Group D   rs3/imm5.11  rs2 rs1 rd 0110011   FSLI FSRI FSL FSR CMOVI CMOV CMIX

Note 1: OP-IM, OP32 and OP-32IM are not shown, as these are automatically implied by the quadrant in which each instruction is added.
Note 2: RORI is not included, as it can be replaced by FSLI/FSRI; bitmatrix instructions are not shown, as these are RV64-only and best placed in OP32 with func3=0bX1X.
Note 3: Unary instructions are not shown; they are placed into OP-IM in the slot occupied by CLMULH (i.e. Group A3 with func3=0bX01).
