riscv / riscv-bitmanip

Working draft of the proposed RISC-V Bitmanipulation extension

Home Page: https://jira.riscv.org/browse/RVG-122

License: Creative Commons Attribution 4.0 International



riscv-bitmanip's Issues

Build riscv64-unknown-linux-gnu- tools

Is there any way to build the Linux toolchain with B support?

I took a look at the scripts, and it looks like you could add a build step for the Linux tools(?).

minor bitmanip patch required on next gcc rebase

I just committed a patch upstream to optimize a zero-extend followed by an array indexing left shift. This was sometimes three instructions, now it is two. This added a pattern zero_extendsidi2_shifted that is identical to the bitmanip branch slliuw pattern, except that it splits into two shifts instead of emitting an slliuw instruction. In order for the slliuw pattern to continue working, this upstream pattern will need a ! TARGET_BITMANIP check added to its condition. This issue is just to document the problem for when we eventually rebase later.

Make instruction names consistent for B and V extensions

In a few instances, the current drafts for the B and V extensions have instructions that give different names to the same operation.

The B extension has PCNT, while the V extension has an instruction called VMPOPC, where "POPC" also stands for population count.

The B extension has ANDC, while the V extension has VMANDNOT and VMORNOT.

For the examples above, one or the other extension (or both) must be changed to avoid gratuitous inconsistencies.

Furthermore, the B extension has CMOV, while the V extension has VMERGE that performs the same function element-wise for vectors. (My preference would be to have SELECT and VSELECT, or SEL and VSEL, but if those are impossible, I propose renaming CMOV as MERGE.)

Finally, the B extension has BEXT which operates on bits, while the V extension has VCOMPRESS that performs the same function on vector elements.

Missing library file "vsupport.h"

I'm trying to compile examples with the compiled toolchain from the riscv repository. However, I could not find said file in any repository here. Where does it come from?

sh1addu.w doesn't work

In the rvb_simple EU:
When executing sh1addu.w, shadd_active == 1 and wuw_active == 1 at the same time,
so rd = shadd_out | wuw_dout.

wuw_dout == 1 because din_insn14 == 0. In sh2addu.w and sh3addu.w, din_insn14 == 1, so they work.

Exceptions

@cliffordwolf
Hi Clifford
For an XLEN=32 or XLEN=64 implementation, should the following instruction raise an exception?

sbseti x1, x2, 127

Thx
Lee

Discussion on *WU instructions

Continuation of off-topic discussion in #10.

Quick summary:

@jhauser-us:

[T]here's a whole category of instructions that would have more impact but aren't currently included, and that is the unsigned equivalents of the existing RV64I *W instructions: ADDWU, SUBWU, SLLWU, etc. These would be just like the existing *W instructions but instead zeroing the upper 32 bits, as appropriate for an unsigned int or uint32_t result type rather than int or int32_t. We hardly need to run any experiments to know that such *WU instructions would be used far more frequently than the *W instructions proposed for the B extension.

@cliffordwolf:

The idea behind *W instructions is simply that they operate on the lower 32 bits. It makes sense to have a consistent scheme for how to fill the upper bits, but it doesn't matter much what this scheme is exactly, if that scheme is sign-extend or zero-extend.

@brucehoult:

If you provide *W operations you have to decide what to do with the upper bits. You can leave them alone (x86), zero extend them (Aarch64), or sign extend them (RISC-V). If you leave them alone then casting a 32 bit value to a 64 bit value requires a sext or zext every time. If you sign extend them then only unsigned values require a zext, signed ones are already correct. If you zero extend them then only signed values require a sext, unsigned ones are already correct. It's hard to say which is better. Most normal application code uses signed more than unsigned, favouring sign extension.

@jhauser-us:

Andrew has told me he wants to keep open the option for *WU instructions for now, which implies to me we should devise a system to reserve the encoding space now, even if the idea eventually gets dropped. I haven't run this particular system by him yet, but I intend to do so soon.

Naturally, we would want any system that's adopted to be fully compatible with the B extension, by tweaking either or both as necessary. I'll be looking into this question soon. And now you or anyone else can do so too, if you're so inclined.

But this is getting off-topic for this GitHub issue, so we should move the discussion elsewhere if you'd like to continue.

@brucehoult:

I agree that there is merit in systematically adding *WU versions of R-type instructions that have *W versions. If you do it at all then it should be for BOTH the base instruction set and for any new *W instructions in the BitManip extension.

This is easily affordable in terms of opcode space by, as has been pointed out, using something in the hi bits of the instruction, keeping the identical opcode and func3 to the *W version. It's also very cheap to implement.

Probably the only OP-IMM-32 instruction that can be justified is ADDIWU. That does need a new opcode.

@jhauser-us:

Interestingly, in a message to me yesterday where we talked about possible *WU instructions, Andrew literally wrote:

I’m not opposed to putting them in B.

To be sure, that expresses ambivalence more than advocacy. But as far as "could be persuaded" goes, I believe yes.

After that I conceded that there is a reasonable way to add *WU instructions and described the encoding that I would prefer.

I hope everyone feels like this summary treats them fairly. Please post your corrections below.
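
As a concrete illustration of the trade-off being discussed (my own hedged example, not from the thread): on RV64, a 32-bit unsigned intermediate that feeds a 64-bit computation must have its upper 32 bits cleared, which is exactly what a *WU instruction would do in one operation.

#include <stdint.h>

/* Hypothetical example: unsigned 32-bit arithmetic whose result is consumed as
   a 64-bit value.  With only sign-extending *W instructions, the compiler has
   to insert an explicit zero-extension of i before the address calculation;
   an ADDWU-style instruction would fold that in. */
uint64_t index_sum(const uint64_t *table, uint32_t a, uint32_t b)
{
    uint32_t i = a + b;   /* 32-bit wrap-around semantics */
    return table[i];      /* i must be zero-extended to 64 bits here */
}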

Need better support for signed bytes and halfwords?

With the current draft of the B extension, RISC-V has single instructions that can be used to zero-extend a byte (ANDI), halfword (PACKW), or word (PACK or ADDIWU) to full 64-bit register width. A word can also be sign-extended in one instruction (ADDIW). But, unless I'm mistaken, there is still no single instruction that can sign-extend a byte or halfword to 64 bits. If there are constituencies out there that make frequent use of signed char and short types (embedded applications with limited memory, perhaps?), such instructions might get more use overall than others that are being included.

Similarly, for reading big-endian data, we currently have the ability in only two instructions to load and byte-swap an unsigned halfword (LHU + BSWAP.H), an unsigned word (LWU + BSWAP.W), or a signed word (LWU + GREVIW), but not a signed halfword, which takes three instructions.

bitmanip opcode encoding table: FSRI overlap SBEXTI, GORCI, GREVI, RORI, SROI, SRAI, SRLI?

Hi,
In the opcode encodings table, FSRI shows that bit 26 needs to be 1. Can we add to the table that for SBEXTI, GORCI, GREVI, RORI, SROI, SRAI, and SRLI, bit 26 needs to be 0? Otherwise it looks like there is overlap in the encodings. I see the text above the table mentions that op[26]=1 selects funnel shifts, but it might be helpful to show this in the table as well.

Thanks,
Dan

Suggest renaming GREVI pseudo-instructions to mirror ZIP/UNZIP

The pseudo-instructions defined for GREVI (BREV.P, PSWAP.N, etc.) are like the ZIP and UNZIP pseudo-instructions in that they move "units" of a power-of-two size within "components" of a larger power-of-two size. For the ZIP/UNZIP pseudo-instructions, there is a simple pattern of

ZIP<unit-size><component-suffix>
UNZIP<unit-size><component-suffix>

In this system, <unit-size> is either empty, meaning 1 bit, or is a decimal number of bits ("2", "4", "8", or "16"); and <component-suffix> is either empty, meaning the full register size (XLEN), or is one of the suffixes '.N', '.B', '.H', or '.W'.

It would aid comprehension if the GREVI pseudo-instructions followed the same system. I propose

REV<unit-size><component-suffix>

This would rename all of the GREVI pseudo-instructions as follows:

BREV.P  -> REV.P
PSWAP.N -> REV2.N
BREV.N  -> REV.N
NSWAP.B -> REV4.B
PSWAP.B -> REV2.B
BREV.B  -> REV.B
BSWAP.H -> REV8.H
NSWAP.H -> REV4.H
PSWAP.H -> REV2.H
BREV.H  -> REV.H
HSWAP.W -> REV16.W
BSWAP.W -> REV8.W
NSWAP.W -> REV4.W
PSWAP.W -> REV2.W
BREV.W  -> REV.W
WSWAP   -> REV32
HSWAP   -> REV16
BSWAP   -> REV8
NSWAP   -> REV4
PSWAP   -> REV2
BREV    -> REV

If the name "BSWAP" is so entrenched that we feel we must have this mnemonic, then BSWAP can be made another pseudo-instruction alias for REV8.

preprocessor macro for bitmanip

The gcc patch should define a preprocessor macro so end users can check to see if bitmanip support is enabled for the target. I would suggest __riscv_bitmanip since the rest of the code seems to be using bitmanip consistently.

See also the riscv/riscv-c-api-doc repo where we are documenting C API issues like preprocessor macros. I'm filing a pull request there to suggest __riscv_bitmanip which can be changed if someone has a better suggestion.
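
A minimal usage sketch, assuming the macro ends up being named __riscv_bitmanip (the fallback path is purely illustrative):

#include <stdint.h>

uint32_t count_bits(uint32_t x)
{
#if defined(__riscv_bitmanip)
    /* B extension enabled for the target: let the compiler emit pcnt */
    return (uint32_t)__builtin_popcount(x);
#else
    /* portable fallback when the extension is not available */
    uint32_t n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
#endif
}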

Bring sanity to source operand order for ternary instructions

Please, please, please, in the assembly language for the ternary instructions, do not place the control operand of CMIX between the other two source operands, do not place the condition operand of CMOV between the other two source operands, and do not place the shift amount for funnel shifts (FSL and FSR) between the other two source operands.

I understand the hardware motivation for having the control and shift amounts be in rs2, which forces the other two operands to be rs1 and rs3. But it would be better to define the assembly language for these instructions as

CMIX rd,rs2,rs1,rs3
CMOV rd,rs2,rs1,rs3
FSL  rd,rs1,rs3,rs2
FSR  rd,rs1,rs3,rs2

Whatever extra trouble a nonlinear operand order might cause for tools authors, it is nothing compared to the multiplicative effect of foisting an illogical order on programmers. Let's not forget, there are literally thousands of programmers for every tools author, and we'd prefer as often as possible to help those programmers write bug-free code.

I note that assembly language pseudo-instructions already provide some precedent for breaking a definite connection between operand order and source register numbers. Store instructions are another existing exceptional case, being written as

SW rs2,offset(rs1)

and not

SW rs1,rs2,offset

On some INSTW or INST.W instruction definitions

There are some INSTW or INST.W instructions, such as:

packw rd, rs1, rs2

and, quoting the spec source:

{\tt addu.w} and {\tt subu.w} are identical to {\tt add} and {\tt sub}, except ...

W instructions are meant to keep 32-bit computations on an RV64 machine. Are the following properties valid?

  1. Property A: INST.W instructions on RV64 generate the same result as INST ones on RV32.
  2. Property B: INST.W instructions always sign-extend the lower 32 bits of the result.

In the current bitmanip specs, these two properties seem not to be valid; for example, shnaddu.w, adduw, and subuw return 64-bit results, not the sign-extended lower 32 bits.

spike outputs "args unknown" on bitmanip instruction

Hi,

I got the output below when using spike to run a B extension test.

core 0: 0xffffffff80001742 (0x60191b93) ctz (args unknown)
core 0: 0xffffffff80001746 (0x61a01a33) rol (args unknown)
core 0: 0xffffffff8000174a (0x003199a3) sh gp, 19(gp)
core 0: 0xffffffff8000174e (0x406eeeb3) orn (args unknown)
core 0: 0xffffffff80001752 (0x41f6fbb3) andn (args unknown)
core 0: 0xffffffff80001756 (0x0aeef133) maxu (args unknown)

Here is my command.
spike --isa=rv32imcb -l test.o

Any suggestion? Thanks

About register content and operand order in pack/packu/packh

This is the ARM pack instruction:
PKHBT Rd, Rn, Rm ## Rd = Rm[31:16]|Rn[15:0], Bottom of Rn, Top of Rm
PKHTB Rd, Rn, Rm ## Rd = Rn[31:16]|Rm[15:0], Top of Rn, Bottom of Rm
I can easily tell which part is taken from Rn/Rm.

In pack instruction format:
pack rd, rs1, rs2

rd = rs2[15:0]|rs1[15:0]

which is kind of the reverse of the order in the assembly expression.

Also in funnel shift
fsr rd, rs1, rs3, rs2

tmp[63:0] = [rs3, rs1] >> rs2

rd = tmp[31:0]

Is it a little-endian concept?
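
For reference, the pack semantics in the draft's pseudocode style (my paraphrase; uint_xlen_t and XLEN as used elsewhere in the spec): rs1 supplies the lower half of rd and rs2 the upper half, so the assembly order rd, rs1, rs2 lists the parts from least to most significant.

uint_xlen_t pack(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t lower = rs1 << XLEN/2 >> XLEN/2;  /* low half of rs1           */
    uint_xlen_t upper = rs2 << XLEN/2;            /* low half of rs2, moved up */
    return lower | upper;
}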

Add sext.h and sext.b (Zbb)

Moving this discussion here from
https://groups.google.com/a/groups.riscv.org/forum/?utm_medium=email&utm_source=footer#!msg/isa-dev/0emw3Y8ZNxY/eUT5_IzaAwAJ.

The proposal is to add dedicated sext.h and sext.b instructions.

uint_xlen_t sext_h(uint_xlen_t rs)
{
    int shamt = XLEN - 16;
    return sra(sll(rs, shamt), shamt);
}

uint_xlen_t sext_b(uint_xlen_t rs)
{
    int shamt = XLEN - 8;
    return sra(sll(rs, shamt), shamt);
}

The encoding cost would be minimal because these are unary instructions.

The hardware cost would be acceptable, if the instruction is being used.

The main argument for this instruction is that the RISC-V calling convention requires arguments < 32 bit to be sign/zero extended according to their type. For example:

extern "C" int foo(short);

int bar(int a, int b) {
    return foo(a+b);
}

This is compiled to the following without the B extension:

bar(int, int):
        addw    a0,a0,a1
        slliw   a0,a0,16
        sraiw   a0,a0,16
        tail    foo

And it could be compiled to the following with sext.h:

bar(int, int):
        addw    a0,a0,a1
        sext.h  a0,a0
        tail    foo

The expectation is that function arguments < 32 bit may be common in code that is ported to RISC-V from smaller 8-bit or 16-bit micro controllers.

With those instructions added we would be able to zero-extend or sign-extend any 8-, 16-, or 32-bit value in a single instruction:

Width   sign extend          zero extend
8       sext.b rd,rs         andi rd,rs,255
16      sext.h rd,rs         pack[w] rd,rs,zero
32      addw rd,rs,zero      pack rd,rs,zero

Most *W instructions don't have sufficient justification

One of the frequently asked questions listed in the document is:

Do we really need all the *W opcodes for 32 bit ops on RV64?

In my opinion, the only *W instructions in the B extension that might have significant value are the rotate instructions, RORW, ROLW, RORIW, and maybe PACKW. To help decide, the document proposes running "proper experiments with compilers that support those instructions". That would be ideal of course, but I'm skeptical the community will wait on the B extension long enough for that to happen. (Plus it's not exactly easy to set up unbiased experiments for these kinds of specialized features.)

In the meantime, I'd like to point out that most of the proposed *W instructions can be substituted by a sequence of only 2 or 3 other instructions. The following are believed to be equivalent sequences (not always unique):

CLZW rd,rs

SLOI rd,rs,32
CLZ rd,rd

CTZW rd,rs

LI temp,-1
PACK rd,temp,rs
CTZ rd,rd

PCNTW rd,rs

PACK rd,zero,rs
PCNT rd,rd

SLOIW rd,rs1,i

SLOI rd,rs1,i
SEXT.W rd,rd

SLOW rd,rs1,rs2

ANDI temp,rs2,31
SLO rd,rs1,temp
SEXT.W rd,rd

SROIW rd,rs1,i

If i > 0:

SLLI rd,rs1,32
SROI rd,rd,(i+32)

If i = 0:

SEXT.W rd,rs1

SROW rd,rs1,rs2

NOT temp,rs1
SRLW rd,temp,rs2
NOT rd,rd

GREVIW rd,rs1,i

GREVI rd,rs1,i
SEXT.W rd,rd

GREVW rd,rs1,rs2

ANDI temp,rs2,31
GREV rd,rs1,temp
SEXT.W rd,rd

SHFLIW rd,rs1,i

SHFLI rd,rs1,i
SEXT.W rd,rd

SHFLW rd,rs1,rs2

ANDI temp,rs2,15
SHFL rd,rs1,temp
SEXT.W rd,rd

UNSHFLIW rd,rs1,i

UNSHFLI rd,rs1,i
SEXT.W rd,rd

UNSHFLW rd,rs1,rs2

ANDI temp,rs2,15
UNSHFL rd,rs1,temp
SEXT.W rd,rd

BEXTW rd,rs1,rs2

If rs2 is a known constant, rs2[63:32] = 0, and at least one bit in rs2[31:0] is a zero (very likely):

BEXT rd,rs1,rs2

If rs2 is a known constant, rs2[63:32] != 0 (unlikely), and at least one bit in rs2[31:0] is a zero:

PACK temp,zero,rs2
BEXT rd,rs1,temp

If rs2 is a known constant and rs2[31:0] = 0xFFFFFFFF (unlikely):

SEXT.W rd,rs1

If rs2 is not a known constant:

PACK temp,zero,rs2
BEXT rd,rs1,temp
SEXT.W rd,rd

BDEPW rd,rs1,rs2

If rs2 is a known constant and rs2[63:31] = 0:

BDEP rd,rs1,rs2

Else:

BDEP rd,rs1,rs2
SEXT.W rd,rd

CLMULW rd,rs1,rs2

CLMUL rd,rs1,rs2
SEXT.W rd,rd

FSLW rd,rs1,rs2,rs3

PACK rd,rs3,rs1
FSL rd,rd,rs2,rd
SEXT.W rd,rd

FSRW rd,rs1,rs2,rs3

PACK rd,rs1,rs3
FSR rd,rd,rs2,rd
SEXT.W rd,rd

Note that a final SEXT.W rd,rd can be eliminated if the rd result is known to be used only in subsequent 32-bit operations (such as SW or other *W instructions). Other optimizing tweaks are also possible, depending on the circumstance.

Unless the proposed *W instructions can be shown to be much more prevalent than I expect, the combination of rare utility plus relatively easy synthesis from other instructions argues strongly for dropping them.

The document also says:

But they add very little complexity to the core. So the only question is if it is worth the encoding space.

While "very little complexity" may be true, I disagree that it should be dismissed and only encoding space considered. "Very little complexity" certainly lowers the threshold of utility an instruction must demonstrate to be acceptable, but it doesn't make the instruction free to add. There are many other possible instructions of very little complexity that we so far choose to exclude, and these *W instructions perhaps should be among them.

For instance, there's a whole category of instructions that would have more impact but aren't currently included, and that is the unsigned equivalents of the existing RV64I *W instructions: ADDWU, SUBWU, SLLWU, etc. These would be just like the existing *W instructions but instead zeroing the upper 32 bits, as appropriate for an unsigned int or uint32_t result type rather than int or int32_t. We hardly need to run any experiments to know that such *WU instructions would be used far more frequently than the *W instructions proposed for the B extension.

Rename ADDUW, SUBUW, SLLIUW

The new "prefix zero-extend" instructions currently have names that don't follow the existing convention for *W instructions. In particular, these new instructions do not act at all like instructions DIVUW and REMUW, which perform their operation on two unsigned 32-bit values and then sign-extend the 32-bit result.

To avoid confusion, the new instructions need different names. I propose

ADDZX
SUBZX
SLLZXI

where "ZX" stands for "zero-extend". For example, instruction

SLLZXI rd,rs1,i

acts the same as the sequence

ZEXT.W rd,rs1
SLLI rd,rd,i

grev/grevi issue with rs2 vs imm

It starts off by listing both grev ... rs2 and grevi ... imm.

Then the first paragraph says "It takes in a single register value and an immediate ..." which conflicts with the above and suggests the last operand must be an immediate.

Then the second paragraph says "This operation iteratively checks each bit i in rs2 ..." which suggests that the last operand must be a register.
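
For context, both forms compute the same function; only the source of the control value differs. A 32-bit sketch along the lines of the draft's reference code, where the immediate of grevi simply takes the place of rs2:

uint32_t grev32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 31;   /* for grevi, the immediate plays this role */
    if (shamt &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (shamt &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (shamt &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (shamt &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (shamt & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}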

SLLI, SRLI, ... ROR: Confusion about encoding

Hi all,

I don't understand the encoding for the SLLI, SRLI, ... ROR family of instructions.
Bitmanip v0.9 spec, page 35:

[figure: encoding table from the Bitmanip v0.9 spec, page 35]

The aforementioned SLLI* instruction is not defined this way in the spec. There
is a 6-bit immediate field, not 7, as shown below for rv64i (page 30 of ISA spec v2.2):

[figure: RV64 SLLI/SRLI/SRAI encoding from the ISA spec]

Moreover, the assembler code treats the first two fields like a funct6, followed by a 6-bit immediate:

... opcode/riscv.h:
#define OP_MASK_SHAMT		0x3f
#define OP_SH_SHAMT		20
... riscv-opc.c:
#define USE_BITS(mask,shift)    (used_bits |= ((insn_t)(mask) << (shift)))
...
      case '>': USE_BITS (OP_MASK_SHAMT,        OP_SH_SHAMT);   break;

Could someone please explain what's going on here? And if this is correct, should it not have its own encoding format?

Incorrect operands type for bswapdi2 in gcc/config/riscv/bitmanip.md

Hello,

Currently bswapdi2 is defined in bitmanip.md as follows:

(define_insn "bswapdi2"
  [(set (match_operand:SI 0 "register_operand" "=r")
	(bswap:SI (match_operand:SI 1 "register_operand" "r")))]
  "TARGET_64BIT && TARGET_BITMANIP"
  "grevi\t%0,%1,0x38"
  [(set_attr "type" "bitmanip")])

Looks like SI should be replaced with DI in this definition.
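
With that substitution, the pattern would read as follows (a sketch of the suggested fix, not a committed patch):

(define_insn "bswapdi2"
  [(set (match_operand:DI 0 "register_operand" "=r")
	(bswap:DI (match_operand:DI 1 "register_operand" "r")))]
  "TARGET_64BIT && TARGET_BITMANIP"
  "grevi\t%0,%1,0x38"
  [(set_attr "type" "bitmanip")])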

My test was a function listed below (originally defined in libgcc/libgcc2.c):

typedef long DItype;

DItype
__bswapdi2 (DItype u)
{
  return ((((u) & 0xff00000000000000ull) >> 56)
          | (((u) & 0x00ff000000000000ull) >> 40)
          | (((u) & 0x0000ff0000000000ull) >> 24)
          | (((u) & 0x000000ff00000000ull) >>  8)
          | (((u) & 0x00000000ff000000ull) <<  8)
          | (((u) & 0x0000000000ff0000ull) << 24)
          | (((u) & 0x000000000000ff00ull) << 40)
          | (((u) & 0x00000000000000ffull) << 56));
}

Before replacement of SI with DI the function assembly was broken:

__bswapdi2:
        addi    sp,sp,-16
        sd      ra,8(sp)
        call    __bswapdi2    /// <-- infinite recursion
        ld      ra,8(sp)
        addi    sp,sp,16
        jr      ra

After replacement it looks good:

__bswapdi2:
        grevi   a0,a0,0x38
        ret

Please make a fix.

gcc experimenting

I tried adding gcc optimization support for the b extension. This is one day of work, so I only added the easy ones, didn't verify the results with execution, and haven't tried to handle every case. The assembler is missing support for the rev and zext aliases but I can emit the pack and grevi instructions for those. The assembler is missing support for the addwu, subwu, addu.w, subu.w, and slliu.w instructions, so those are disabled though I am able to generate them.

This patch doesn't affect dhrystone, but for coremark I see a 280 byte reduction in size, which is about 1.5%, with 99 pack instructions and 3 max instructions. Then I realized I had the signed ee_u32 hack in my tree, so I tried undoing that. Now I see a 384 byte reduction in size, which is about 2%, with 190 pack instructions, 3 max instructions, and 1 maxu instruction. We can perhaps get better results with support for the missing addwu etc instructions.

gcc-b-support.patch.txt
tmp.c.txt

maybe use pcnt(rs1 ^ rs2) ?

In https://groups.google.com/d/msg/comp.arch/8MR8_O-wCeE/8pyiGYz8AQAJ Pedro Pereira suggests:

In the latest bitmanip extension document, the popcount opcode is defined as:

rd = pcnt(rs)

a more useful primitive would be:

rd = pcnt(rs1 ^ rs2)

Since the RISC-V has a zero register (x0), the suggested version could
encode the first one as "rd = pcnt(rs ^ x0)".

I don't imagine that reading one extra register and
performing a xor would make the instruction need more cycles.
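
A hedged sketch of the main use case, Hamming distance, in the spec's pseudocode style (pcnt here stands for the proposed instruction semantics):

/* Today this is an XOR followed by a PCNT (two instructions); with the
   proposed form it is a single instruction, and plain population count
   is recovered as pcnt(rs ^ x0). */
int hamming_distance(uint_xlen_t a, uint_xlen_t b)
{
    return pcnt(a ^ b);
}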

rvintrin.h file name

This is a very general name. Other extensions will also want intrinsics files. Perhaps B extension intrinsics should be in a file with a name like rvb-intrin.h to make it clear that these are RISC-V B extension intrinsics.

If we are putting all intrinsics in one file, then we may need to conditionalize them based on whether that particular extension is enabled. That requires a macro as per issue #28.

Should shift immediate for SLLIU.W be 6 bits?

The document says

slliu.w is identical to slli, except that bits XLEN-1:32 of the rs1 argument are cleared before the shift.

However, the proposed encoding for SLLIU.W shows another difference: unlike the RV64 SLLI, the shift immediate for SLLIU.W is only 5 bits, supporting a maximum shift of 31 bits. If this was intentional, the limitation should be explained. Otherwise, it looks like the encoding will need to be changed to accommodate a 6-bit shift for RV64 and, I presume, a 7-bit shift eventually for RV128.
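
For reference, the described semantics in the draft's pseudocode style (my paraphrase), with the shift-amount width being exactly the open question:

uint_xlen_t slliu_w(uint_xlen_t rs1, int imm)
{
    uint_xlen_t u = rs1 << (XLEN - 32) >> (XLEN - 32);  /* clear bits XLEN-1:32 */
    return u << imm;  /* imm needs 6 bits on RV64 (and presumably 7 on RV128) for the full range */
}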

REPACK[I] (Zbf) and PACKB (Zbb)

Reflecting on https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/0emw3Y8ZNxY I'd like to propose two new instructions for packing structs of bitfields and bytes.

Note that the RISC-V calling convention requires structs that fit in a register to be passed in a register when passed by value. This means that in the worst case we need to pack that register on each function call.

REPACK

First, a REPACK[I] instruction (in Zbf) with the following semantic would help improve the performance of bitfield packing.

uint_xlen_t repack(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN-1);
    uint_xlen_t lower = (rs1 << XLEN/2) >> XLEN/2;  /* low half of rs1 */
    uint_xlen_t upper = (rs1 >> XLEN/2) << shamt;   /* high half of rs1, moved to the offset */
    uint_xlen_t mask = ~(uint_xlen_t)0 << shamt;
    return (upper & mask ) | (lower & ~mask);
}

That is, take the upper half of rs1 and place it over the lower half of rs1 at the offset specified by rs2. This could likely re-use most of the circuitry for BFP, the other instruction in Zbf.

Packing N data registers D0,D1,..,D(N-1) into a bit field, using the lengths L0,L1,..,L(N-1):

PACK a0,D0,D1
REPACKI a0,a0,L0
PACK a0,a0,D2
REPACKI a0,a0,L0+L1
...
PACK a0,a0,D(N-1)
REPACKI a0,a0,L0+L1+...+L(N-2)

A word with N bitfields can be packed in 2*(N-1) instructions this way, when L0+L1+...+L(N-2) < XLEN/2 and L(i) <= XLEN/2 for all i in 0..(N-1). Only a few extra instructions are needed to stitch together the larger pieces in the remaining cases.

The main difference in use-case between REPACK and BFP is that the former is primarily useful for constructing a new struct of bitfields from its members, whereas the latter is primarily useful when overwriting one particular bitfield in such an existing struct, usually as part of a read-modify-write pattern.

PACKB

A Pack Bytes (PACKB) instruction in Zbb would help to pack structs of bytes.

uint_xlen_t packb(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return (rs1&255) | ((rs2&255)<<8);
}

This would allow packing of 4 bytes into a 32-bit word in 3 instructions instead of 5 and would only require "Zbb" (that is, it would not require SHFL, unlike the 5-instruction solution):

PACKB a0, a0, a1
PACKB a1, a2, a3
PACK[W] a0, a0, a1

Clarification on cmix documentation

I need a small clarification. The description for the cmix instruction says that -

It is equivalent to the following sequence.
and rd, rs1, rs2
andn t0, rs3, rs2
or rd, rd, t0

Is it implied that the register t0 will be modified as a result of the execution of this instruction?
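
A single-expression sketch of the same semantics (my paraphrase of the description; no scratch register is involved in the result):

uint_xlen_t cmix(uint_xlen_t rs2, uint_xlen_t rs1, uint_xlen_t rs3)
{
    /* bits of rs1 where rs2 is 1, bits of rs3 where rs2 is 0 */
    return (rs1 & rs2) | (rs3 & ~rs2);
}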

Pull request outstanding

@cliffordwolf
Hi Clifford, could you please take a look at the pull request I made a week ago and review whether it can be merged or whether I need to make some changes?
Many Thx
Lee

Enhanced functionality for GREV / GREVI

I propose a compatible, useful, and low cost enhancement to the GREV/GREVI Generalized Reverse instructions.

At present each stage of GREV can either swap each pair of bits or else propagate them unchanged, as determined by the SHAMT bit for that stage.

I propose to perform some other function on each pair that would normally be swapped, with the same function being substituted for "swap" at each stage. The function to be performed is specified by one or more currently unused bits in rs2 or imm e.g. bit 6, or bits 6-7, or perhaps higher numbered bits to allow for 128 bit CPUs.

Supposing two bits are used, the encoding might be:

00: swap the two bits
01: both outputs are the OR of the two input bits
10: both outputs are the AND of the two input bits
11: I don't have a candidate. XOR isn't useful.

Due to DeMorgan's laws it is not necessary to provide both AND and OR, so if a useful 4th function can't be thought of then perhaps only one bit should be used.

EFFECT

I anticipate that OR or AND processing would normally be added to grev.w, grev.h, grev.b, grev.n, or grev.p. I have not evaluated whether use with other mask values is useful.

When used with one of the above masks the effect of OR instead of SWAP is to set the entire field to 1s if any bit in the field is 1. Fields consisting entirely of 0s remain as 0s. The effect of AND is to set the entire field to 0s if any bit in the field is 0. Fields consisting entirely of 1s remain as 1s.

For example, with an input of cbf20097147200ac the output of GREV.B.OR is ffff00ffffff00ff.

APPLICATIONS

If the output of the above GREV.B.OR is inverted to 0000ff000000ff00 then CLZ or CTZ can be used to determine the position of the first or last zero byte in the input value.

Alternatively, the input value could be inverted and then GREV.B.AND will produce the necessary input for CLZ or CTZ.

This is very valuable in efficient implementation of C string processing functions such as strlen(), strcpy(), strcmp().

Along with general benefits, this will provide a large boost to RISC-V scores in Dhrystone.

Using GREV.H.OR provides the same functionality for UTF-16 or UCS-2, or GREV.W.OR for UTF-32 or UCS-4.
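
A hedged C model of the proposed GREV.B.OR effect and the zero-byte scan built on it (the helper names are illustrative, not part of the draft):

#include <stdint.h>

/* Model of the proposed GREV.B.OR on a 64-bit value: each byte of the result
   becomes 0xFF if the corresponding input byte is nonzero, 0x00 otherwise. */
static uint64_t grev_b_or(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 8)
        if ((x >> i) & 0xFF)
            r |= 0xFFull << i;
    return r;
}

/* Index of the first zero byte (little-endian), or 8 if there is none:
   invert the OR-combined value and count trailing zeros, as CTZ would. */
static int first_zero_byte(uint64_t x)
{
    uint64_t m = ~grev_b_or(x);
    return m ? __builtin_ctzll(m) / 8 : 8;
}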

COST

GREV is dominated by wire cost. The logic at each node is extremely small and increasing its size will not meaningfully impact the cost of GREV in either SoC or FPGA.

In particular, in FPGAs with splittable 6-LUTs we have five inputs (the two input bits, the SHAMT swap/fn enable for the stage, and my proposed two function select bits) and these inputs determine two independent bit outputs -- a perfect fit.

The attached program provides a reference C implementation of the proposed modification, checks that it produces the same output as the existing reference implementation when the function select bits are 0, and demonstrates finding all-zero fields of widths 4 to 32.

grev.txt

Consider changing BFP not to wrap around new field in result

As instruction BFP is currently defined, the bit field it overlays in the rs1 value may wrap around to span both the high (most-significant) and low (least-significant) ends of the result. For example, this sequence,

li t0,12<<24|26<<16|0xABCD
bfp t1,zero,t0

for RV32, leaves t1 with the value 0x340002AF, because the 16-bit value 0xABCD shifted left 26 bits (without clipping) is 0x2AF34000000, and this value wraps around from high bits to low bits in the result.

Are there expected advantages to this wrapping? My analysis indicates that the hardware for BFP (and for the B extension generally) can be reduced a little by not defining BFP to wrap around this way. The basic reason is that BFP requires the hardware to separately create a mask in addition to shifting (or rotating) rs2[15:0], and forcing this mask to wrap around adds a little extra circuitry.

I can imagine some applications might benefit from wrapping around, while others benefit more from not wrapping around. If there are good reasons to prefer wrap-around, I suggest adding an explanation of that choice to the document. If not, I propose modifying the specified behavior to use shifts instead of rotations, like so:

uint_xlen_t bfp(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int len = (rs2 >> 24) & 15;
    int off = (rs2 >> 16) & (XLEN-1);
    len = len ? len : 16;
    uint_xlen_t mask = slo(0, len) << off;
    uint_xlen_t data = rs2 << off;
    return (data & mask) | (rs1 & ~mask);
}

For my example above, the value left in t1 would then be 0x34000000.

(Note that the hardware could still quietly substitute

uint_xlen_t data = rol(rs2, off);

for computing data without changing the behavior, if that's more convenient. The issue is just with the rotation of the mask.)

Should be ANDN instead of ANDC.

The long-established name of the logic operation that ANDs two Boolean inputs and complements the result is NAND, not CAND. RISC-V assembly language has a pseudo-instruction for a bitwise complement called NOT, not COMPL or whatever. This draft extension includes instructions called NAND, NOR, and C.NOT. For consistency, shouldn't the instruction that computes a bitwise AND with the complement of the second operand be called ANDN instead of ANDC?

Consider renaming CLMUL to XMUL

I request that instruction CLMUL, standing for "carry-less multiply", be renamed to XMUL, for "XOR multiply", meaning a multiplication where the partial products are summed by bitwise XORs instead of the usual additions. My reason is simply that I find the name "carry-less multiply" to be awkward, and I'm probably not alone. The name "carry-less multiply" appears not to be entrenched except in connection to x86 processors.

I attempted a search, and as far as I can tell, the "CLMUL" name has been adopted as part of a standard ISA only for the x86 (with SIMD instruction PCLMULQDQ). The B extension draft notes that the equivalent SPARC instruction is XMULX, officially documented as "XOR multiply". Most Web references to "carry-less multiply", "carry-less multiplication", and "carry-less product" seem to point back one way or another to Intel's CLMUL instruction. Also, there appears as yet to be no __builtin_clmul in GCC. The path should therefore be clear for us to choose the name XMUL if others agree with me it would be preferable.
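
The renaming argument is easiest to see from the operation itself; a sketch in the spec's pseudocode style, where the partial products are combined with XOR rather than addition:

uint_xlen_t clmul(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN; i++)
        if ((rs2 >> i) & 1)
            x ^= rs1 << i;   /* XOR-accumulate the shifted partial products */
    return x;
}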

internal compiler error: in decompose, at rtl.h:2279

Hello,

The compilation of the source file foo.c (see below) fails with an internal error when the compiler is run as follows:

$ riscv64-unknown-elf-gcc -O2 -march=rv64ib -mabi=lp64 -S -o foo.S foo.c
during RTL pass: combine
foo.c: In function ‘foo’:
foo.c:4:1: internal compiler error: in decompose, at rtl.h:2279

Source code foo.c:

int foo(int n)
{
    return n + 0x7fffffff;
}

Additional notes:

  • Replacing -O2 with -O{1,0,s} does the same. When -O is omitted, the program compiles.
  • When -march=rv64ib is replaced with -march=rv64i, the program compiles.
  • When -march=rv64ib -mabi=lp64 is replaced with -march=rv32ib -mabi=ilp32, the program compiles.

gcc -v log:

[user@s01 bug]$ ../riscv64b/bin/riscv64-unknown-elf-gcc -Os -march=rv64ib -mabi=lp64 -v -S -o foo.S foo.c
Using built-in specs.
COLLECT_GCC=../riscv64b/bin/riscv64-unknown-elf-gcc
Target: riscv64-unknown-elf
Configured with: ../riscv-gcc/configure --prefix=/home/user/riscv-bitmanip/riscv64b --target=riscv64-unknown-elf --enable-languages=c --disable-libssp
Thread model: single
Supported LTO compression algorithms: zlib
gcc version 10.0.0 20190929 (experimental) (GCC) 
COLLECT_GCC_OPTIONS='-Os' '-march=rv64ib' '-mabi=lp64' '-v' '-S' '-o' 'foo.S'
 /home/user/riscv-bitmanip/riscv64b/libexec/gcc/riscv64-unknown-elf/10.0.0/cc1 -quiet -v foo.c -quiet -dumpbase foo.c -march=rv64ib -mabi=lp64 -auxbase-strip foo.S -Os -version -o foo.S
GNU C17 (GCC) version 10.0.0 20190929 (experimental) (riscv64-unknown-elf)
        compiled by GNU C version 4.8.5 20150623 (Red Hat 4.8.5-39), GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version none
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
ignoring nonexistent directory "/home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/../../../../riscv64-unknown-elf/sys-include"
#include "..." search starts here:
#include <...> search starts here:
 /home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/include
 /home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/include-fixed
 /home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/../../../../riscv64-unknown-elf/include
End of search list.
GNU C17 (GCC) version 10.0.0 20190929 (experimental) (riscv64-unknown-elf)
        compiled by GNU C version 4.8.5 20150623 (Red Hat 4.8.5-39), GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version none
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
Compiler executable checksum: bff2308ac495c30be0d25ad6caff4627
during RTL pass: combine
foo.c: In function ‘foo’:
foo.c:4:1: internal compiler error: in decompose, at rtl.h:2279
    4 | }
      | ^
0x556a3e wi::int_traits<std::pair<rtx_def*, machine_mode> >::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode> const&)
        ../../riscv-gcc/gcc/rtl.h:2277
0xbd4961 wi::int_traits<std::pair<rtx_def*, machine_mode> >::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode> const&)
        ../../riscv-gcc/gcc/wide-int.h:3102
0xbd4961 wide_int_ref_storage<std::pair<rtx_def*, machine_mode> >
        ../../riscv-gcc/gcc/wide-int.h:1032
0xbd4961 generic_wide_int<std::pair<rtx_def*, machine_mode> >
        ../../riscv-gcc/gcc/wide-int.h:790
0xbd4961 add<std::pair<rtx_def*, machine_mode>, std::pair<rtx_def*, machine_mode> >
        ../../riscv-gcc/gcc/wide-int.h:2422
0xbd4961 simplify_const_binary_operation(rtx_code, machine_mode, rtx_def*, rtx_def*)
        ../../riscv-gcc/gcc/simplify-rtx.c:4318
0xbd9cde simplify_binary_operation(rtx_code, machine_mode, rtx_def*, rtx_def*)
        ../../riscv-gcc/gcc/simplify-rtx.c:2156
0x1227a71 combine_simplify_rtx
        ../../riscv-gcc/gcc/combine.c:5804
0x122a492 subst
        ../../riscv-gcc/gcc/combine.c:5726
0x122a108 subst
        ../../riscv-gcc/gcc/combine.c:5667
0x122c4ed try_combine
        ../../riscv-gcc/gcc/combine.c:3422
0x12323d8 combine_instructions
        ../../riscv-gcc/gcc/combine.c:1305
0x12323d8 rest_of_handle_combine
        ../../riscv-gcc/gcc/combine.c:15066
0x12323d8 execute
        ../../riscv-gcc/gcc/combine.c:15111
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

System info: CentOS 7, uname -a: Linux XXXXXXXXX 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

documentation issue

@cliffordwolf
Hi Clifford,
I have a query regarding the documented behavior for fsl. The text says that

The fsl rd, rs1, rs2, rs3 instruction creates a 2 x XLEN word by concatenating rs1 and rs3 (with rs1 in the MSB half)

but from the pseudo code, it looks as though rs1 is in the lower half.
Can you clarify which is correct?
Thx
Lee

ANDC Bit Encoding

So I was looking at the encodings; a lot are not fully described, but this one seems obvious to me:
| ??????? | rs2 | rs1 | ??? | rd | 0110011 | ANDC

Here is the current AND:
| 0000000 | rs2 | rs1 | 111 | rd | 0110011 | AND

Why not define ANDC much like the shift right / arithmetic shift right pair? ADD to SUB is very similar as well. So using this bit to denote negation seems pretty intuitive:
| 0100000 | rs2 | rs1 | 111 | rd | 0110011 | ANDC

While probably not as useful as ANDC, it is also logically easy to extend complemented inputs to the other bitwise instructions like this:

| 0100000 | rs2 | rs1 | 100 | rd | 0110011 | XORC
| 0100000 | rs2 | rs1 | 110 | rd | 0110011 | ORC

Also, when it comes to hardware implementation, negation is used when converting a positive number to a negative number (2's complement). Therefore, it would be easy to reuse the same negation hardware if the same bit were used to denote negation. And keeping func3 the same means an ALU only has to worry about choosing whether to negate an input.
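
For clarity, a minimal sketch of what these encodings would compute, assuming the extra bit negates the second operand (uint_xlen_t as in the draft's pseudocode):

uint_xlen_t andc(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 & ~rs2; }
uint_xlen_t orc (uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 | ~rs2; }
uint_xlen_t xorc(uint_xlen_t rs1, uint_xlen_t rs2) { return rs1 ^ ~rs2; }  /* same result as XNOR */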

Implement "-march=rv32ib_Zbb_Zbc_...ZbX" ISA subset selection

Hi all,

I have put together a small patchset which implements command-line bitmanip ISA subset selection. I've tried to follow the current bitmanip draft spec as closely as possible. In particular, the 'B' bitmanip subset is taken as the one indicated by the extended dotted line (everything excluding Zbt, Zbf.)

Please also read the "Problems" section at the bottom.

The bitmanip spec currently defines 9 subgroups of instructions. I defined them as target flags residing in a new target variable called "x_riscv_bitmanip_flags", each called accordingly:

OPTION_MASK_BITMANIP_ZBB
OPTION_MASK_BITMANIP_ZBC
OPTION_MASK_BITMANIP_ZBE
OPTION_MASK_BITMANIP_ZBF
OPTION_MASK_BITMANIP_ZBM
OPTION_MASK_BITMANIP_ZBP
OPTION_MASK_BITMANIP_ZBR
OPTION_MASK_BITMANIP_ZBS
OPTION_MASK_BITMANIP_ZBT

When invoking gcc as follows:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c

the flag states become

ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1    1    1    0    1    1    1    1    0    1

If the user provides at least one sub-ISA specifier, then only the sub-ISA flags are honoured, e.g.:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib_Zbb -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c

will set

ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1    0    0    0    0    0    0    0    0    1

and all following sub-ISA specifiers are simply additive, e.g.:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib_Zbb_Zbf_Zbt -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c

will set

ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1    0    0    1    0    0    0    0    1    1

and so forth. The user must provide the 'b' keyword before adding any "ZbX" directives, and the "ZbX" directives must always follow directly after the 'b' directive. The "ZbX" directives must always have at least one set of underscores surrounding them. If there are multiple "ZbX" directives, they must come one after the other.

The "MASK_BITMANIP" target macro is still there, but it is not used in the bitmanip.md conditions.

There are 3 patches. Each can be applied without breaking the build, but they must come in the right order.
Patch 1: add riscv.opt "riscv_bitmanip_flags" variable, and associated masks.
Patch 2: modify bitmanip.md insns to only get generated if the associated subset mask is set.
Patch 3: modify riscv-common.c to accept "ZbX" form of subset ISA flags.

I've also provided a patch that applies all of them at once.

There is almost certainly some fiddling to be done with the riscv.c file as well, and I suspect there are a few corner cases in the parser, but this is useful as it is, and I would like to see what people think of the current implementation.

Output assembly arch attributes look like this:
.attribute arch, "rv32i2p0_b2p0_Zbb2p0_Zbc2p0_Zbt2p0_Zbp2p0"

Patches:
0001-add-bmi-subisa-march-opts.patch.txt
0002-add-bmi-subisa-march-bitmanip.patch.txt
0003-add-bmi-subisa-march-common.patch.txt
add-bmi-subisa-march-all.patch.txt

Problems:

  1. Canonical flags order.

No order of flags is currently enforced. What is the canonical order of the ZbX flags? Alphabetical? Or something else?

  2. Redundant flag names.

Unfortunately, GCC refuses to create target masks relative to a specified variable if a flag name is not provided. I am referring to the riscv.opt file.

Take for example the ZBB directive:

...
mbmi-zbb
Target Mask(BITMANIP_ZBB) Var(riscv_bitmanip_flags)
Support the base subset of the Bitmanip extension.
...

This causes the following code to be generated in build/gcc/options.h:

#define OPTION_MASK_BITMANIP_ZBB (HOST_WIDE_INT_1U << 0) // <<<<<<<<<<<<<<<<
#define OPTION_MASK_BITMANIP_ZBC (HOST_WIDE_INT_1U << 1)
#define OPTION_MASK_BITMANIP_ZBE (HOST_WIDE_INT_1U << 2)
#define OPTION_MASK_BITMANIP_ZBF (HOST_WIDE_INT_1U << 3)
#define OPTION_MASK_BITMANIP_ZBM (HOST_WIDE_INT_1U << 4)
#define OPTION_MASK_BITMANIP_ZBP (HOST_WIDE_INT_1U << 5)
#define OPTION_MASK_BITMANIP_ZBR (HOST_WIDE_INT_1U << 6)
#define OPTION_MASK_BITMANIP_ZBS (HOST_WIDE_INT_1U << 7)
#define OPTION_MASK_BITMANIP_ZBT (HOST_WIDE_INT_1U << 8)
#define MASK_DIV (1U << 0)
#define MASK_EXPLICIT_RELOCS (1U << 1)
#define MASK_FDIV (1U << 2)
#define MASK_SAVE_RESTORE (1U << 3)
#define MASK_STRICT_ALIGN (1U << 4)
#define MASK_64BIT (1U << 5)
#define MASK_ATOMIC (1U << 6)
#define MASK_BITMANIP (1U << 7)
#define MASK_DOUBLE_FLOAT (1U << 8)
#define MASK_HARD_FLOAT (1U << 9)
#define MASK_MUL (1U << 10)
#define MASK_RVC (1U << 11)
#define MASK_RVE (1U << 12)

We probably don't want to expose this dual method of specifying subisas, so let's try removing the -mbmi-zbb name from riscv.opt:

...
Target Mask(BITMANIP_ZBB) Var(riscv_bitmanip_flags)
...

This causes the following output in build/gcc/options.h:

#define OPTION_MASK_BITMANIP_ZBC (HOST_WIDE_INT_1U << 0)
#define OPTION_MASK_BITMANIP_ZBE (HOST_WIDE_INT_1U << 1)
#define OPTION_MASK_BITMANIP_ZBF (HOST_WIDE_INT_1U << 2)
#define OPTION_MASK_BITMANIP_ZBM (HOST_WIDE_INT_1U << 3)
#define OPTION_MASK_BITMANIP_ZBP (HOST_WIDE_INT_1U << 4)
#define OPTION_MASK_BITMANIP_ZBR (HOST_WIDE_INT_1U << 5)
#define OPTION_MASK_BITMANIP_ZBS (HOST_WIDE_INT_1U << 6)
#define OPTION_MASK_BITMANIP_ZBT (HOST_WIDE_INT_1U << 7)
#define MASK_DIV (1U << 0)
#define MASK_EXPLICIT_RELOCS (1U << 1)
#define MASK_FDIV (1U << 2)
#define MASK_SAVE_RESTORE (1U << 3)
#define MASK_STRICT_ALIGN (1U << 4)
#define MASK_64BIT (1U << 5)
#define MASK_ATOMIC (1U << 6)
#define MASK_BITMANIP (1U << 7)
#define MASK_DOUBLE_FLOAT (1U << 8)
#define MASK_HARD_FLOAT (1U << 9)
#define MASK_MUL (1U << 10)
#define MASK_RVC (1U << 11)
#define MASK_RVE (1U << 12)
#define MASK_BITMANIP_ZBB (1U << 13) // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

It's now relative to the general-purpose target_flags variable, rather than the riscv_bitmanip_flags. Is there some magic combination of GCC option properties to get around this?

SROW/SLOW shift masking

@cliffordwolf
Hi Clifford
In the spec, the pseudo code for SROW is as follows:

int shamt = rs2 & (XLEN - 1);
return ~(~rs1 >> shamt);

In the case of SROW, is XLEN the length of the machine (64 bit) or that of the target of the operation (32 bit)?

In other words, if
rs2 = 0xFFFF_FFFF_FFFF_FFFF
is the shift amount 31 or 63?
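
The two readings side by side (illustrative only; which one the spec intends is exactly the question):

#include <stdint.h>

/* Possible shift amounts for SROW on RV64 when rs2 = 0xFFFF_FFFF_FFFF_FFFF. */
void srow_shamt_readings(uint64_t rs2, int *shamt_if_32, int *shamt_if_64)
{
    *shamt_if_32 = (int)(rs2 & (32 - 1));  /* XLEN read as the 32-bit operation width -> 31 */
    *shamt_if_64 = (int)(rs2 & (64 - 1));  /* XLEN read as the 64-bit machine width   -> 63 */
}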

Benchmark - Variable Length Unary Integer Coding (clz/ctz)

This might be useful:

VLU(...) is a little-endian variable-length integer coding that prefixes data bits with unary code length bits. The length is recovered by counting the least significant set bits, which encode a count of n-bit basic units. The data bits compactly trail the unary code prefix.

  • encode uses clz
  • decode uses ctz

With an 8 bit basic unit, the encoded size is similar to LEB128; 7-bits can be stored in 1 byte, 56-bits in 8 bytes and 112-bits in 16 bytes. Decoding, however, is significantly faster than LEB128, as it is not necessary to check for continuation bits every byte, instead the length can be decoded in a single count bits operation.

While VLU is not in major use, it could be substituted where LEB128 is used with reasonably significant benefits depending on the frequency of variable length fields. LEB128 probably performs similarly to VLU on a machine without bit scan forward and reverse. There are also potential SIMD or vector optimisations. For example, a decoder could have a predictor, and switch from a "per field" mode to a set of optimized modes. e.g. 128-bit SIMD code for parallel decoding of 16 x 7-bit fields.

There is symmetry in the encode and decode, with clz for figuring out the size of a word, and ctz to read the /prefix/ from the little-end. The code is a pretty good example of why little-endian makes more sense. The benchmarks currently perform decoding of 8-bit through to 56-bit and there is an optimized decoder for x86-64 BMI. I am investigating x86 SIMD and want to add support for big numbers. 112-bits and >= 128-bits.
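
A hedged sketch of the decode step with an 8-bit basic unit (my own illustration of the scheme described above, not the benchmark's code):

#include <stdint.h>

/* The trailing run of 1 bits gives the number of bytes beyond the first; the
   data bits sit immediately above the unary prefix.  Covers payloads up to
   56 bits and assumes the prefix contains at least one 0 bit. */
static uint64_t vlu8_decode(uint64_t word, int *encoded_bytes)
{
    int t = __builtin_ctzll(~word);                  /* count of trailing 1 bits (ctz) */
    *encoded_bytes = t + 1;
    uint64_t data_bits = 7ull * (uint64_t)(t + 1);   /* 7 payload bits per byte        */
    return (word >> (t + 1)) & ((1ull << data_bits) - 1);
}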

Behavior of *w instructions

The base ISA defines ADDW and SUBW like so:

ADDW and SUBW are RV64I-only instructions that are defined analogously to ADD and SUB but operate on 32-bit values and produce signed 32-bit results. Overflows are ignored, and the low 32-bits of the result is sign-extended to 64-bits and written to the destination register.

The Bitmanip extension refers to clzw, ctzw, pcntw, etc. but doesn't actually define how they work. The pseudocode is only defined for the instructions that operate on data of size XLEN.

Sophisticated readers understand that the *w instructions in Bitmanip are almost certainly supposed to behave like the *W instructions in the base architecture (operating on 32-bit data and then sign-extending the result to XLEN). However, this should be explicitly addressed with some verbiage and pseudocode so that everything is fully specified and unambiguous.
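
A sketch of the presumed semantics for clzw, following the base ISA's *W convention (my reading, which is precisely what the issue asks to have spelled out):

uint_xlen_t clzw(uint_xlen_t rs1)
{
    uint32_t x = (uint32_t)rs1;   /* operate on the low 32 bits only */
    int count = 0;
    for (int i = 31; i >= 0; i--) {
        if ((x >> i) & 1)
            break;
        count++;
    }
    /* the result 0..32 is non-negative, so sign-extending it to XLEN is trivial */
    return (uint_xlen_t)count;
}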

CRC Polynomials and Encoding

I am not sure that hard coding the polynomials for CRC is a good idea. It seems pretty constrained to add a CRC instruction but only support 2 polynomials. That said, the fact that it does have 0xedb88320 (what I call the IEEE polynomial) and the Castagnoli polynomial 0x82f63b78 does cover a large number of uses.

However, here is, for instance, a list of polynomials from Philip Koopman; people may use different ones depending on the application.
https://users.ece.cmu.edu/~koopman/crc/index.html

Honestly, I think using funct7 | rs2 | rs1 | f3 | rd | opcode | R-type would be better than the unary format. This would allow one to load polynomials from rs2. It would also only need 2 bits in funct7 for B, H, W, and D. Also a single bit in funct7 could select between 0xedb88320 and 0x82f63b78. There are two ways I could see handling the predefined polynomials.

For instance, if we wanted to default to the IEEE polynomial, we could take advantage of the zero register by changing the pseudo code to the following. This would also allow anyone to load any other polynomial from a register, but it must be XOR'ed with 0xedb88320 beforehand. For instance, if I wanted to use the polynomial 0xeb31d82e, XOR'ing gives me 0x06895b0e, which I can then use as the constant I load into the rs2 register.

uint_xlen_t crc32(uint_xlen_t rs1, uint_xlen_t rs2, int nbits) {
    for (int i = 0; i < nbits; i++) 
        rs1 = (rs1 >> 1) ^ ( ( 0xEDB88320 ^ rs2 ) & ~((rs1 & 1) - 1)); 
    return rs1; 
}

The other option would be to use a predefined polynomial when rs2 is the zero register. This avoids XOR'ing the desired polynomial. Although, I am not sure XOR'ing is a problem, since it can easily be done once, outside of run time, for a given polynomial.

Possible decode error for FSRI ?

@cliffordwolf
Hi Clifford,
I wonder if you could clarify something.
In my model containing the B extensions, I am getting a decode conflict reported for FSRI during our static analysis phase; could you clarify this for me?
Here are the decode definitions


    DECODE_ENTRY(0, RORI,     "|01100.......|.....|101|.....|0010011|");
    DECODE_ENTRY(0, SBEXTI,   "|01001.......|.....|101|.....|0010011|");
    DECODE_ENTRY(0, SROI,     "|00100.......|.....|101|.....|0010011|");
    DECODE_ENTRY(0, FSRI,     "|.....1......|.....|101|.....|0010011|");

A '.' indicates a wildcard, so as you can see, FSRI overlaps with RORI, SBEXTI and SROI:
FSRI bit [26] is part of the immediate value in the RORI, SBEXTI and SROI instructions, and
FSRI[31:27] is rs3, but is part of the decode in RORI, SBEXTI and SROI.

What are your thoughts?
Could this be a documentation error, and FSRI should be
DECODE_ENTRY(0, FSRI, "|.....1......|.....|101|.....|0110011|");
not
DECODE_ENTRY(0, FSRI, "|.....1......|.....|101|.....|0010011|");

Thx
Lee

sbclri: invalid format specifier

Hi all,

In the proposed patch for bitmanip Zbs family of instructions found here, the format specifier for sbclri doesn't work right because we're expecting a const_int with one bit low.

So, in order for `ctz_hwi' to work right, we need to invert the operand first, and for that we need a separate format specifier (or amend the const_int opcode in-place in the .md file, but I chose to do the former.)

The patch:
gcc-b-support-fix-sbclri.patch.txt

Testcase:

unsigned int
f(unsigned int a)
{
        return a & ~(1 << 29);
}

Before:

f:
        sbclri  a0,a0,0
        ret

After:

f:
        sbclri  a0,a0,29
        ret

Problem with installing the bit manipulation gcc compiler

Hello guys,
Recently I struggled with building the bit manipulation patch for the gcc compiler. When running the command bash gcc-build.sh (after, of course, configuring the shell file), I get this error about a missing configuration:
[screenshot of the missing-configuration error]

Proposal: func7 "Quadrants" for OP[32/128][+/-IM] opcode family

RISC V context for this proposal

This "OP Quadrant" proposal below has global implications for RISC V instruction encoding, but I propose it here in BitManip, as this is the first extension (other than 'M') to need such organisation within func7. (func7 & func3 have the usual meaning for the R-type instruction format).

Note: the RISC V user ISA spec explicitly states that RV128 may introduce new 128 bit instructions into an OP128 major opcode, which is the reverse of what happened for RV64. I assume this will be the case, in discussion below.

Why Quadrants are needed

"Contiguous" reserved opcode space is a precious resource. RISC V has only three reserved major opcodes left for future standard extensions.

Up to now, the only values of func7 for instructions within the OP-INT family of major opcodes are 0b0000000, 0b0000001 (MUL/DIV), and 0b0100000 (SUB/SRA). Bitmanip will substantially expand the usage of func7 values. It is important this is done in a rational way, as func7 values chosen within OP will also have major side effects on OP-IM, OP32[IM], and OP128[IM].

Within OP32 and OP128, up to 50% of these major opcodes are available as contiguous reserved space (for func3 values = 0bX1X, where X = 0 or 1, i.e. values that do not correspond to any "Q" or "W" instruction). Care needs to be taken not to punch "holes" into this space. (Unfortunately, two "M" instructions break this rule in OP32, reducing OP32 contiguous free opcode space slightly.)

Problems with proposed v0.90 BitManip encoding

The current BitManip v0.90 encoding proposal is a bit problematic in this regard, as it punches "holes" into "non-W" sections of OP32. These non-W sections otherwise form part of an unused 50% of OP32/OP128, and scattered holes within them will limit their long-term usefulness for other future extensions. (An example of a "hole" created in OP32 is BDEPW, which has a func3 value of 0b010.)

BitManip v0.90 also unnecessarily introduces a new two source R-type format specifically for one instruction, FSRI, which moves the rs2 register field to a new position. This will complicate implementation of superscalar out-of-order microarchitectures, and breaks the existing RISC V approach of keeping rs1 and rs2 in the same positions for every relevant instruction.

Why a Quadrant division is intrinsically imposed onto func7 organisation

The choice of 4x32 value Quadrants is not an arbitrary choice. It is in fact fundamental to the organisation of RV32, RV64 and RV128.

There is a 32 x value func7 constraint for I-type shift instructions with a 7 bit (RV128) immediate field. For RV64, I-type shift instructions have a 6 bit immediate field and can encode 64 values in their remaining instruction bits, hence translating into a 64 x value func7 constraint. (Hence Quadrants A & B need to be created for these distinct 2x32 value subsets of func7.)

Also, dividing up func7 into Quadrants is natural for ternary instructions, as blocks of 32 x func7 values are needed to introduce an "rs3" instruction format (hence Quadrant "D" needs to be created for such rs3-type instructions).

Quadrants in detail

Below is an outline of how func7 should be structured into Quadrants A-D, based on the last two bit values of func7 (shown below as ' | 00' to ' | 11' ):

Quadrant A1 (n=1): instructions with func7 = 0b00000 | 00

  • have matching I-type instruction in OP-IM/OP-32IM/OP-128IM
  • have matching "W" or "Q" instruction in OP-32/OP-128 if func3 = 0bX0X
  • does NOT have matching "W" or "Q" instruction in OP-32/OP-128 if func3 = 0bX1X

Quadrant A2 (n=29): instructions with func7 in range 0b00001 | 00 to 0b11101 | 00

  • (for func3=0bX01 only) have matching I-type instruction in OP-IM/OP-32IM/OP-128IM
  • have matching "W" or "Q" instruction in OP-32/OP-128 if func3 = 0bX0X
  • does NOT have matching "W" or "Q" instruction in OP-32/OP-128 if func3 = 0bX1X

Quadrant A3 (n=2, but could grow if needed): instructions with func7 = 0b1111X | 00

  • (if func3=0bX01) unary instructions within OP[32/64/128]IM, ie: the lower 5 bits of imm12 field is replaced by a func5 operand, which specifies a unary function operating on rs1 and storing result in rd. The func5 operand can specify 32 unary functions, of which func5 values 0b00000 & 0b00001 are reserved for functions which are derived from taking the corresponding two input Quadrant A3 OP function, and applying the value of zero or one to one of the two inputs (to yield a unary function). The remaining 30 unary functions can be arbitrary unary functions.
  • otherwise the rules for Quadrant A2 also apply to Quadrant A3

Quadrant B (n=33): func7 value in range 0b00000 | 10 to 0b11111 | 10

  • same as Quadrant A, except does not have corresponding I-type instructions for OP128IM.

Quadrant C (n=32): func7 value in range 0b00000 | 01 to 0b11111 | 01

  • currently used only by MUL/DIV, which (unfortunately) punches a hole in unused OP32 opcode space, by putting W version instructions into func3=0bX1X. (Maybe OP128 can avoid doing this in future, to reserve a fully contiguous half of the OP128 major for non-"Q" instruction uses).
  • does not have matching I-type instruction
  • can have matching "W" or "Q" instruction in OP-32/OP-128 for any func3 value

Quadrant D (n=32): func7 value in range 0b00000 | 11 to 0b11111 | 11

  • reserved for ternary functions (ie: instructions with an additional rs3 field, or with an additional 5 bit immediate operand). In this case, the FSRI instruction (with 64 shift range) can be replaced with FSLI and FSRI instructions, each with a 32 shift range.
  • rs3 field exists if func3=0bXX1 otherwise imm5 exists if func3=0bXX0 (note the instruction is still placed within OP, and not OP-IM despite the existence of imm5 as there are two source register inputs)
  • can have matching "W" or "Q" instruction in OP-32/OP-128 for func3 values = 0bX0X

Example BitManip encoding using Quadrants

Below is an example of how the above quadrants can be used to organise the BitManip proposed instructions:

(fields: func7 | rs2 | rs1 | rd | opcode; instruction columns ordered by func3: 000, 100, 001, 101, 010, 011, 110, 111)

Group A1  00000.00     rs2 rs1 rd 0110011   ADD XOR SLL SRL SLT SLTU OR AND
Group A2  01000.00     rs2 rs1 rd 0110011   SUB XNOR SBINV SRA ORN ANDN
          00001.00     rs2 rs1 rd 0110011   ADDU.W PACK SBSET GREV MIN MINU
          01001.00     rs2 rs1 rd 0110011   SUBU.W SBCLR SBEXT MAX MAXU
          00010.00     rs2 rs1 rd 0110011   ROL ROR SLO SRO
          01010.00     rs2 rs1 rd 0110011   BDEP BEXT SHFL UNSHFL
Group A3  11111.00     rs2 rs1 rd 0110011   CLMUL CLMULR CLMULH
Group B   xxxxx.10
Group C   00000.01     rs2 rs1 rd 0110011   MUL DIV MULH DIVU MULHSU MULHU REM REMU
Group D   rs3/imm5.11  rs2 rs1 rd 0110011   FSLI FSRI FSL FSR CMOVI CMOV CMIX

Note 1: OP-IM, OP32 and OP-32IM are not shown, as these are automatically implied by the quadrant in which each instruction is added.
Note 2: RORI is not included, as it can be replaced by FSLI/FSRI; bitmatrix instructions are not shown, as these are RV64-only and best placed in OP32 with func3=0bX1X.
Note 3: Unary instructions are not shown; they are placed into OP-IM in the slot occupied by CLMULH (i.e. Group A3 with func3=0bX01).
