
Comments (50)

jbush001 avatar jbush001 commented on August 20, 2024

Excellent question. I would love to eliminate these intrinsics, because, as you pointed out, they preclude optimizations and duplicate functionality of existing instructions. I made a few attempts to do this in the past, but ran into issues with the type legalizer. It's highly possible I've just missed something obvious, so it would be great to have someone else take a look. I can spend some time this evening describing the approaches I've taken and problems I've run into if that would be helpful context.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

A summary of what didn't work before would be very useful.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

This architecture uses scalar registers as masks. This differs from AVX-512, which has a separate set of mask registers. My first attempt was to make the scalar register class able to store the mask type (v16i1) as well as the integer type:

def ScalarReg : RegisterClass<"VectorProc", [v16i1, i32, f32], 32, (add (sequence "S%u", 0, 27),
    FP_REG, SP_REG, LINK_REG, PC_REG)> {
    // Explicitly specify size, otherwise TableGen will use 16 (looking at v16i1 type)
    let Size = 32;
}

I made my compiler builtins for mix take int32s as parameters and made comparisons return int32. Internally, I would bitcast between v16i1 and i32 types to match instruction patterns that took those types. My hope was that all arithmetic, loads, and stores would be done using i32s. Unfortunately, the optimizer would reorganize code so that I would end up with arithmetic on v16i1. I was faced with the prospect of having to make all integer instruction patterns accept all combinations of v16i1 and i32. I couldn't find a way to make TableGen expand these cleanly without manually creating a ton of patterns (I already have patterns to match various combinations of vector/scalar operands). I also had to do a bunch of hacks to make pattern matching work properly, but I don't remember the details. I eventually gave up and backed that out to make everything use i32, which is how it works now. The change that backed it out is here: ae7a490, which illustrates how I tried to implement it.

There's also a semi-coherent thread here where I'm talking to myself about this on the LLVM mailing list :)

http://lists.llvm.org/pipermail/llvm-dev/2013-July/063448.html

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

On further reflection, I believe supporting v16i1 as a native type would only require one extra arithmetic instruction pattern: (set v16i1:$dest, (OpNode v16i1:$src1, v16i1:$src2)). I think I may have overestimated the complexity of that solution.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Looking through my previous attempt, there were a lot of places where I was fighting the type system unnecessarily. For example, I defined vector comparison intrinsics that returned an i32 instead of v16i1:

def int_vp_mask_cmpi_ugt : Intrinsic<[llvm_i32_ty], [llvm_v16i32_ty, llvm_v16i32_ty], 
  	[IntrNoMem], "llvm.vectorproc.__builtin_vp_mask_cmpi_ugt">;

I was fairly new to LLVM when I attempted to implement this the first time. With a few years of experience digging around in LLVM, this seems more straightforward. Everything in the IR should use v16i1 exclusively for masks rather than i32. The C compiler should expose masks as plain old integers, using truncate/bitcast and bitcast/zext combinations to convert back and forth. This means it doesn't need LLVM intrinsics for the mix and compare functions. So, to implement the clang builtin that compares two vectors and returns an integer mask--int __builtin_nyuzi_mask_cmpi_ugt(veci16_t a, veci16_t b)--the code would be something like:

Value *MaskResult = Builder.CreateICmpUGT(Ops[0], Ops[1]);
Value *Scalarized = Builder.CreateBitCast(MaskResult, Builder.getInt16Ty());
return Builder.CreateZExt(Scalarized, Builder.getInt32Ty());

Likewise, the clang predication builtin (__builtin_nyuzi_vector_mixi(veci16_t v1, veci16_t v2, int mask)) converts in the opposite direction:

Value *Truncated = Builder.CreateTrunc(Ops[2], Builder.getInt16Ty());
Value *Mask = Builder.CreateBitCast(Truncated, llvm::VectorType::get(Builder.getInt1Ty(), 16));
return Builder.CreateSelect(Mask, Ops[0], Ops[1]);

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

As I play with this more, I'm remembering why this was troublesome. :)

The problem starts when lowering the INSERT_VECTOR_ELT (insert vector element) SDNode in NyuziTargetLowering. There is no native instruction to do this, so I want to lower it to a sequence that shifts a mask constant by the lane index and then does a masked transfer of the scalar value into the vector register. First, I compute the mask by shifting a constant:

  SDValue Mask = DAG.getNode(ISD::SRL, DL, MVT::i32, DAG.getConstant(0x8000, DL, MVT::i32),
                  Op.getOperand(2));

This is currently of type i32. I need to convert it into a v16i1 in order to use it with a vselect. Normally, I would use the same trick I used in the C++ compiler: first TRUNCATE it to an i16, then BITCAST it to a v16i1. However, this won't work because this is running in the legalize ops pass, which is after the legalize types pass (http://llvm.org/docs/CodeGenerator.html#selectiondag-instruction-selection-process). i16 is not a valid type, so it asserts here.

I can't do a shift directly on the v16i1 type, since that would assume it is shifting individual elements independently. Making the constant or shift be i16 won't work either (they are also illegal types).

My next thought is to lower that into a new Nyuzi-specific pseudo instruction with a custom instruction inserter that creates the machine instructions directly.

In NyuziISelLowering.cpp:

SDValue NyuziTargetLowering::LowerINSERT_VECTOR_ELT(SDValue Op,
                                                    SelectionDAG &DAG) const {
  SDLoc DL(Op);
  return DAG.getNode(NyuziISD::INSERT_VECTOR, DL, Op.getValueType(), 
                     Op.getOperand(0), Op.getOperand(1), Op.getOperand(2));
}

In NyuziInstrFormats.td:

def insertvector : SDNode<"NyuziISD::INSERT_VECTOR", SDTypeProfile<1, 3, 
  [SDTCisVec<0>, SDTCisSameAs<1, 0>, SDTCisEltOfVec<2, 1>, SDTCisInt<3>]>>;

In NyuziInstrInfo.td:

let usesCustomInserter = 1 in {
  def INSERT_VECTOR : Pseudo<
      (outs VR512:$dest),
      (ins VR512:$sourcevec, GPR32:$element, GPR32:$index),
      [(set v16i32:$dest, (insertvector v16i32:$sourcevec, i32:$element,
        i32:$index))]>;
}

def : Pat<(insertvector v16f32:$sourcevec, f32:$element, i32:$index),
    (INSERT_VECTOR v16f32:$sourcevec, f32:$element, i32:$index)>;

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Here is an experimental patch that partially implements this. It's not the right fix, but demonstrates some of the considerations for this change. It passes most of the llvm-lit tests, but runs into problems generating code for programs in the NyuziProcessor tree.
https://gist.github.com/jbush001/3bd7679425fedf74f612f58a2e690eb3

I've created a custom SDNode type vselecti32, which is like the vselect node but, as the name implies, takes an i32 as the mask instead of a v16i1. This can be matched in complex patterns for masked moves, and has a custom lowering pattern. It's kind of an ugly hack.

A few observations:

  1. The X86 backend appears to be converting a scalar value into an i1 vector with a bitcast:

    SDValue
    X86TargetLowering::LowerBUILD_VECTORvXi1(SDValue Op, SelectionDAG &DAG) const {
    ...
      if (ISD::isBuildVectorOfConstantSDNodes(Op.getNode())) {
        SDValue Imm = ConvertI1VectorToInteger(Op, DAG);
        if (Imm.getValueSizeInBits() == VT.getSizeInBits())
          return DAG.getBitcast(VT, Imm);
        SDValue ExtVec = DAG.getBitcast(MVT::v8i1, Imm);
        return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, ExtVec,
                            DAG.getIntPtrConstant(0, dl));
      }
    

    When I tried to do this, it complained about not being able to match the bitcast node during
    instruction selection, so they must be doing something else clever (there is a custom lowering function for the bitcast node)

  2. When I generate vector compare builtins, it is creating suboptimal code:

    cmpgt_f s0, v0, v1
    store_32 s0, 48(sp)
    load_u16 s0, 48(sp)
    

    The store/load combination is code that was expanded by LLVM to zero extend the comparison value. My intrinsics just ignore the high bits.

  3. It barfs on matching BUILD_VECTOR, which I haven't implemented yet, and I assume should turn into a constant pool load. Getting this working would also allow eliminating the use of vselecti32 from LowerVECTOR_SHUFFLE and LowerINSERT_VECTOR_ELT (in the case of a constant lane index). I looked through the constant pool code and it looks like it would not properly pack a v16i1 into a 16 bit value, as the smallest element size is a byte. However, I haven't confirmed this yet by trying it. If I could get bitcasting working during the legalize ops phase, I could store these in the constant pool as i32 values.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

Sorry for the radio silence, I only got to start working on this a couple days ago. I got comparisons, bitwise operations, loads/stores, rudimentary selects, and the calling convention to work, and then I got sidetracked by trying to remove the gather and scatter intrinsics (which you wisely just kept). Looking over the patch, it roughly matches what I did, but is much more comprehensive (clang, tests).

When I tried to do this, it complained about not being able to match the bitcast node during
instruction selection, so they must be doing something else clever (there is a custom lowering function for the bitcast node)

What did you do precisely? As i16 isn't legal, we can't just bitcast v16i1 to it, can we?

But as for not being able to select bitcasts, this looks like a job for these patterns (at the very end of NyuziInstrInfo.td)?

// Conversions
def : Pat<(v16f32 (bitconvert (v16i32 VR512:$src))), (v16f32 VR512:$src)>;
def : Pat<(v16i32 (bitconvert (v16f32 VR512:$src))), (v16i32 VR512:$src)>;
def : Pat<(f32 (bitconvert (i32 GPR32:$src))), (f32 GPR32:$src)>;
def : Pat<(i32 (bitconvert (f32 GPR32:$src))), (i32 GPR32:$src)>;

When I generate vector compare builtins, it is creating suboptimal code:

To clarify, does this always occur, or only when feeding the comparison result into a vselecti32 or another operation that requires an i32? Something that should generate a single instruction is just comparing a vector and returning the resulting v16i1 in s0, and indeed I got that working locally:

  %positive = fcmp ogt <16 x float> %0, zeroinitializer
  ret <16 x i1> %positive
; =>
	load_32 s0, .LCPI0_0 ; .LCPI0_0 = float 0
	cmpgt_f s0, v0, s0
	ret

It barfs on matching BUILD_VECTOR, which I haven't implemented yet, and I assume should turn into a constant pool load.

I did setOperationAction(ISD::BUILD_VECTOR, MVT::v16i1, Expand) on a whim and it seems to generate correct code, even if the constant pool entry has type i32 and alignment 16, which is more than a bit iffy.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

What did you do precisely? As i16 isn't legal, we can't just bitcast v16i1 to it, can we?

I marked i16 as a valid type for the GPR32 register class, which really doesn't seem right and probably breaks other stuff.

But as for not being able to select bitcasts, this looks like a job for these patterns

Heh... oops, right. It's been a few years and I had forgotten about that.

To clarify, does this always occur, or only when feeding the comparison result into a vselecti32 or another operation that requires an i32?

I believe any time it converts to i32 (the case I tried wasn't with vselecti32).

I did setOperationAction(ISD::BUILD_VECTOR, MVT::v16i1, Expand) on a whim and it seems to generate correct code, even if the constant pool entry has type i32 and alignment 16, which is more than a bit iffy.

Okay, I didn't think to try Expand in this case, but it often does something useful. I guess the alignment it is looking at comes from the vector type?

It sounds like you're making good progress!

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

One thought I just had (but haven't tested) is that there could be a custom SDNode that takes an i32 as an input and returns a v16i1 as a result. There could then be an empty rule to eliminate it during instruction selection:

def i32_to_mask : SDNode<"NyuziISD::INT_TO_MASK", SDTypeProfile<1, 1,
  [SDTCisVT<0, v16i1>, SDTCisVT<1, i32>]>>;

def : Pat<(v16i1 (i32_to_mask (i32 GPR32:$src))), (v16i1 GPR32:$src)>;

That could replace all the custom inserter gunk and be much simpler.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Nope :)

Assertion failed: ((NodeToMatch->getValueType(i) == Res.getValueType() || NodeToMatch->getValueType(i) == MVT::iPTR || Res.getValueType() == MVT::iPTR || NodeToMatch->getValueType(i).getSizeInBits() == Res.getValueSizeInBits()) && "invalid replacement"), function SelectCodeCommon, file /Users/jeffbush/src/NyuziToolchain/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, line 3629.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I did setOperationAction(ISD::BUILD_VECTOR, MVT::v16i1, Expand) on a whim and it seems to generate correct code

Never mind, the DAGs looked good but the actual code is dead wrong. The constant pool entry for a v16i1 is 16 .byte directives, each storing a single bit. What's worse, the load is a scalar load (i.e., it only reads the first four entries).

I guess I'll have to make do with a custom lowering. The constant pool entry could be 16 bit though, as v16i1 is a legal type and can be stored and loaded normally (with store_16 and load_u16, though I see the diff you uploaded doesn't include patterns for that).

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024
  • can't put an i16 in the constant pool because it's not legal
  • can't put a v16i1 in the constant pool because it'll be treated like <16 x i8>
  • putting an i32 in the constant pool generates deceptively plausible code but it's fraught with endianness issues (32 bit value, but 16 bit load).

The last one might be fixable (it might even work out fine on its own, I'm not sure) but it's super icky. I wonder if constant pool's bad behavior with i1 vectors could be fixed without breaking other targets?

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

can't put an i16 in the constant pool because it's not legal

I considered making mask registers be sub registers of the scalar registers (this feature is used to support packing AH and AL registers into AX on x86), but that seems pretty ugly and hacky and would make the assembly look weird

putting an i32 in the constant pool generates deceptively plausible code but it's fraught with endianness issues (32 bit value, but 16 bit load).

It could do a 32-bit load and use the same trick that INSERT_VECTOR_ELT uses to convert the computed mask into a v16i1 (in my example, the VSELECT_32 custom SDNode).

I wonder if constant pool's bad behavior with i1 vectors could be fixed without breaking other targets?

I was wondering that too. I'm guessing if it broke anything it would be the AVX stuff in the X86 backend.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I considered making mask registers be sub registers of the scalar registers (this feature is used to support packing AH and AL registers into AX on x86), but that seems pretty ugly and hacky and would make the assembly look weird

I considered this as well, but the instruction set doesn't really have separate registers for them, so this would be yet another "lie" that is bound to complicate everything else.

I was wondering that too. I'm guessing if it broke anything it would be the AVX stuff in the X86 backend.

In my (admittedly limited) tests with AVX-512, I couldn't get it to generate a constant pool entry for any masks. This makes sense, as it has moves with 16-bit immediates, so it can generate sequences like this:

        mov     eax, -2 # 0xFFFE
        kmovw   k1, eax

I do have good news though! Since the no-op SDNode for v16i1<->i32 casts did not work, I added pseudo instructions for those operations as well and replaced them with moves after instruction selection. (Perhaps the pseudos aren't even needed and the SDNodes can just be matched to moves during isel.) This seems to work so far, and it would avoid an excessive number of special SDNodes for the various operations (just lower them all normally, using these two nodes to convert between v16i1 and i32 where needed). It generates superfluous move instructions, but that is relatively small overhead, and can hopefully be optimized away by MIR-level peephole optimizations.
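Roughly, the replacement looks something like this (just a sketch; the pseudo opcode names INT_TO_MASK/MASK_TO_INT are placeholders and the exact hook signature depends on the LLVM version):

MachineBasicBlock *
NyuziTargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
                                                 MachineBasicBlock *BB) const {
  switch (MI.getOpcode()) {
  case Nyuzi::INT_TO_MASK:   // placeholder pseudo opcodes
  case Nyuzi::MASK_TO_INT: {
    // A mask and an i32 occupy the same scalar register, so the "conversion"
    // is just a register-to-register copy, which later passes can coalesce.
    const TargetInstrInfo *TII = BB->getParent()->getSubtarget().getInstrInfo();
    BuildMI(*BB, MI, MI.getDebugLoc(), TII->get(TargetOpcode::COPY),
            MI.getOperand(0).getReg())
        .addReg(MI.getOperand(1).getReg());
    MI.eraseFromParent();
    return BB;
  }
  default:
    llvm_unreachable("unexpected pseudo instruction");
  }
}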

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Replacing the moves seems like a valid approach. I was also thinking this would probably be easier if I changed the ISA to have vectors with 32 lanes. :)

I've been trying to put my finger on what fundamental assumption of LLVM this architecture is violating that is causing so many problems.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

Quick update: I've now applied this approach to the lowering of BUILD_VECTOR, INSERT_VECTOR_ELT (including support for v16i1), and VECTOR_SHUFFLE (replacing the vselecti32 kludge and consequently the SYNTH_MOVE_* pseudos). It works wonderfully so far. I even convinced LLVM to optimize out some moves by emitting generic COPY MachineInsts instead of target-specific move opcodes (edit: I also eliminated the pseudo instructions, it's now just two mask_{to,from}_int -> COPY isel patterns). For example, a function that returns a constant i1 vector can now simply be move s0, <const>; ret (if the constant fits in an immediate).
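For constant masks, the BUILD_VECTOR lowering boils down to packing the bits into a scalar immediate; a rough sketch (the INT_TO_MASK node name and the lane-to-bit mapping are placeholders, not the actual code):

SDValue NyuziTargetLowering::LowerBUILD_VECTOR(SDValue Op,
                                               SelectionDAG &DAG) const {
  SDLoc DL(Op);
  if (!ISD::isBuildVectorOfConstantSDNodes(Op.getNode()))
    return SDValue(); // let the generic expansion handle non-constant vectors

  // Pack the 16 constant bits into a scalar immediate instead of emitting a
  // constant pool entry. The bit position per lane must match whatever mask
  // convention the instruction set uses (see the discussion further down).
  uint32_t Bits = 0;
  for (unsigned i = 0; i < 16; ++i) {
    auto *Lane = dyn_cast<ConstantSDNode>(Op.getOperand(i));
    if (Lane && Lane->getZExtValue() != 0)
      Bits |= 1u << i;
  }

  SDValue Imm = DAG.getConstant(Bits, DL, MVT::i32);
  // Placeholder i32 -> v16i1 cast node; selected as a plain COPY.
  return DAG.getNode(NyuziISD::INT_TO_MASK, DL, MVT::v16i1, Imm);
}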

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

Oh, by the way, this test confuses me. It expects a 7 as the mask, but shouldn't it be either 2^7 or 2^(15-7)? (The lowering generates the latter, I'll assume that's correct.) Indeed if I add the missing : to the CHECK (it's ignored as is) the test fails. Is this just a brainfart or am I missing something?

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Oh, by the way, this test confuses me.

Yes, you're right, good catch.

Nice work! That sounds like a pretty straightforward solution.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I've started to work through failing test cases and was reminded of this part of the code:

// Native vector compare instructions return a bitmask. This function
// returns v16i32 from a comparison by doing a predicated transfer.
// clang seems to assume a vector lane should have 0xffffffff when the
// result is true when folding constants, so we use that value here to be
// consistent, even though that is not what a scalar compare would do.
SDValue expandVectorComparison(SDValue Op, SelectionDAG &DAG) {

Of course, the whole issue stops making sense when comparison results are v16i1 instead of v16i32. Do you know what breaks if that assumption is violated? Or alternatively, in what ways does clang assume this? Ideally this could be removed, but I don't want to inadvertently break even more things.

Edit: Is this really clang or is it LLVM passes? In the latter case, maybe TargetLowering::getBooleanContents is related?
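(For reference, that hook is driven by what the target declares in its TargetLowering constructor, roughly like this; a sketch of the mechanism, not necessarily what the Nyuzi backend currently sets:)

// Scalar booleans are 0/1; a "true" lane in a vector compare result is
// all-ones. This is what the generic DAG code consults via getBooleanContents.
setBooleanContents(ZeroOrOneBooleanContent);
setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);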

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

I don't think this behavior is affected by the change.

Clang seems to follow the behavior of GCC vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), which states:

Vector comparison is supported with standard comparison operators: ==, !=, <, <=, >, >=...The result of the comparison is a vector of the same width and number of elements as the comparison operands with a signed integral element type.

So, in the following code...

typedef int veci16_t __attribute__((ext_vector_type(16)));

veci16_t is_greater(veci16_t a, veci16_t b)
{
    return a > b;
}

The type of the expression "a > b" is veci16_t (v16i32). It emits LLVM IR:

define <16 x i32> @is_greater(<16 x i32> %a, <16 x i32> %b) local_unnamed_addr #0 {
entry:
  %cmp = icmp sgt <16 x i32> %a, %b
  %sext = sext <16 x i1> %cmp to <16 x i32>
  ret <16 x i32> %sext
}

At some point this gets turned into a SETCC node with the type v16i32. This is the case I'm trying to handle.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I see. I tried a few examples, though, and couldn't manage to produce a setcc node with type v16i32; I always got a normal (v16i1) setcc with a sign_extend. This is exactly what I would expect given the IR generated by clang. You don't happen to have an example at hand that produces it?

I'm asking both to understand, and because I'm awfully tempted to move the code from expandVectorComparison into LowerSIGN_EXTEND and remove the special case from LowerSETCC.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Yes, I think you're right that it would make sense to move that from LowerSETCC to LowerSIGN_EXTEND.

  • Now that v16i1 is a legal type, LowerSETCC should no longer be called with v16i32, which you observed.
  • The way I had implemented LowerSETCC ended up generating inefficient code anyway.

If I compile the following with the old compiler:

  %cmp = icmp sgt <16 x i32> %a, %b
  %sext = sext <16 x i1> %cmp to <16 x i32>

...I see the following transformation:

      t6: v16i1 = setcc t2, t4, setgt:ch
    t7: v16i32 = sign_extend t6

Promote integer result: t6: v16i1 = setcc t2, t4, setgt:ch

      t15: v16i32 = setcc t2, t4, setgt:ch
    t17: v16i32 = sign_extend_inreg t15, ValueType:ch:v16i1

The promotion of t6 occurs during type legalization. This no longer happens as mentioned above.

When the lowering for SETCC occurs, I generated a predicated move of -1 or 0 (the vector comparison is supposed to put 0xffffffff into each lane if the comparison is true):

          t23: i32 = llvm.nyuzi.__builtin_nyuzi_mask_cmpi_sgt Constant:i32<3779>, t2, t4
          t21: v16i32 = NyuziISD::SPLAT Constant:i32<-1>
          t19: v16i32 = NyuziISD::SPLAT Constant:i32<0>
        t25: v16i32 = llvm.nyuzi.__builtin_nyuzi_vector_mixi Constant:i32<3794>, t23, t21, t19

Then it also lowers the sign extension:

      t28: v16i32 = shl t25, t30
    t29: v16i32 = sra t28, t30

The sign extension is unnecessary, since I've already put the extended value into the registers in expandVectorComparison.

The final assembly is:

	cmpgt_i s0, v0, v1
	move v0, 0
	move_mask v0, s0, -1
	shl v0, v0, 31
	ashr v0, v0, 31

The last two instructions are the sign extension. If you did the check in LowerSIGN_EXTEND, it would avoid that.
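Something along these lines should work for the mask case in LowerSIGN_EXTEND (a sketch using a plain VSELECT rather than the intrinsic-based expansion, so treat it as an illustration only):

SDValue NyuziTargetLowering::LowerSIGN_EXTEND(SDValue Op,
                                              SelectionDAG &DAG) const {
  SDLoc DL(Op);
  SDValue Mask = Op.getOperand(0);
  if (Mask.getValueType() != MVT::v16i1)
    return SDValue(); // only the mask case needs special handling here

  // Lanes where the mask is true become 0xffffffff and false lanes become 0,
  // matching the GCC/clang vector-extension semantics for comparison results.
  // No shl/ashr pair is needed afterwards.
  EVT ResultVT = Op.getValueType(); // v16i32
  SDValue AllOnes = DAG.getConstant(-1, DL, ResultVT);
  SDValue Zero = DAG.getConstant(0, DL, ResultVT);
  return DAG.getNode(ISD::VSELECT, DL, ResultVT, Mask, AllOnes, Zero);
}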

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I came across an unrelated (edit: well, kinda) issue in setcc lowering, perhaps you have an idea how to do it better:

Unordered float comparisons are lowered to the complementary comparison, for example SETUGT -> NOT SETOLE. The code previously did negation by xor'ing with an i32 0xFFFF constant. Strangely, despite looking like an all-ones mask, this only happened for scalar comparisons (maybe precisely because of the issue I've encountered below?). Even more strangely, vector comparisons were mapped straight to the intrinsics without accounting for unordered comparisons (e.g., vector fcmp ugt -> vector cmpf_gt instruction).
Anyway, this 0xFFFF would then be further legalized to a constant pool load. This stopped being an option when v16i1 and vector comparisons became legal, as the code path was no longer only for scalars. I kept the CC shuffling as-is, and only distinguished between scalars and vectors for the negation. I chose an xor with i32 1 for the scalar case, because why not — it's simpler and saves a constant pool load (also, at the time I believed the old code used 0xFFFF because of vector comparisons). This led to test cases like this one going into an infinite legalization loop:

define i32 @cmpfgtu(float %a, float %b) { ; CHECK-LABEL: cmpfgtu:
  %cmp = fcmp ugt float %a, %b
  %ret = zext i1 %cmp to i32

  ; CHECK: cmple_f [[CMPRES:s[0-9]+]], s0, s1
  ; CHECK: xor s{{[0-9]+}}, [[CMPRES]]

  ret i32 %ret
}

Turns out that the DAG combiner can't see through constant pool loads, but will happily "simplify" xor (setcc $a, $b, setole), 1 into setcc $a, $b, setugt, which then gets legalized back into xor (setcc $a, $b, setole), 1 ad infinitum. Using the 0xFFFF constant for the scalar case (i.e., provoking a constant pool load) does work just as it did before, but this is not only quite inefficient, it's also very fragile — once the DAG combiner gets smarter about constant pools, everything will break. I also assume the same issue applies to vector comparisons (i.e., they too only work because the DAG combiner can't see through the constant pool load for the 0xFFFF mask).

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Strangely, despite looking like an all-ones mask, this only happened for scalar comparisons (maybe precisely because of the issue I've encountered below?)

The scalar comparison instruction returns 0xffff when the value is true. This is an artifact of the hardware implementation (scalar values are duplicated to all lanes). At some point, the backend ands this with 1 so it returns what the compiler expects, but I don't remember off the top of my head where or how that happens:

    10a8:	01 00 c0 c2 	cmpgt_f s0, s1, s0
    10ac:	00 04 80 00 	and s0, s0, 1

vector comparisons were mapped straight to the intrinsics without accounting for unordered comparisons

Yes. I didn't bother to implement the unordered comparison builtins/intrinsics because they didn't seem that useful. I had only implemented the scalar versions for completeness (the compiler always seems to emit ordered versions).

not only quite inefficient, it's also very fragile — once the DAG combiner gets smarter about constant pools, everything will break.

Yeah, definitely.

Is that transform happening here?

// fold !(x cc y) -> (x !cc y)

Even if not, I think I have an idea of why this may be happening.

Generally, as I understand, the DAG combining code should call into TargetLowering to determine if an operation is legal and skip the transform if it is not. The backend sets unordered operations as Custom:

setCondCodeAction(ISD::SETUGT, MVT::f32, Custom);

I would have assumed that a custom lowering would not be legal, but this is not the case:

  /// Return true if the specified condition code is legal on this target.
  bool isCondCodeLegal(ISD::CondCode CC, MVT VT) const {
    return
      getCondCodeAction(CC, VT) == Legal ||
      getCondCodeAction(CC, VT) == Custom;
  }

So I may be violating an assumption/unwritten rule in the LLVM backend that one shouldn't use a custom lowering to convert to other legal node types. If that is the case, I see two possible approaches.

The first is to use the Expand action for these condition codes. It appears the default expansion may do the right thing: if the action were Expand, the condition code would not be treated as legal, so the DAG combiner wouldn't convert the comparison back into an unordered operation.

The other approach--and I think I've seen other backends do this--is to create a custom target-specific SETCC SDNode type, which the DAG combiner won't recognize. This precludes further optimizations, but that's basically the point (since the custom action is deliberately creating a less optimal form, and the compiler is reasonably trying to fix that).
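Roughly, that second idea would look something like this (the node name NyuziISD::FCMP_INVERTED and the details are made up for illustration):

// In LowerSETCC, for one of the unordered condition codes:
SDValue NyuziTargetLowering::LowerSETCC(SDValue Op, SelectionDAG &DAG) const {
  SDLoc DL(Op);
  ISD::CondCode CC = cast<CondCodeSDNode>(Op.getOperand(2))->get();
  switch (CC) {
  default:
    return SDValue(); // other condition codes handled as before
  case ISD::SETUGT:
    // fcmp ugt a, b == !(fcmp ole a, b). Wrap it in a target-specific node
    // so the generic DAG combiner cannot fold the negation back into an
    // ISD::SETCC with an unordered condition code.
    return DAG.getNode(NyuziISD::FCMP_INVERTED, DL, Op.getValueType(),
                       Op.getOperand(0), Op.getOperand(1),
                       DAG.getCondCode(ISD::SETOLE));
  }
}

Instruction selection would then match that node to the ordered compare plus whatever negation sequence works best for this hardware.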

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I've tried expanding now. It mostly works (seto, setuo, and the don't-care CCs still need custom lowering, but those aren't as problematic) but the legalization creates two comparisons: an ordered one of the same type, and a NaN check. This would be a pretty serious regression, so I'm leaning towards the custom SDNodes (I've checked ARM and it does basically that, though I also noticed it is far more aggressive with introducing target-specific SDNodes in general).

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I do wonder why it's expanded like that, though. Is it just to be conservative, or is there some edge case where the lowering you implemented does not work?

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Good question. Floating point is subtle, and I'm certainly not an expert on it. It seems like the explicit NaN check should be unnecessary in this case. The only thing I can think of is that it could make a difference on architectures that support exceptions...mumble...

It looks like the code for that expansion is here (EDIT: tested and verified this code is called when expanding an unordered comparison):

https://github.com/llvm-mirror/llvm/blob/cc9614d291972c3dea88815ad4afa5d8b14bad0c/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp#L1579

I don't see any interesting explanatory comments. Here's the check-in that created that code (EDIT: I had the wrong check-in before; that one just modified the original). The code seems to have been created here:

llvm-mirror/llvm@7f04268

This might be a good question for the LLVM mailing list.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

Regardless, I've now implemented the custom SDNode approach. It passes all tests (after adjusting make_tests.py because unordered vector comparisons are now lowered correctly as a happy side effect) and the code is rather clean.

After that, the only remaining failure in the test suite was setcc_shuffle.ll. As v16i1 is now legal, it also needs shuffle support — I implemented this by extending the elements to i32, shuffling an i32 vector, and truncating the result back to a mask. This forced me to implement v16i1<->v16i32 zext and trunc, which are probably sensible to have anyway.
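For the trunc direction, the lowering can stay entirely in registers by reusing the native compare; a sketch (not the exact code I ended up with):

SDValue NyuziTargetLowering::LowerTRUNCATE(SDValue Op, SelectionDAG &DAG) const {
  SDLoc DL(Op);
  if (Op.getValueType() != MVT::v16i1)
    return SDValue(); // only handle v16i32 -> v16i1 here

  // Truncating to i1 keeps bit 0 of each lane, which maps nicely onto the
  // native vector compare: mask = ((x & 1) != 0).
  SDValue Src = Op.getOperand(0); // v16i32
  EVT SrcVT = Src.getValueType();
  SDValue Bit0 = DAG.getNode(ISD::AND, DL, SrcVT, Src,
                             DAG.getConstant(1, DL, SrcVT));
  return DAG.getSetCC(DL, MVT::v16i1, Bit0, DAG.getConstant(0, DL, SrcVT),
                      ISD::SETNE);
}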

There are still some things to be done, e.g., more general support for v16i1 constants, support for v16i1 immediates on arithmetic instructions, and many more tests. But I think I can see the light at the end of the tunnel 😄

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

That's great! Have you run any of the tests in the NyuziProcessor tree yet?

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I haven't had the chance to set that up yet. Besides, I'd like to ensure the basic parts of code generation work before throwing real software at it.

Speaking of which, while writing codegen tests I discovered another wrinkle. Or re-discovered, I think I encountered this while dealing with the v16i1 constant pool madness and just didn't understand it. When a vector is stored in memory, of course LLVM assumes that each vector element is separately addressable. For example, when inserting a few elements into a v16i1, the vector is stored on the stack and then the elements to be inserted are stored there (of course, the offsets are all wrong) before reading back the whole vector.

So while masks can be stored compactly in scalar registers, it doesn't seem possible to avoid a 16 x i8 representation in memory. I doubt there's a nice and efficient way to implement such loads and stores, but hopefully this should be rare anyway. I'm more troubled by the prospect of having to emulate those loads and stores at all. I took a cursory stab at both (1) doing it during legalization, and (2) selecting a pseudo-instruction and emitting custom code for it. Both are very ugly — (1) is rather complicated because loads and stores are rather complicated in the SelectionDAG, and (2) can't piggyback on the legalization code that lowers vector insertions or constants larger than 13 bits.

Unless you have a sudden insight that renders this moot, I might go with (2) and generate code roughly like this (for the case of a load, and surely containing some mistakes):

; s0 = address of <16 x i8>
  move s1, 0 ; result acc
  move s2, 1 ; mask for inserting into result
  shl s2, s2, 15 ; avoid constant pool for 0x8000
loop:
  load_u8 s3, (s0) ; get ext. i1
  and s3, s3, 1 ; upper bits are undef
  xor s4, s2, -1
  and s1, s1, s4 ; zero out current element of mask
  bz s3, elem_zero ; if inserting false, leave at zero
  or s1, s1, s2 ; otherwise, insert a one
elem_zero:
  add s0, s0, 1 ; advance element ptr
  shr s2, s2, 1 ; adjust mask
  bnz s2, loop ; 16 iterations

(Edit: Typos in the last three instructions.)

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

When a vector is stored in memory, of course LLVM assumes that each vector element is separately addressable.... it doesn't seem possible to avoid a 16 x i8 representation in memory.

It seems like X86 AVX code would run into a similar problem. They do seem to have a pattern to store a packed mask register directly to memory:

  def : Pat<(store (i16 (bitconvert (v16i1 VK16:$src))), addr:$dst),
            (KMOVWmk addr:$dst, VK16:$src)>;

For example, when inserting a few elements into a v16i1, the vector is stored on the stack and then the elements to be inserted are stored there (of course, the offsets are all wrong) before reading back the whole vector.

Can you elaborate on which instruction expands to this? I assume this is an INSERT_VECTOR_ELT node. If so, would it be possible to lower that directly to set of bit manipulations?

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

It seems like X86 AVX code would run into a similar problem. They do seem to have a pattern to store a packed mask register directly to memory:

This pattern matches an i16 store (where the i16 is bitcast'd from a v16i1), not an i1 vector store. So it's just an optimization that avoids a K register -> GPR move on certain kinds of integer stores. This somewhat matches my observation that the __mmask16 type seems to be represented as i16 (see godbolt, select clang 4.0 and add -emit-llvm to the flags to get IR). Perhaps AVX-512 completely punts on loading and storing v16i1? I guess that would be good news, because it would mean that's possible and reasonable, but I'm not quite sure how they achieve this...

Can you elaborate on which instruction expands to this? I assume this is an INSERT_VECTOR_ELT node. If so, would it be possible to lower that directly to set of bit manipulations?

No, BUILD_VECTOR (edit: my testcase was inserting into a zeroinitializer, so the DAG combiner turned it all into a BUILD_VECTOR). I think it would be possible (reasonably easy, even, if code quality is not that important) to lower this better. But this was just one example. LLVM's desire to have addressable vector elements also expresses itself in the constant pool problems we had, in the fact that getelementptr can give you an i1* from a <16 x i1>*, and probably other examples.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

How do you get getelementptr to index into vectors? llc freaks out when I try it (with the x86 backend):

define i32 @compare(<16 x i1>* %a) {
entry:
    %b = getelementptr i1, <16 x i1>* %a, i64 0

error: explicit pointee type doesn't match operand's pointee type
    %b = getelementptr i1, <16 x i1>* %a, i64 0

I found this bit:

http://llvm.org/docs/GetElementPtr.html#can-gep-index-into-vector-elements

If we can remove getelementptr from the mix, then it seems like this gets easier. It looks like the default lowerings for the vector building instructions assume memory addressable elements. Since hardware doesn't support those, the backend should turn them into operations that are supported (bit manipulations in register). Maybe I'm oversimplifying this or missing something.
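For example, extracting a single element from a mask could stay in registers along these lines (a rough sketch reusing the hypothetical mask-to-int cast node, not actual backend code):

SDValue NyuziTargetLowering::LowerEXTRACT_VECTOR_ELT(SDValue Op,
                                                     SelectionDAG &DAG) const {
  SDLoc DL(Op);
  if (Op.getOperand(0).getValueType() != MVT::v16i1)
    return SDValue(); // ordinary vectors use getlane as before

  // Treat the mask as an i32 in the same scalar register, shift the wanted
  // bit down to position 0 and isolate it. Assumes lane i lives in bit i;
  // the shift amount would change with a different mask convention.
  SDValue AsInt = DAG.getNode(NyuziISD::MASK_TO_INT, DL, MVT::i32,
                              Op.getOperand(0)); // placeholder cast node
  SDValue Amt = DAG.getZExtOrTrunc(Op.getOperand(1), DL, MVT::i32);
  SDValue Bit = DAG.getNode(ISD::AND, DL, MVT::i32,
                            DAG.getNode(ISD::SRL, DL, MVT::i32, AsInt, Amt),
                            DAG.getConstant(1, DL, MVT::i32));
  return DAG.getZExtOrTrunc(Bit, DL, Op.getValueType());
}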

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

This GEP works the same with and without AVX512:

define i1* @getptr(<16 x i1>*) #0 {
entry:
  %b = getelementptr <16 x i1>, <16 x i1>* %0, i32 0, i32 5
  ret i1* %b
}

It emits leaq 5(%rdi), %rax, i.e., it treats the i1 elements as byte addressable (as they almost certainly are without AVX-512). In contrast, this function:

define i1 @extractval(<16 x i1>*) #0 {
  %mask = load <16 x i1>, <16 x i1>* %0
  %b = extractelement <16 x i1> %mask, i32 5
  ret i1 %b
}

compiles to this:

	kmovw	(%rdi), %k0
	kshiftlw	$10, %k0, %k0
	kshiftrw	$15, %k0, %k0

If I understood the semantics of kmovw correctly, it will read/write at most 64 bits (or, more likely, 16 bits). This is inconsistent with the first function above, i.e., one of these functions seems to be miscompiled. So I'm guessing the X86 backend has encountered the same problem and decided to ignore it in the case of getelementptr. If they do it, I guess Nyuzi can afford to do that too?

(It's probably a complication for what I am working towards — vectorizing OpenCL style SPMD programs — but I'll burn that bridge when I come to it.)

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

So between the AVX-512 backend punting on it, the warning from the docs you quoted, and the fact that all other operations that were or are problematic are easy to lower, I am more than happy to ignore v16i1 GEP and finish up the current approach :)

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Okay, that's simpler. :) The other thing is that I wouldn't expect any reasonable front-end to emit a GEP on a v16i1.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I have one more question about the diff you posted earlier: What was the intent behind these patterns?

def : Pat<(add v16i1:$src1, v16i1:$src2), (ADDISSS v16i1:$src1, v16i1:$src2)>;
def : Pat<(sra v16i1:$src1, v16i1:$src2), (SRASSS v16i1:$src1, v16i1:$src2)>;
def : Pat<(srl v16i1:$src1, v16i1:$src2), (SRLSSS v16i1:$src1, v16i1:$src2)>;
def : Pat<(shl v16i1:$src1, v16i1:$src2), (SLLSSS v16i1:$src1, v16i1:$src2)>;

// Mask comparisons
def : Pat<(i32 (setcc v16i1:$src1, v16i1:$src2, SETGT)),
  (SGTSISS v16i1:$src1, v16i1:$src2)>;
/* ... */
def : Pat<(i32 (setcc v16i1:$src1, v16i1:$src2, SETULE)),
  (SLEUISS v16i1:$src1, v16i1:$src2)>;

Bitwise operations, sure, but neither comparisons nor arithmetic seem to make sense to me (and neither do bit shifts with those operand types).

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Those don't make sense to me either :)

I'll explain the thought process that I believe ended up with me producing that code. Given the following C code:

    int mask = __builtin_nyuzi_mask_cmpi_sle(b, a)
            & __builtin_nyuzi_mask_cmpi_sle(c, a);

The compiler should generate IR something like this with the new implementation (the builtin functions create bitcast/zext combinations to convert the results into ints, as discussed earlier)

    %0 = icmp sle <16 x i32> %b, %a
    %1 = bitcast <16 x i1> %0 to i16
    %2 = zext i16 %1 to i32

    %3 = icmp sle <16 x i32> %c, %a
    %4 = bitcast <16 x i1> %3 to i16
    %5 = zext i16 %4 to i32

    %6 = and i32 %2, %5

During some optimization stage, it figures out that it can optimize away the redundant bitcast/zext combos:

    %0 = icmp sle <16 x i32> %b, %a
    %1 = icmp sle <16 x i32> %c, %a
    %2 = and <16 x i1> %0, %1
    %3 = bitcast <16 x i1> %2 to i16
    %4 = zext i16 %3 to i32

When I was originally implementing the backend with v16i1 masks, I first assumed I only needed to implement arithmetic on i32 types, since the front end always converted them. However, it asserted when a transform like the one above occurred because I hadn't implemented AND on v16i1.

"But," I reasoned, "I can do arbitrary arithmetic on masks." For example, during rasterization, I have something like the following:

   int trivialAcceptMask = __builtin_nyuzi_mask_cmpi_sle(acceptEdgeValue1, veci16_t(0))
            & __builtin_nyuzi_mask_cmpi_sle(acceptEdgeValue2, veci16_t(0))
            & __builtin_nyuzi_mask_cmpi_sle(acceptEdgeValue3, veci16_t(0));
    while (trivialAcceptMask)
    {
        const int index = __builtin_clz(trivialAcceptMask) - 16;
        trivialAcceptMask &= ~(0x8000 >> index);
        ...

I didn't think very hard about this; I just assumed I needed to support all arithmetic operations on v16i1. As I look at it now, that doesn't make much sense. It seems unlikely the compiler would shuffle the bitcast/zext across operations that are not bitwise. At least I hope not. That would break a lot of stuff. :)

So it's probably sufficient to just implement the bitwise logical operations on v16i1: and, or, not.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

Good, so I removed those patterns.

Status update: I've added all the llvm-lit tests I wanted, and they all pass. I've moved on to running the tests in the NyuziProcessor repo, and I'm encountering failures on various tests in tests/compiler that print floats. I've minimized it down to this IR:

define i1 @bar(float %x) {
entry:
  %x.is.pos = fcmp ogt float %x, 0.000000e+00
  br i1 %x.is.pos, label %return-true, label %return-false

return-true:
  ret i1 true

return-false:
  ret i1 false
}

This generates:

# BB#0:                                 # %entry
	load_32 s1, .LCPI0_0
	cmpgt_f s0, s0, s1
	cmpne_i s0, s0, -1
	bnz s0, .LBB0_2
# BB#1:                                 # %return-true
	move s0, 1
	ret
.LBB0_2:                                # %return-false
	move s0, 0
	ret

The cmpne_i is wrong. Looks like an attempt to invert the comparison result (integer comparison works and branches on cmplt_i s0, 1, i.e., seems to negate correctly). Simplifying the IR to a select on the comparison result generates correct code, so I assume it's an interaction of float comparisons and (negated) branching. Does this ring any bells? (I'll keep digging myself, but won't have much time in the next few days.)

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

I can't think of an obvious reason why that would happen off the top of my head, unfortunately. I don't think I ever explicitly compared to 0xffffffff in the lowering code.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

Okay, it's my own damn fault. The comparison with -1 is what the DAG combiner "optimizes" the negation of the float comparison into (a previous DAG combiner pass turns the ogt into an ult). It compares with -1 instead of 0xffff because I used DAG.getNOT for the negation, which is incorrect (and quite obviously so, in hindsight).

Why there even is a negated comparison when ogt is natively supported is a question for another day. Meanwhile, I'm down to 4/45 failures in test/compiler (and all other test suites passing AFAICT). The reproduction from above now generates this horrible code by the way:

	load_32 s1, .LCPI0_0 # float 0
	cmpgt_f s0, s0, s1
	load_32 s1, .LCPI0_1 # 0xffff
	cmpne_i s0, s0, s1
	bnz s0, .LBB0_2

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

At least one of the remaining test failures is related to how masks are built. I can't pinpoint anymore why exactly I thought this (I believe I saw something to that effect in the existing lowering code), but I was working under the assumption that lane index 0 is the most significant bit and lane index 15 the least significant bit. But in the face of test/compiler/block_store.c failing, I am beginning to suspect it's the other way around. Can you please clarify?

Edit: I think I found the place that led me to this belief:

// (VECTOR, VAL, IDX)
// Convert to a move with a mask (0x8000 >> IDX) and a splatted scalar operand.
SDValue NyuziTargetLowering::LowerINSERT_VECTOR_ELT(SDValue Op,
                                                    SelectionDAG &DAG) const {
  MVT VT = Op.getValueType().getSimpleVT();
  SDLoc DL(Op);
  // This could also be (1 << (15 - index)), which avoids the load of 0x8000
  // but requires more operations.
  SDValue Mask =
      DAG.getNode(ISD::SRL, DL, MVT::i32, DAG.getConstant(0x8000, DL, MVT::i32),
                  Op.getOperand(2));
  SDValue Splat = DAG.getNode(NyuziISD::SPLAT, DL, VT, Op.getOperand(1));
  return DAG.getNode(
      ISD::INTRINSIC_WO_CHAIN, DL, VT,
      DAG.getConstant(Intrinsic::nyuzi_vector_mixi, DL, MVT::i32), Mask, Splat,
      Op.getOperand(0));
}

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Yes, this is confusing and I usually end up reversing it a few times every time I work on this. I'll try to hopefully not muddle things further. :)

EDIT: Ugh... I totally muddled things, should be correct now.

LLVM uses the same vector lane numbering convention as Nyuzi's instruction set (I originally numbered it backwards, but changed the instruction set later to avoid having to convert back and forth all over the place in the backend).

For a block vector store, lane 0 is the lowest address stored and lane 15 is the highest address stored. This corresponds to the lane numbering convention used by INSERT_VECTOR_ELT/EXTRACT_VECTOR_ELT.

However, Nyuzi maps its lane 15 to the least significant bit in the mask and lane 0 to the most significant.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

For example, assume the following code:

#include <stdio.h>

typedef int veci16_t __attribute__((ext_vector_type(16)));

veci16_t value;

int main(void)
{
    value[0] = 99;
    printf("%d\n", ((volatile int*) &value)[0]);
}

This generates the following LLVM IR (using the stock x86 clang compiler on MacOS):

@value = common global <16 x i32> zeroinitializer, align 16
@.str = private unnamed_addr constant [7 x i8] c"%d\0A\00", align 1

define i32 @main() #0 {
  %1 = load <16 x i32>, <16 x i32>* @value, align 16
  %2 = insertelement <16 x i32> %1, i32 99, i32 0
  store <16 x i32> %2, <16 x i32>* @value, align 16
  %3 = load volatile i32, i32* getelementptr inbounds (<16 x i32>, <16 x i32>* @value, i64 0, i64 0), align 16, !tbaa !2
  %4 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @.str, i64 0, i64 0), i32 %3)
  ret i32 0
}

Note that insertelement uses lane 0. It prints the following output when I run on my x86 desktop:

99

So lane 0 corresponds to the lowest address.

If I compile using the Nyuzi compiler, the write to the vector lane generates:

.LCPI0_0:
	.long	value
.LCPI0_1:
	.long	32768                   # 0x8000
main:
	load_32 s0, .LCPI0_0    # Load pointer to vector value
	load_v v0, (s0)
	load_32 s1, .LCPI0_1      # Load mask 
	move_mask v0, s1, 99   # Update lane 0 (mask bit 15)
	store_v v0, (s0)

In this case, the 15th bit (MSB) corresponds to the lowest address.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Likewise, the getlane instruction uses the same lane index as LLVM/C:

int foo(veci16_t value) { return value[0]; }

Compiles to:

foo:
	getlane s0, v0, 0
	ret

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I've been tearing my hair out tracking down the last test failure (sorted-insert.cpp) and finally reduced it to this:

#include <stdint.h>
#include <stdio.h>

typedef float vecf16_t __attribute__((ext_vector_type(16)));

int main()
{
	vecf16_t items = 1000000000.0;
	float value = 42.0;
	items = __builtin_nyuzi_vector_mixf(0x8000, vecf16_t(value), items);
	printf("%g\n", items[0]); // CHECK: 42.0
	return 0;
}

This passes with the old toolchain, and also in the new toolchain with -O0, but with -O1 and higher it gets optimized to printf("%g\n", 1000000000.0). I suspect this is because masks are "reversed" from LLVM's POV. The unoptimized IR has this as the vector select argument: bitcast (<1 x i16> <i16 -32768> to <16 x i1>). This corresponds to <i1 false, ..., i1 false, i1 true>, i.e., it's exactly the opposite of what the code is intended to do and what it does do if 0x8000 gets passed to the hardware without optimizations.

I really don't see how to work around this. Perhaps doing a bit-reverse when converting between mask-as-int and mask-as-i1-vector? That would be a serious performance regression, though — not to mention that it would break C code that deals with masks numerically, including the very test case from which the above is derived.

from nyuzitoolchain.

jbush001 avatar jbush001 commented on August 20, 2024

Okay, after my last comment it crossed my mind briefly that the order of mask bits might be reversed, but I convinced myself (incorrectly, apparently), that it was the same.

So the right fix is probably to change the instruction set to reverse the order of the mask bits (that's the beauty of a custom instruction set :)). This would basically be a follow-on to the change I made previously that reversed the lane numbering to match LLVM's convention. Does that sound right?

I can take a whack at implementing that on the hardware/emulator side.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

That does feel like the Right Way.

I can take a whack at implementing that on the hardware/emulator side.

Please do. I'll tackle the aforementioned crappy code generation for branch-on-ogt and other cleanup in preparation for rebasing on top of that.

from nyuzitoolchain.

hanna-kruppe avatar hanna-kruppe commented on August 20, 2024

I still have no idea why branching on an fcmp generates such terrible code, but I did notice that it was basically the same before my changes. The only difference is that it used to xor the comparison result with 0xffff instead of using cmpne_i. I'll just leave it be, but I added an XFAIL test for it.

from nyuzitoolchain.
