andreas-abel / uiCA
uops.info Code Analyzer
License: GNU Affero General Public License v3.0
For the following testcase (short link: https://bit.ly/3v1QfWV )
loop:
add rax, [rsi]
adc rax, [rsi]
dec rcx
jnz loop
There's the expected critical path over the RAX updates, plus an unexpected two-cycle critical path that goes through the carry flag:
I'm surprised to see the backward edge from CF to the first instruction there.
On Sandy Bridge (and Ivy Bridge), 256-bit AVX loads and stores had half the throughput of their 128-bit SSE counterparts, so ideally uiCA should show that the following loop runs at 2 cycles per iteration:
loop:
vmovaps ymm0, [rsi]
vmovaps ymm0, [rsi]
dec ecx
jnz loop
(I guess this might not be straightforward to model, since it's not the same as a load uop occupying port 2/3 for two cycles: on the second cycle, the port can still perform the store-address part of another store uop.)
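One way the guessed behavior could be modeled is sketched below; `SnbLoadPort`, the two-cycle occupancy, and the dispatch rules are my assumptions for illustration, not uiCA's actual model.

```python
# Toy sketch of the hypothesized SNB behavior: a 256-bit load keeps the
# load half of port 2/3 busy for two cycles, but the AGU can still accept
# a store-address uop in the second cycle. All numbers are assumptions.
class SnbLoadPort:
    def __init__(self):
        self.load_busy_until = 0  # cycle up to which the load half is busy

    def try_dispatch(self, kind: str, cycle: int) -> bool:
        if kind == "load256":
            if cycle < self.load_busy_until:
                return False
            self.load_busy_until = cycle + 2  # occupies the load half for 2 cycles
            return True
        if kind == "store_address":
            return True  # AGU stays available while a 256-bit load drains
        return False

p = SnbLoadPort()
print(p.try_dispatch("load256", 0))        # True
print(p.try_dispatch("load256", 1))        # False: second load must wait
print(p.try_dispatch("store_address", 1))  # True: STA can still go
print(p.try_dispatch("load256", 2))        # True
```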
It would be great if uiCA were supported as a tool in godbolt similar to llvm-mca. In compiler-explorer/compiler-explorer#2843, RubenRBS indicated that they would be willing to review a PR adding that feature, and indicated similar PRs can serve as a model.
I may be able to get around to doing it at some point, but thought I'd raise this as an issue here in case uiCA authors are interested in pursuing it.
I was exploring variants of a loop and found one where uiCA.py throws an exception even though nothing out of the ordinary seems to happen in the loop.
Short link to uica.uops.info: https://bit.ly/3Pga31D
For reference, the loop at the above link that shows the issue on Skylake through Cascade Lake:
loop:
vmovaps ymm0, [rsi]
vmovaps ymm1, [rsi+32]
vorps ymm0, ymm0, [rsi+64]
vorps ymm1, ymm1, [rsi+96]
vorps ymm1, ymm1, [rsi+128]
vorps ymm1, ymm1, [rsi+160]
vorps ymm0, ymm0, [rsi+192]
vmovaps xmm2, [rsi+224]
vpor xmm2, xmm2, [rsi+240]
add rsi, 256
vorps ymm0, ymm0, ymm1
vextractf128 xmm1, ymm0, 1
vpor xmm0, xmm0, xmm1
vpor xmm0, xmm0, xmm2
vpcmpeqb xmm0, xmm0, xmm7
vpmovmskb eax, xmm0
test eax, eax
jz out
dec ecx
jnz loop
out:
Exception trace:
Traceback (most recent call last):
File "/uiCA/uiCA.py", line 2448, in <module>
main()
File "/uiCA/uiCA.py", line 2441, in main
TP = runSimulation(disas, uArchConfig, int(args.alignmentOffset), args.initPolicy, args.noMicroFusion, args.noMacroFusion, args.simpleFrontEnd,
File "/uiCA/uiCA.py", line 2294, in runSimulation
frontEnd.cycle(clock)
File "/uiCA/uiCA.py", line 572, in cycle
newInstrIUops = self.DSB.cycle()
File "/uiCA/uiCA.py", line 743, in cycle
DSBBlock = self.DSBBlockQueue[0]
IndexError: deque index out of range
Thank you so much for uiCA!
.LBB0_5: # %L50
# =>This Inner Loop Header: Depth=1
vmovups zmm1, zmmword ptr [r11 + 4*rdi]
vmovups zmm2, zmmword ptr [r11 + 4*rdi + 64]
vmovups zmm3, zmmword ptr [r11 + 4*rdi + 128]
vmovups zmm4, zmmword ptr [r11 + 4*rdi + 192]
kxnorw k1, k0, k0
vxorps xmm5, xmm5, xmm5
vgatherdps zmm5 {k1}, zmmword ptr [rax + 4*zmm1]
kxnorw k1, k0, k0
vxorps xmm1, xmm1, xmm1
vgatherdps zmm1 {k1}, zmmword ptr [rax + 4*zmm2]
vfmsub132ps zmm5, zmm0, zmmword ptr [rdx + 4*rdi] # zmm5 = (zmm5 * mem) - zmm0
vfmadd132ps zmm1, zmm5, zmmword ptr [rdx + 4*rdi + 64] # zmm1 = (zmm1 * mem) + zmm5
kxnorw k1, k0, k0
vxorps xmm2, xmm2, xmm2
vgatherdps zmm2 {k1}, zmmword ptr [rax + 4*zmm3]
kxnorw k1, k0, k0
vxorps xmm0, xmm0, xmm0
vgatherdps zmm0 {k1}, zmmword ptr [rax + 4*zmm4]
vfmadd132ps zmm2, zmm1, zmmword ptr [rdx + 4*rdi + 128] # zmm2 = (zmm2 * mem) + zmm1
vfnmsub132ps zmm0, zmm2, zmmword ptr [rdx + 4*rdi + 192] # zmm0 = -(zmm0 * mem) - zmm2
add rdi, 64
add rsi, -4
jne .LBB0_5
Results in
/tmp/ee3af94421ca41f38d2808205839a8bc.asm: Assembler messages:
/tmp/ee3af94421ca41f38d2808205839a8bc.asm:10: Error: operand size mismatch for `vgatherdps'
/tmp/ee3af94421ca41f38d2808205839a8bc.asm:13: Error: operand size mismatch for `vgatherdps'
/tmp/ee3af94421ca41f38d2808205839a8bc.asm:18: Error: operand size mismatch for `vgatherdps'
/tmp/ee3af94421ca41f38d2808205839a8bc.asm:21: Error: operand size mismatch for `vgatherdps'
Switching to AT&T syntax yields
.LBB0_5: # %L50
# =>This Inner Loop Header: Depth=1
vmovups (%r11,%rdi,4), %zmm1
vmovups 64(%r11,%rdi,4), %zmm2
vmovups 128(%r11,%rdi,4), %zmm3
vmovups 192(%r11,%rdi,4), %zmm4
kxnorw %k0, %k0, %k1
vxorps %xmm5, %xmm5, %xmm5
vgatherdps (%rax,%zmm1,4), %zmm5 {%k1}
kxnorw %k0, %k0, %k1
vxorps %xmm1, %xmm1, %xmm1
vgatherdps (%rax,%zmm2,4), %zmm1 {%k1}
vfmsub132ps (%rdx,%rdi,4), %zmm0, %zmm5 # zmm5 = (zmm5 * mem) - zmm0
vfmadd132ps 64(%rdx,%rdi,4), %zmm5, %zmm1 # zmm1 = (zmm1 * mem) + zmm5
kxnorw %k0, %k0, %k1
vxorps %xmm2, %xmm2, %xmm2
vgatherdps (%rax,%zmm3,4), %zmm2 {%k1}
kxnorw %k0, %k0, %k1
vxorps %xmm0, %xmm0, %xmm0
vgatherdps (%rax,%zmm4,4), %zmm0 {%k1}
vfmadd132ps 128(%rdx,%rdi,4), %zmm1, %zmm2 # zmm2 = (zmm2 * mem) + zmm1
vfnmsub132ps 192(%rdx,%rdi,4), %zmm2, %zmm0 # zmm0 = -(zmm0 * mem) - zmm2
addq $64, %rdi
addq $-4, %rsi
jne .LBB0_5
Adding .intel_syntax to the Intel version and running llvm-mca locally also works (but I prefer uiCA over llvm-mca).
It would be nice to have support for ICL with move elimination (i.e. early microcode) to be able to compare it with the current microcode.
Perhaps such support could be added in a systematic way for all microarchitectures that currently have move elimination disabled.
Meanwhile, I have successfully submitted a FreeBSD port of uiCA. It's a bit of a mess but seems to work fine.
Integer pcmpeq* with source = dest sets the destination to all-ones without a dependency on the source (but still occupies an execution unit). For example, the following loop runs at one cycle per iteration on Skylake, while uiCA predicts two:
loop:
vpcmpeqd xmm0, xmm0, xmm0
vpor xmm0, xmm0, xmm0
dec ecx
jnz loop
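The effect of recognizing this dependency-breaking idiom can be sketched with a toy loop-carried latency model (my own framing; the 1-cycle latencies are from uops.info, everything else is an assumption):

```python
# Toy model: if vpcmpeqd with src == dst is treated as dependency-breaking,
# the edge from the previous iteration's vpor back into vpcmpeqd disappears
# and the loop becomes throughput-bound instead of latency-bound.
def loop_carried_latency(dep_breaking_recognized: bool) -> int:
    vpcmpeqd_lat = 1
    vpor_lat = 1
    if dep_breaking_recognized:
        # vpcmpeqd ignores its sources: no loop-carried dependency chain.
        return 0
    # Otherwise the chain is vpor -> vpcmpeqd -> vpor of the next iteration.
    return vpcmpeqd_lat + vpor_lat

print(loop_carried_latency(False))  # 2, matching uiCA's prediction
print(loop_carried_latency(True))   # 0: loop is throughput-bound (~1 c/iter observed)
```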
I'm currently working on packaging uiCA for the FreeBSD ports collection.
It would be great if the project came with a standard distutils setup for packaging.
In our use case, we have no network access during build and packaging, but all submodules will have been checked out at extract time (the XML file will have been fetched, too).
kmovd eax, k1
kmovd k1, eax
kmovd edx, k0
kmovd k0, edx
On uica.uops.info: https://bit.ly/3Wn5k0E
Expected: the k0 and k1 versions work the same.
For example,
.intel_syntax noprefix
L64:
vmovapd xmm4, xmm1
vmulsd xmm1, xmm2, xmm1
vsubsd xmm1, xmm1, xmm3
vmulsd xmm3, xmm1, qword ptr [rax + 8*rdx]
vaddsd xmm0, xmm0, xmm3
inc rdx
vmovapd xmm3, xmm4
cmp rcx, rdx
jne L64
as -msyntax=intel cheb2.asm -o cheb2.o
uiCA.py cheb2.o -arch TGL -trace cheb2.html
Had a few questions:
black is fairly common

This is a case where uiCA predictions for SKL seem to be pretty far off. Pretty much all tools I know of get this one wrong, despite it only using reg-reg operations.
Test case: https://bit.ly/3jlvOOJ
uiCA predicts 4c/iteration throughput; the actual observed throughput on a Skylake laptop (i7-6560U) is 6c/iteration. If you take out one instruction on the non-PSADBW critical path (say, comment out the paddd xmm2, xmm3), this does run at 4c/iteration on real HW, and uiCA agrees.
The actual computation here is nonsense, I was just trying to come up with a small repro.
The case this sets up is two vector instructions with different latencies on the same port (p5 here) that would have to finish in the same cycle. They can't: as far as I know, the vector register file and bypass network can accept only one result per port per cycle. I don't know what the exact criteria are, nor why the penalty here is two cycles rather than one. I also don't know how often this occurs in practice, but I have hit cases in the past where it seems to be a factor.
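A toy model of the hypothesized constraint, assuming (as the text does) that each port's writeback path accepts one result per cycle; the exact criteria and the size of the penalty remain unknown:

```python
# Two uops dispatched to the same port conflict if they would complete
# in the same cycle, since the port can write back only one result per
# cycle (assumption based on the description above).
def writeback_conflict(dispatch_a: int, latency_a: int,
                       dispatch_b: int, latency_b: int) -> bool:
    return dispatch_a + latency_a == dispatch_b + latency_b

# A 3-cycle uop dispatched at cycle 0 and a 1-cycle uop dispatched at
# cycle 2 would both finish at cycle 3, so one must be delayed:
print(writeback_conflict(0, 3, 2, 1))  # True
print(writeback_conflict(0, 3, 1, 1))  # False: they finish in different cycles
```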
First off, congrats on the announcement and initial release! I'm excited to start using it in my work.
I tried using the online version with this block: https://gist.github.com/pervognsen/5e081b19720a8d6954e28f187b50beff
It works on Skylake but when selecting some other uarches like Ice Lake I get the following Python error message:
Traceback (most recent call last):
File "/uiCA/uiCA.py", line 2413, in <module>
main()
File "/uiCA/uiCA.py", line 2363, in main
TP = float(uopsForRelRound[-1][lastApplicableInstr][-1].retired
IndexError: list index out of range
(Note: Tested whatever version is on https://uica.uops.info, none locally)
Instructions that use rip-relative addressing seem to be treated as if they had 1 cycle of latency.
e.g.
vpmulhw ymm12, ymm12, ymmword ptr [rip+0x14b]
vpmulhw ymm12, ymm12, ymmword ptr [rax+0x14b]
On uica.uops.info: https://bit.ly/3H3YsAR
Expected: the rip and rax versions have the same distance between the D and E markers (5 cycles according to uops.info).
Actual:
I was on the fence about reporting this; in the end I figured I should, because the inaccuracy is significant, and I suspect this might be an implementation bug in the Python script rather than a gap in the reverse engineering of the port assignment algorithm.
Consider the following testcase (short link: https://bit.ly/3pu17dy)
1:
movzbl (%rdx),%eax
add $0x2,%rdx
add %ecx,%eax
movzbl -0x1(%rdx),%ecx
add %eax,%edi
add %eax,%ecx
add %ecx,%edi
cmp %rsi,%rdx
jne 1b
uiCA models this as 3 cycles per iteration for SNB, with the second instruction (add $0x2,%rdx) always going to port 5 and getting delayed by 1 cycle because port 5 is already occupied by the fused cmp-jne from the previous issue group. The graph indicates that port 5 gets assigned 1.5x more instructions than ports 0 and 1.
In reality I'm seeing this loop run close to 2 cycles per iteration on SNB, and port assignment is quite even:
Performance counter stats for './main':
9,250,721 uops_dispatched_port.port_0
9,620,564 uops_dispatched_port.port_1
5,053,695 uops_dispatched_port.port_2
5,056,202 uops_dispatched_port.port_3
22,616 uops_dispatched_port.port_4
11,392,109 uops_dispatched_port.port_5
11,926,865 cycles
45,291,707 instructions # 3.80 insn per cycle
(the above is perf stat for the whole program, hence some overhead, e.g. on port 4)
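A quick sanity check of the counters above (my own arithmetic; since the counts cover the whole program, the true in-loop figure is somewhat lower):

```python
# Back-of-the-envelope from the perf output: the loop body has
# 9 instructions, so iterations ~ instructions / 9.
instructions = 45_291_707
cycles = 11_926_865
iterations = instructions / 9
print(round(cycles / iterations, 2))  # 2.37 cycles/iteration, vs. uiCA's 3.0
```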
Exchanging the second and third instruction in the loop slows down the real execution, making port assignment less even, but improves the simulated result, making it more even.
It seems that in this line (line 9 in 6bd6e54), -f is missing, resulting in this error message:
error: the following file has local modifications:
XED-to-XML
(use --cached to keep the file, or -f to force removal)
fatal: Submodule work tree 'XED-to-XML' contains local modifications; use '-f' to discard them
(The corresponding command in the Windows script contains this flag.)
While this doesn't seem to affect the initial installation, it leaves the repo in a broken state for re-running the script for updates.
The fix for issue #15 did not include a correction for 256-bit stores: like loads, they have half the throughput of their 128-bit SSE counterparts, and the following loop runs at two cycles per iteration:
loop:
vmovaps [rdi], ymm0
dec ecx
jnz loop
At the moment uiCA removes p0 as a possible execution port on HSW/SKL for a branch early on:
Lines 95 to 96 in 9cbbe93
I'd like to suggest that this should be done only for the branch that terminates the input basic block. If the user provides a snippet that contains extra branches in the middle, as I did in issue #14 earlier, they are assumed to be never taken, and hence modeling them as occupying either port 0 or port 6 would be more accurate.
Edit: to be clear, I understand that with extra branches the input snippet is not a basic block (it's potentially an "extended basic block" in compiler-developer speak, if jumping into it after the first instruction is impossible).
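The suggestion could be sketched as follows (hypothetical, not uiCA's actual code; the port restriction reflects that on HSW/SKL predicted-taken branches execute only on port 6, while predicted not-taken branches can use port 0 or port 6):

```python
# Restrict a branch to port 6 only if it terminates the input block;
# branches in the middle are assumed never taken and keep both ports.
def candidate_ports(is_block_terminating_branch: bool) -> list:
    if is_block_terminating_branch:
        return ["6"]        # predicted taken: port 6 only
    return ["0", "6"]       # predicted not taken: port 0 or port 6

print(candidate_ports(True))   # ['6']
print(candidate_ports(False))  # ['0', '6']
```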
I used pyright and it reported the following errors
The line rd /s /q .git\modules\* in setup.cmd does not work, because rd does not support wildcard characters.
Here is an example with 27 FMA instructions, where uiCA runs 13 of them on port 0 and 14 on port 5. This leads to an estimated throughput of 14 cycles (the maximum of the two).
LLVM-MCA and OSACA predict ports 0 and 5 will be used evenly.
Naively, I'd expect the split to be even when averaging across many iterations of the loop. However, I'm unfamiliar with how these pipeline models work internally, so at the very least the differing predictions are interesting.
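The gap between the two views can be sketched numerically (my own framing, not any tool's actual algorithm): within one iteration, per-port uop counts are integers, so 27 uops on two ports cannot split evenly, while the amortized bound over many iterations is fractional.

```python
def per_iteration_bound(n_uops: int, n_ports: int) -> int:
    # Within a single iteration, the busiest port determines the estimate.
    return -(-n_uops // n_ports)  # ceiling division

def amortized_bound(n_uops: int, n_ports: int) -> float:
    # Averaged over many iterations, fractional occupancy is possible.
    return n_uops / n_ports

print(per_iteration_bound(27, 2))  # 14 (13 on one port, 14 on the other)
print(amortized_bound(27, 2))      # 13.5
```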
Here is some very similar assembly (the difference being that the address calculations were fused with the loads), where all three tools now predict the work to be evenly split across ports 0 and 5.
Something else interesting is that LLVM-MCA claims the broadcasts require more ports than claimed by uiCA, leading to lower throughput estimates from it.
However, LLVM-MCA is also definitely wrong about port use in some cases where uiCA is right (AVX512 floating point on Ice Lake client / Tiger Lake / Rocket Lake).
Right now you can only specify a single alignment offset manually. It would be useful to offer an option where uiCA performs its simulation for all possible alignment offsets and gives you error bars for its output values based on that. This would give you an idea of how sensitive/robust a block is with respect to alignment. (Perhaps this could be part of a more general feature to let you simulate across a range of different parameters, e.g. different microarchitectures, with a statistical summary of the variation?)
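A sketch of what the requested sweep could look like, with run_uica standing in for an invocation of uiCA.py with a given alignment offset; the offsets and mocked throughput numbers are purely illustrative.

```python
import statistics

def alignment_sweep(run_uica, offsets=range(0, 64, 16)):
    # Run the simulation once per alignment offset and summarize the spread.
    tps = {off: run_uica(off) for off in offsets}
    vals = list(tps.values())
    return {
        "min": min(vals),
        "max": max(vals),
        "mean": statistics.mean(vals),
        "stdev": statistics.pstdev(vals),
    }

# Mocked throughputs for illustration only:
mock = {0: 2.0, 16: 2.0, 32: 2.5, 48: 3.0}
summary = alignment_sweep(mock.__getitem__)
print(summary["min"], summary["max"])  # 2.0 3.0
```

A summary like this would directly show how sensitive a block is to its alignment.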