andreas-abel / uiCA
uops.info Code Analyzer
License: GNU Affero General Public License v3.0
For the following testcase (short link: https://bit.ly/3v1QfWV )
loop:
add rax, [rsi]
adc rax, [rsi]
dec rcx
jnz loop
There's the expected critical path over the RAX updates, plus an unexpected two-cycle critical path that goes through the carry flag:
I'm surprised to see the backward edge from CF to the first instruction there.
On Sandy Bridge (and Ivy Bridge), 256-bit AVX loads and stores had half the throughput of their 128-bit SSE counterparts, so ideally uiCA should show that the following loop runs at 2 cycles per iteration:
loop:
vmovaps ymm0, [rsi]
vmovaps ymm0, [rsi]
dec ecx
jnz loop
(I guess this might not be straightforward to model, since it's not the same as a load uop occupying port 2/3 for two cycles: on the second cycle, the port can still perform the store-address part of another store uop.)
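One way the guessed behavior could be modeled is sketched below; `SnbLoadPort`, the two-cycle occupancy, and the dispatch rules are my assumptions for illustration, not uiCA's actual model.

```python
# Toy sketch of the hypothesized SNB behavior: a 256-bit load keeps the
# load half of port 2/3 busy for two cycles, but the AGU can still accept
# a store-address uop in the second cycle. All numbers are assumptions.
class SnbLoadPort:
    def __init__(self):
        self.load_busy_until = 0  # cycle up to which the load half is busy

    def try_dispatch(self, kind: str, cycle: int) -> bool:
        if kind == "load256":
            if cycle < self.load_busy_until:
                return False
            self.load_busy_until = cycle + 2  # occupies the load half for 2 cycles
            return True
        if kind == "store_address":
            return True  # AGU stays available while a 256-bit load drains
        return False

p = SnbLoadPort()
print(p.try_dispatch("load256", 0))        # True
print(p.try_dispatch("load256", 1))        # False: second load must wait
print(p.try_dispatch("store_address", 1))  # True: STA can still go
print(p.try_dispatch("load256", 2))        # True
```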
It would be great if uiCA were supported as a tool in godbolt similar to llvm-mca. In compiler-explorer/compiler-explorer#2843, RubenRBS indicated that they would be willing to review a PR adding that feature, and indicated similar PRs can serve as a model.
I may be able to get around to doing it at some point, but thought I'd raise this as an issue here in case uiCA authors are interested in pursuing it.
I was exploring variants of a loop and found one where uiCA.py throws an exception even though nothing out of the ordinary seems to happen in the loop.
Short link to uica.uops.info: https://bit.ly/3Pga31D
For reference, the loop at the above link that shows the issue on Skylake through Cascade Lake:
loop:
vmovaps ymm0, [rsi]
vmovaps ymm1, [rsi+32]
vorps ymm0, ymm0, [rsi+64]
vorps ymm1, ymm1, [rsi+96]
vorps ymm1, ymm1, [rsi+128]
vorps ymm1, ymm1, [rsi+160]
vorps ymm0, ymm0, [rsi+192]
vmovaps xmm2, [rsi+224]
vpor xmm2, xmm2, [rsi+240]
add rsi, 256
vorps ymm0, ymm0, ymm1
vextractf128 xmm1, ymm0, 1
vpor xmm0, xmm0, xmm1
vpor xmm0, xmm0, xmm2
vpcmpeqb xmm0, xmm0, xmm7
vpmovmskb eax, xmm0
test eax, eax
jz out
dec ecx
jnz loop
out:
Exception trace:
Traceback (most recent call last):
File "/uiCA/uiCA.py", line 2448, in <module>
main()
File "/uiCA/uiCA.py", line 2441, in main
TP = runSimulation(disas, uArchConfig, int(args.alignmentOffset), args.initPolicy, args.noMicroFusion, args.noMacroFusion, args.simpleFrontEnd,
File "/uiCA/uiCA.py", line 2294, in runSimulation
frontEnd.cycle(clock)
File "/uiCA/uiCA.py", line 572, in cycle
newInstrIUops = self.DSB.cycle()
File "/uiCA/uiCA.py", line 743, in cycle
DSBBlock = self.DSBBlockQueue[0]
IndexError: deque index out of range
Thank you so much for uiCA!
.LBB0_5: # %L50
# =>This Inner Loop Header: Depth=1
vmovups zmm1, zmmword ptr [r11 + 4*rdi]
vmovups zmm2, zmmword ptr [r11 + 4*rdi + 64]
vmovups zmm3, zmmword ptr [r11 + 4*rdi + 128]
vmovups zmm4, zmmword ptr [r11 + 4*rdi + 192]
kxnorw k1, k0, k0
vxorps xmm5, xmm5, xmm5
vgatherdps zmm5 {k1}, zmmword ptr [rax + 4*zmm1]
kxnorw k1, k0, k0
vxorps xmm1, xmm1, xmm1
vgatherdps zmm1 {k1}, zmmword ptr [rax + 4*zmm2]
vfmsub132ps zmm5, zmm0, zmmword ptr [rdx + 4*rdi] # zmm5 = (zmm5 * mem) - zmm0
vfmadd132ps zmm1, zmm5, zmmword ptr [rdx + 4*rdi + 64] # zmm1 = (zmm1 * mem) + zmm5
kxnorw k1, k0, k0
vxorps xmm2, xmm2, xmm2
vgatherdps zmm2 {k1}, zmmword ptr [rax + 4*zmm3]
kxnorw k1, k0, k0
vxorps xmm0, xmm0, xmm0
vgatherdps zmm0 {k1}, zmmword ptr [rax + 4*zmm4]
vfmadd132ps zmm2, zmm1, zmmword ptr [rdx + 4*rdi + 128] # zmm2 = (zmm2 * mem) + zmm1
vfnmsub132ps zmm0, zmm2, zmmword ptr [rdx + 4*rdi + 192] # zmm0 = -(zmm0 * mem) - zmm2
add rdi, 64
add rsi, -4
jne .LBB0_5
Results in
/tmp/ee3af94421ca41f38d2808205839a8bc.asm: Assembler messages:
/tmp/ee3af94421ca41f38d2808205839a8bc.asm:10: Error: operand size mismatch for `vgatherdps'
/tmp/ee3af94421ca41f38d2808205839a8bc.asm:13: Error: operand size mismatch for `vgatherdps'
/tmp/ee3af94421ca41f38d2808205839a8bc.asm:18: Error: operand size mismatch for `vgatherdps'
/tmp/ee3af94421ca41f38d2808205839a8bc.asm:21: Error: operand size mismatch for `vgatherdps'
Switching to AT&T syntax yields
.LBB0_5: # %L50
# =>This Inner Loop Header: Depth=1
vmovups (%r11,%rdi,4), %zmm1
vmovups 64(%r11,%rdi,4), %zmm2
vmovups 128(%r11,%rdi,4), %zmm3
vmovups 192(%r11,%rdi,4), %zmm4
kxnorw %k0, %k0, %k1
vxorps %xmm5, %xmm5, %xmm5
vgatherdps (%rax,%zmm1,4), %zmm5 {%k1}
kxnorw %k0, %k0, %k1
vxorps %xmm1, %xmm1, %xmm1
vgatherdps (%rax,%zmm2,4), %zmm1 {%k1}
vfmsub132ps (%rdx,%rdi,4), %zmm0, %zmm5 # zmm5 = (zmm5 * mem) - zmm0
vfmadd132ps 64(%rdx,%rdi,4), %zmm5, %zmm1 # zmm1 = (zmm1 * mem) + zmm5
kxnorw %k0, %k0, %k1
vxorps %xmm2, %xmm2, %xmm2
vgatherdps (%rax,%zmm3,4), %zmm2 {%k1}
kxnorw %k0, %k0, %k1
vxorps %xmm0, %xmm0, %xmm0
vgatherdps (%rax,%zmm4,4), %zmm0 {%k1}
vfmadd132ps 128(%rdx,%rdi,4), %zmm1, %zmm2 # zmm2 = (zmm2 * mem) + zmm1
vfnmsub132ps 192(%rdx,%rdi,4), %zmm2, %zmm0 # zmm0 = -(zmm0 * mem) - zmm2
addq $64, %rdi
addq $-4, %rsi
jne .LBB0_5
Adding .intel_syntax to the Intel version and running llvm-mca locally also works (but I prefer uiCA over llvm-mca).
It would be nice to have support for ICL with move elimination (i.e. early microcode) to be able to compare it with the current microcode.
Perhaps such support could be added in a systematic way for all microarchitectures that currently have move elimination disabled.
Meanwhile, I have successfully submitted a FreeBSD port of uiCA. It's a bit of a mess but seems to work fine.
Integer pcmpeq* with source = dest sets the destination to all-ones without a dependency on the source (but still occupies an execution unit). For example, the following loop runs at one cycle per iteration on Skylake, while uiCA predicts two:
loop:
vpcmpeqd xmm0, xmm0, xmm0
vpor xmm0, xmm0, xmm0
dec ecx
jnz loop
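The effect of recognizing this dependency-breaking idiom can be sketched with a toy loop-carried latency model (my own framing; the 1-cycle latencies are from uops.info, everything else is an assumption):

```python
# Toy model: if vpcmpeqd with src == dst is treated as dependency-breaking,
# the edge from the previous iteration's vpor back into vpcmpeqd disappears
# and the loop becomes throughput-bound instead of latency-bound.
def loop_carried_latency(dep_breaking_recognized: bool) -> int:
    vpcmpeqd_lat = 1
    vpor_lat = 1
    if dep_breaking_recognized:
        # vpcmpeqd ignores its sources: no loop-carried dependency chain.
        return 0
    # Otherwise the chain is vpor -> vpcmpeqd -> vpor of the next iteration.
    return vpcmpeqd_lat + vpor_lat

print(loop_carried_latency(False))  # 2, matching uiCA's prediction
print(loop_carried_latency(True))   # 0: loop is throughput-bound (~1 c/iter observed)
```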
I'm currently working on packaging uiCA for the FreeBSD ports collection.
It would be great if the project came with a standard distutils setup for packaging.
In our use case, we have no network access during build and packaging, but all submodules will have been checked out at extract time (the XML file will have been fetched, too).
kmovd eax, k1
kmovd k1, eax
kmovd edx, k0
kmovd k0, edx
On uica.uops.info: https://bit.ly/3Wn5k0E
Expected: the k0 and k1 versions work the same.
For example,
.intel_syntax noprefix
L64:
vmovapd xmm4, xmm1
vmulsd xmm1, xmm2, xmm1
vsubsd xmm1, xmm1, xmm3
vmulsd xmm3, xmm1, qword ptr [rax + 8*rdx]
vaddsd xmm0, xmm0, xmm3
inc rdx
vmovapd xmm3, xmm4
cmp rcx, rdx
jne L64
as -msyntax=intel cheb2.asm -o cheb2.o
uiCA.py cheb2.o -arch TGL -trace cheb2.html
Had a few questions:
black is fairly common

This is a case where uiCA predictions for SKL seem to be pretty far off. Pretty much all tools I know of get this one wrong, despite it only using reg-reg operations.
Test case: https://bit.ly/3jlvOOJ
uiCA predicts 4c/iteration throughput; the actual observed throughput on a Skylake laptop (i7-6560U) is 6c/iteration. If you take out one instruction on the non-PSADBW critical path (say, comment out the paddd xmm2, xmm3), this does run at 4c/iteration on real HW, and uiCA agrees.
The actual computation here is nonsense, I was just trying to come up with a small repro.
The case this sets up is two vector instructions with different latencies on the same port (p5 here) that would have to finish in the same cycle. They can't: as far as I know, the vector register file and bypass network can accept only one result per port per cycle. I don't know what the exact criteria are, nor why the penalty here is two cycles rather than one. I also don't know how often this occurs in practice, but I have hit cases in the past where it seems to be a factor.
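A toy model of the hypothesized constraint, assuming (as the text does) that each port's writeback path accepts one result per cycle; the exact criteria and the size of the penalty remain unknown:

```python
# Two uops dispatched to the same port conflict if they would complete
# in the same cycle, since the port can write back only one result per
# cycle (assumption based on the description above).
def writeback_conflict(dispatch_a: int, latency_a: int,
                       dispatch_b: int, latency_b: int) -> bool:
    return dispatch_a + latency_a == dispatch_b + latency_b

# A 3-cycle uop dispatched at cycle 0 and a 1-cycle uop dispatched at
# cycle 2 would both finish at cycle 3, so one must be delayed:
print(writeback_conflict(0, 3, 2, 1))  # True
print(writeback_conflict(0, 3, 1, 1))  # False: they finish in different cycles
```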
First off, congrats on the announcement and initial release! I'm excited to start using it in my work.
I tried using the online version with this block: https://gist.github.com/pervognsen/5e081b19720a8d6954e28f187b50beff
It works on Skylake but when selecting some other uarches like Ice Lake I get the following Python error message:
Traceback (most recent call last):
File "/uiCA/uiCA.py", line 2413, in <module>
main()
File "/uiCA/uiCA.py", line 2363, in main
TP = float(uopsForRelRound[-1][lastApplicableInstr][-1].retired
IndexError: list index out of range
(Note: Tested whatever version is on https://uica.uops.info, none locally)
Instructions that use rip-relative addressing seem to be treated as if they had 1 cycle of latency.
e.g.
vpmulhw ymm12, ymm12, ymmword ptr [rip+0x14b]
vpmulhw ymm12, ymm12, ymmword ptr [rax+0x14b]
On uica.uops.info: https://bit.ly/3H3YsAR
Expected: the rip and rax versions have the same distance between the D and E markers (5 cycles according to uops.info).
Actual:
I was on the fence about reporting this; in the end I figured I should, because the inaccuracy is significant, and I suspect this might be an implementation bug in the Python script rather than a gap in the reverse engineering of the port assignment algorithm.
Consider the following testcase (short link: https://bit.ly/3pu17dy)
1:
movzbl (%rdx),%eax
add $0x2,%rdx
add %ecx,%eax
movzbl -0x1(%rdx),%ecx
add %eax,%edi
add %eax,%ecx
add %ecx,%edi
cmp %rsi,%rdx
jne 1b
uiCA models this as 3 cycles per iteration for SNB, with the second instruction (add $0x2,%rdx) always going to port 5 and getting delayed by 1 cycle because port 5 is already occupied by the fused cmp-jne from the previous issue group. The graph indicates that port 5 gets assigned 1.5x more instructions than ports 0 and 1.
In reality I'm seeing this loop run close to 2 cycles per iteration on SNB, and port assignment is quite even:
Performance counter stats for './main':
9,250,721 uops_dispatched_port.port_0
9,620,564 uops_dispatched_port.port_1
5,053,695 uops_dispatched_port.port_2
5,056,202 uops_dispatched_port.port_3
22,616 uops_dispatched_port.port_4
11,392,109 uops_dispatched_port.port_5
11,926,865 cycles
45,291,707 instructions # 3.80 insn per cycle
(the above is perf stat for the whole program, hence some overhead, e.g. on port 4)
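A quick sanity check of the counters above (my own arithmetic; since the counts cover the whole program, the true in-loop figure is somewhat lower):

```python
# Back-of-the-envelope from the perf output: the loop body has
# 9 instructions, so iterations ~ instructions / 9.
instructions = 45_291_707
cycles = 11_926_865
iterations = instructions / 9
print(round(cycles / iterations, 2))  # 2.37 cycles/iteration, vs. uiCA's 3.0
```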
Exchanging the second and third instruction in the loop slows down the real execution, making port assignment less even, but improves the simulated result, making it more even.
It seems that in this line (line 9 in 6bd6e54), -f is missing, resulting in this error message:
error: the following file has local modifications:
XED-to-XML
(use --cached to keep the file, or -f to force removal)
fatal: Submodule work tree 'XED-to-XML' contains local modifications; use '-f' to discard them
(The corresponding command in the Windows script contains this flag.)
While this doesn't seem to affect the initial installation, it leaves the repo in a broken state for re-running the script for updates.
The fix for issue #15 did not include a correction for 256-bit stores: like loads, they have half the throughput of their 128-bit SSE counterparts, and the following loop runs at two cycles per iteration:
loop:
vmovaps [rdi], ymm0
dec ecx
jnz loop
At the moment uiCA removes p0 as a possible execution port on HSW/SKL for a branch early on:
Lines 95 to 96 in 9cbbe93
I'd like to suggest that this should be done only for the branch that terminates the input basic block. If the user provides a snippet that contains extra branches in the middle, as I did in issue #14 earlier, they are assumed to be never taken, and hence modeling them as occupying either port 0 or port 6 would be more accurate.
Edit: to be clear, I understand that with extra branches the input snippet is not a basic block (it's potentially an "extended basic block" in compiler-developer speak, if jumping into it after the first instruction is impossible).
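The suggestion could be sketched as follows (hypothetical, not uiCA's actual code; the port restriction reflects that on HSW/SKL predicted-taken branches execute only on port 6, while predicted not-taken branches can use port 0 or port 6):

```python
# Restrict a branch to port 6 only if it terminates the input block;
# branches in the middle are assumed never taken and keep both ports.
def candidate_ports(is_block_terminating_branch: bool) -> list:
    if is_block_terminating_branch:
        return ["6"]        # predicted taken: port 6 only
    return ["0", "6"]       # predicted not taken: port 0 or port 6

print(candidate_ports(True))   # ['6']
print(candidate_ports(False))  # ['0', '6']
```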
I used pyright and it reported the following errors
The line rd /s /q .git\modules\* in setup.cmd does not work, because rd does not support wildcard characters.
Here is an example with 27 FMA instructions, where uiCA runs 13 of them on port 0 and 14 on port 5. This leads to an estimated throughput of 14 cycles (the maximum of the two).
LLVM-MCA and OSACA predict ports 0 and 5 will be used evenly.
Naively, I'd expect the split to be even when averaging across many iterations of the loop. However, I'm unfamiliar with how these pipeline models work internally, so at the very least the differing predictions are interesting.
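The gap between the two views can be sketched numerically (my own framing, not any tool's actual algorithm): within one iteration, per-port uop counts are integers, so 27 uops on two ports cannot split evenly, while the amortized bound over many iterations is fractional.

```python
def per_iteration_bound(n_uops: int, n_ports: int) -> int:
    # Within a single iteration, the busiest port determines the estimate.
    return -(-n_uops // n_ports)  # ceiling division

def amortized_bound(n_uops: int, n_ports: int) -> float:
    # Averaged over many iterations, fractional occupancy is possible.
    return n_uops / n_ports

print(per_iteration_bound(27, 2))  # 14 (13 on one port, 14 on the other)
print(amortized_bound(27, 2))      # 13.5
```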
Here is some very similar assembly (the difference being that the address calculations were fused with the loads), where all three tools now predict the work to be evenly split across ports 0 and 5.
Something else interesting is that LLVM-MCA claims the broadcasts require more ports than claimed by uiCA, leading to lower throughput estimates from it.
However, LLVM-MCA is also definitely wrong about port use in some cases where uiCA is right (AVX512 floating point on Ice Lake client / Tiger Lake / Rocket Lake).
Right now you can only specify a single alignment offset manually. It would be useful to offer an option where uiCA performs its simulation for all possible alignment offsets and gives you error bars for its output values based on that. This would give you an idea of how sensitive/robust a block is with respect to alignment. (Perhaps this could be part of a more general feature to let you simulate across a range of different parameters, e.g. different microarchitectures, with a statistical summary of the variation?)
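A sketch of what the requested sweep could look like, with run_uica standing in for an invocation of uiCA.py with a given alignment offset; the offsets and mocked throughput numbers are purely illustrative.

```python
import statistics

def alignment_sweep(run_uica, offsets=range(0, 64, 16)):
    # Run the simulation once per alignment offset and summarize the spread.
    tps = {off: run_uica(off) for off in offsets}
    vals = list(tps.values())
    return {
        "min": min(vals),
        "max": max(vals),
        "mean": statistics.mean(vals),
        "stdev": statistics.pstdev(vals),
    }

# Mocked throughputs for illustration only:
mock = {0: 2.0, 16: 2.0, 32: 2.5, 48: 3.0}
summary = alignment_sweep(mock.__getitem__)
print(summary["min"], summary["max"])  # 2.0 3.0
```

A summary like this would directly show how sensitive a block is to its alignment.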