안녕하세요? 뭔가 흥미로워서 실험해봤습니다. 실행시 linux의 perf tool을 이용해서 microarchitecture 수준으로 분석해보려했습니다.
코드에 다음의 2가지 수정했습니다.
- linux에서 빌드되도록 헤더등 변경
- 코드 수준 시간 측정 제거. test대상 함수 (ConstantNodeMultiJump 및 NodeMultiJump) 부분을 각 100회씩 반복. code에서 실행시간을 안 재고 perf tool을 쓰기위해, 실행시간에서 memory할당에 들어가는 portion을 최소화 (초기화 약 4초 소모, 전체 실행시간 40초이므로 약 10%는 malloc에 걸리는 시간)
결론을 요약하면, ConstantNodeMultiJump 가 빠른데 (41초 vs 49초), 그 이유는 단순히 loop에 들어가는 branch 명령어 개수가 줄어들었기 때문인거 같네요. (branch가 전체 명령에서 20% 이상 차지). Cache miss는 NodeMultiJump쪽이 적은데 오히려 branch명령어가 너무 많아서 느려지네요.
테스트 환경입니다.
OS: Centos 6.4
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
컴파일 및 실행옵션입니다. inlining 안되도록 했습니다. 저 옵션이 없으면 main함수에 두 함수 모두 inline이 됩니다.
g++ -Wall -O3 -fno-inline-small-functions test.cc -o test
perf stat --detailed ./test const ;precomputed chasing (ConstantNodeMultiJump )
perf stat --detailed ./test dyn ;loop chasing (NodeMultiJump)
프로세서 (2CPU SMP 서버인데 다른 부하 없는환경에서 테스트했습니다)
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
stepping : 7
cpu MHz : 2700.131
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 7
cpu cores : 8
apicid : 47
initial apicid : 47
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5399.24
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
ConstantNodeMultiJump 실행결과
init..
Const iteration starts...
iter = 0xbfe21010
Performance counter stats for './test const':
41647.912130 task-clock # 0.999 CPUs utilized
54 context-switches # 0.001 K/sec
3 cpu-migrations # 0.000 K/sec
781,549 page-faults # 0.019 M/sec
111,772,982,470 cycles # 2.684 GHz [40.00%]
97,093,930,469 stalled-cycles-frontend # 86.87% frontend cycles idle [40.00%]
79,613,541,313 stalled-cycles-backend # 71.23% backend cycles idle [40.00%]
48,308,189,834 instructions # 0.43 insns per cycle
# 2.01 stalled cycles per insn [49.99%]
10,109,395,751 branches # 242.735 M/sec [49.99%]
789,580 branch-misses # 0.01% of all branches [50.00%]
16,462,138,947 L1-dcache-loads # 395.269 M/sec [50.00%]
5,068,265,572 L1-dcache-load-misses # 30.79% of all L1-dcache hits [50.00%]
7,390,572,483 LLC-loads # 177.454 M/sec [40.01%]
4,381,740,374 LLC-load-misses # 59.29% of all LL-cache hits [40.00%]
41.677833250 seconds time elapsed
NodeMultiJump 실행결과
init..
Dynamic iteration starts...
iter = 0xc0903010
Performance counter stats for './test dyn':
49823.521151 task-clock # 0.999 CPUs utilized
62 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
781,550 page-faults # 0.016 M/sec
133,739,155,166 cycles # 2.684 GHz [40.00%]
114,271,910,122 stalled-cycles-frontend # 85.44% frontend cycles idle [39.99%]
89,506,080,739 stalled-cycles-backend # 66.93% backend cycles idle [39.99%]
78,172,559,769 instructions # 0.58 insns per cycle
# 1.46 stalled cycles per insn [49.99%]
18,099,100,887 branches # 363.264 M/sec [49.99%]
2,469,814 branch-misses # 0.01% of all branches [50.00%]
17,410,143,512 L1-dcache-loads # 349.436 M/sec [50.00%]
5,072,684,004 L1-dcache-load-misses # 29.14% of all L1-dcache hits [50.01%]
5,258,879,909 LLC-loads # 105.550 M/sec [40.01%]
3,351,536,144 LLC-load-misses # 63.73% of all LL-cache hits [40.01%]
49.859166477 seconds time elapsed
각 함수 Dump (코드 design대로 MOV 사용해서 branch code없이 pointer chasing이 unroll된것을 볼수있습니다.)
0000000000400a70 <_Z21ConstantNodeMultiJumpiP4Node>:
400a70: 83 ff 0a cmp $0xa,%edi
400a73: 77 2f ja 400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
400a75: 89 ff mov %edi,%edi
400a77: ff 24 fd d8 0f 40 00 jmpq *0x400fd8(,%rdi,8)
400a7e: 66 90 xchg %ax,%ax
400a80: 48 8b 46 08 mov 0x8(%rsi),%rax
400a84: 48 8b 40 08 mov 0x8(%rax),%rax
400a88: 48 8b 40 08 mov 0x8(%rax),%rax
400a8c: 48 8b 40 08 mov 0x8(%rax),%rax
400a90: 48 8b 40 08 mov 0x8(%rax),%rax
400a94: 48 8b 40 08 mov 0x8(%rax),%rax
400a98: 48 8b 40 08 mov 0x8(%rax),%rax
400a9c: 48 8b 40 08 mov 0x8(%rax),%rax
400aa0: 48 8b 70 08 mov 0x8(%rax),%rsi
400aa4: 48 89 f0 mov %rsi,%rax
400aa7: c3 retq
400aa8: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
400aaf: 00
400ab0: 48 8b 46 08 mov 0x8(%rsi),%rax
400ab4: 48 8b 40 08 mov 0x8(%rax),%rax
400ab8: 48 8b 40 08 mov 0x8(%rax),%rax
400abc: 48 8b 40 08 mov 0x8(%rax),%rax
400ac0: 48 8b 40 08 mov 0x8(%rax),%rax
400ac4: eb ca jmp 400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
400ac6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
400acd: 00 00 00
400ad0: 48 8b 76 08 mov 0x8(%rsi),%rsi
400ad4: eb ce jmp 400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
400ad6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
400add: 00 00 00
400ae0: 48 8b 46 08 mov 0x8(%rsi),%rax
400ae4: 48 8b 70 08 mov 0x8(%rax),%rsi
400ae8: eb ba jmp 400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
400aea: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400af0: 48 8b 46 08 mov 0x8(%rsi),%rax
400af4: 48 8b 40 08 mov 0x8(%rax),%rax
400af8: 48 8b 70 08 mov 0x8(%rax),%rsi
400afc: eb a6 jmp 400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
400afe: 66 90 xchg %ax,%ax
400b00: 48 8b 46 08 mov 0x8(%rsi),%rax
400b04: 48 8b 40 08 mov 0x8(%rax),%rax
400b08: 48 8b 40 08 mov 0x8(%rax),%rax
400b0c: 48 8b 70 08 mov 0x8(%rax),%rsi
400b10: eb 92 jmp 400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
400b12: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400b18: 48 8b 46 08 mov 0x8(%rsi),%rax
400b1c: 48 8b 40 08 mov 0x8(%rax),%rax
400b20: 48 8b 40 08 mov 0x8(%rax),%rax
400b24: 48 8b 40 08 mov 0x8(%rax),%rax
400b28: 48 8b 70 08 mov 0x8(%rax),%rsi
400b2c: e9 73 ff ff ff jmpq 400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
400b31: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400b38: 48 8b 46 08 mov 0x8(%rsi),%rax
400b3c: e9 4f ff ff ff jmpq 400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
400b41: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400b48: 48 8b 46 08 mov 0x8(%rsi),%rax
400b4c: 48 8b 40 08 mov 0x8(%rax),%rax
400b50: e9 3b ff ff ff jmpq 400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
400b55: 0f 1f 00 nopl (%rax)
400b58: 48 8b 46 08 mov 0x8(%rsi),%rax
400b5c: 48 8b 40 08 mov 0x8(%rax),%rax
400b60: 48 8b 40 08 mov 0x8(%rax),%rax
400b64: e9 27 ff ff ff jmpq 400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
400b69: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
0000000000400b70 <_Z13NodeMultiJumpiP4Node>:
400b70: 85 ff test %edi,%edi
400b72: 48 89 f0 mov %rsi,%rax
400b75: 7e 14 jle 400b8b <_Z13NodeMultiJumpiP4Node+0x1b>
400b77: 31 d2 xor %edx,%edx
400b79: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400b80: 83 c2 01 add $0x1,%edx
400b83: 48 8b 40 08 mov 0x8(%rax),%rax
400b87: 39 fa cmp %edi,%edx
400b89: 75 f5 jne 400b80 <_Z13NodeMultiJumpiP4Node+0x10>
400b8b: f3 c3 repz retq
400b8d: 0f 1f 00 nopl (%rax)