The constantnodemultiiterationtest from imays76

실험

안녕하세요? 뭔가 흥미로워서 실험해봤습니다. 실행시 linux의 perf tool을 이용해서 microarchitecture 수준으로 분석해보려했습니다.

코드에 다음의 2가지 수정했습니다.

linux에서 빌드되도록 헤더등 변경
코드 수준 시간 측정 제거. test대상 함수 (ConstantNodeMultiJump 및 NodeMultiJump) 부분을 각 100회씩 반복. code에서 실행시간을 안 재고 perf tool을 쓰기위해, 실행시간에서 memory할당에 들어가는 portion을 최소화 (초기화 약 4초 소모, 전체 실행시간 40초이므로 약 10%는 malloc에 걸리는 시간)

결론을 요약하면, ConstantNodeMultiJump 가 빠른데 (41초 vs 49초), 그 이유는 단순히 loop에 들어가는 branch 명령어 개수가 줄어들었기 때문인거 같네요. (branch가 전체 명령에서 20% 이상 차지). Cache miss는 NodeMultiJump쪽이 적은데 오히려 branch명령어가 너무 많아서 느려지네요.

테스트 환경입니다.

OS: Centos 6.4
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)

컴파일 및 실행옵션입니다. inlining 안되도록 했습니다. 저 옵션이 없으면 main함수에 두 함수 모두 inline이 됩니다.

g++ -Wall -O3 -fno-inline-small-functions test.cc -o test
perf stat --detailed ./test const  ;precomputed chasing (ConstantNodeMultiJump )
perf stat --detailed ./test dyn     ;loop chasing (NodeMultiJump)

프로세서 (2CPU SMP 서버인데 다른 부하 없는환경에서 테스트했습니다)

vendor_id   : GenuineIntel
cpu family  : 6
model       : 45
model name  : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
stepping    : 7
cpu MHz     : 2700.131
cache size  : 20480 KB
physical id : 1
siblings    : 16
core id     : 7
cpu cores   : 8
apicid      : 47
initial apicid  : 47
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5399.24
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

ConstantNodeMultiJump 실행결과

init..
Const iteration starts...
iter = 0xbfe21010

 Performance counter stats for './test const':
      41647.912130 task-clock                #    0.999 CPUs utilized          
                54 context-switches          #    0.001 K/sec                  
                 3 cpu-migrations            #    0.000 K/sec                  
           781,549 page-faults               #    0.019 M/sec                  
   111,772,982,470 cycles                    #    2.684 GHz                     [40.00%]
    97,093,930,469 stalled-cycles-frontend   #   86.87% frontend cycles idle    [40.00%]
    79,613,541,313 stalled-cycles-backend    #   71.23% backend  cycles idle    [40.00%]
    48,308,189,834 instructions              #    0.43  insns per cycle        
                                             #    2.01  stalled cycles per insn [49.99%]
    10,109,395,751 branches                  #  242.735 M/sec                   [49.99%]
           789,580 branch-misses             #    0.01% of all branches         [50.00%]
    16,462,138,947 L1-dcache-loads           #  395.269 M/sec                   [50.00%]
     5,068,265,572 L1-dcache-load-misses     #   30.79% of all L1-dcache hits   [50.00%]
     7,390,572,483 LLC-loads                 #  177.454 M/sec                   [40.01%]
     4,381,740,374 LLC-load-misses           #   59.29% of all LL-cache hits    [40.00%]

      41.677833250 seconds time elapsed

NodeMultiJump 실행결과

init..
Dynamic iteration starts...
iter = 0xc0903010

 Performance counter stats for './test dyn':
      49823.521151 task-clock                #    0.999 CPUs utilized          
                62 context-switches          #    0.001 K/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
           781,550 page-faults               #    0.016 M/sec                  
   133,739,155,166 cycles                    #    2.684 GHz                     [40.00%]
   114,271,910,122 stalled-cycles-frontend   #   85.44% frontend cycles idle    [39.99%]
    89,506,080,739 stalled-cycles-backend    #   66.93% backend  cycles idle    [39.99%]
    78,172,559,769 instructions              #    0.58  insns per cycle        
                                             #    1.46  stalled cycles per insn [49.99%]
    18,099,100,887 branches                  #  363.264 M/sec                   [49.99%]
         2,469,814 branch-misses             #    0.01% of all branches         [50.00%]
    17,410,143,512 L1-dcache-loads           #  349.436 M/sec                   [50.00%]
     5,072,684,004 L1-dcache-load-misses     #   29.14% of all L1-dcache hits   [50.01%]
     5,258,879,909 LLC-loads                 #  105.550 M/sec                   [40.01%]
     3,351,536,144 LLC-load-misses           #   63.73% of all LL-cache hits    [40.01%]

      49.859166477 seconds time elapsed

각 함수 Dump (코드 design대로 MOV 사용해서 branch code없이 pointer chasing이 unroll된것을 볼수있습니다.)

0000000000400a70 <_Z21ConstantNodeMultiJumpiP4Node>:
  400a70:   83 ff 0a                cmp    $0xa,%edi
  400a73:   77 2f                   ja     400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400a75:   89 ff                   mov    %edi,%edi
  400a77:   ff 24 fd d8 0f 40 00    jmpq   *0x400fd8(,%rdi,8)
  400a7e:   66 90                   xchg   %ax,%ax
  400a80:   48 8b 46 08             mov    0x8(%rsi),%rax
  400a84:   48 8b 40 08             mov    0x8(%rax),%rax
  400a88:   48 8b 40 08             mov    0x8(%rax),%rax
  400a8c:   48 8b 40 08             mov    0x8(%rax),%rax
  400a90:   48 8b 40 08             mov    0x8(%rax),%rax
  400a94:   48 8b 40 08             mov    0x8(%rax),%rax
  400a98:   48 8b 40 08             mov    0x8(%rax),%rax
  400a9c:   48 8b 40 08             mov    0x8(%rax),%rax
  400aa0:   48 8b 70 08             mov    0x8(%rax),%rsi
  400aa4:   48 89 f0                mov    %rsi,%rax
  400aa7:   c3                      retq   
  400aa8:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  400aaf:   00 
  400ab0:   48 8b 46 08             mov    0x8(%rsi),%rax
  400ab4:   48 8b 40 08             mov    0x8(%rax),%rax
  400ab8:   48 8b 40 08             mov    0x8(%rax),%rax
  400abc:   48 8b 40 08             mov    0x8(%rax),%rax
  400ac0:   48 8b 40 08             mov    0x8(%rax),%rax
  400ac4:   eb ca                   jmp    400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
  400ac6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400acd:   00 00 00 
  400ad0:   48 8b 76 08             mov    0x8(%rsi),%rsi
  400ad4:   eb ce                   jmp    400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400ad6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400add:   00 00 00 
  400ae0:   48 8b 46 08             mov    0x8(%rsi),%rax
  400ae4:   48 8b 70 08             mov    0x8(%rax),%rsi
  400ae8:   eb ba                   jmp    400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400aea:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400af0:   48 8b 46 08             mov    0x8(%rsi),%rax
  400af4:   48 8b 40 08             mov    0x8(%rax),%rax
  400af8:   48 8b 70 08             mov    0x8(%rax),%rsi
  400afc:   eb a6                   jmp    400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400afe:   66 90                   xchg   %ax,%ax
  400b00:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b04:   48 8b 40 08             mov    0x8(%rax),%rax
  400b08:   48 8b 40 08             mov    0x8(%rax),%rax
  400b0c:   48 8b 70 08             mov    0x8(%rax),%rsi
  400b10:   eb 92                   jmp    400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400b12:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400b18:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b1c:   48 8b 40 08             mov    0x8(%rax),%rax
  400b20:   48 8b 40 08             mov    0x8(%rax),%rax
  400b24:   48 8b 40 08             mov    0x8(%rax),%rax
  400b28:   48 8b 70 08             mov    0x8(%rax),%rsi
  400b2c:   e9 73 ff ff ff          jmpq   400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400b31:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400b38:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b3c:   e9 4f ff ff ff          jmpq   400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
  400b41:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400b48:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b4c:   48 8b 40 08             mov    0x8(%rax),%rax
  400b50:   e9 3b ff ff ff          jmpq   400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
  400b55:   0f 1f 00                nopl   (%rax)
  400b58:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b5c:   48 8b 40 08             mov    0x8(%rax),%rax
  400b60:   48 8b 40 08             mov    0x8(%rax),%rax
  400b64:   e9 27 ff ff ff          jmpq   400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
  400b69:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

0000000000400b70 <_Z13NodeMultiJumpiP4Node>:
  400b70:   85 ff                   test   %edi,%edi
  400b72:   48 89 f0                mov    %rsi,%rax
  400b75:   7e 14                   jle    400b8b <_Z13NodeMultiJumpiP4Node+0x1b>
  400b77:   31 d2                   xor    %edx,%edx
  400b79:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400b80:   83 c2 01                add    $0x1,%edx
  400b83:   48 8b 40 08             mov    0x8(%rax),%rax
  400b87:   39 fa                   cmp    %edi,%edx
  400b89:   75 f5                   jne    400b80 <_Z13NodeMultiJumpiP4Node+0x10>
  400b8b:   f3 c3                   repz retq 
  400b8d:   0f 1f 00                nopl   (%rax)

imays76 / constantnodemultiiterationtest Goto Github PK

constantnodemultiiterationtest's Introduction

ConstantNodeMultiIterationTest

constantnodemultiiterationtest's People

Contributors

Watchers

constantnodemultiiterationtest's Issues

실험

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent