Coder Social home page Coder Social logo

constantnodemultiiterationtest's Introduction

ConstantNodeMultiIterationTest

2회 이상의 linked list를 순회할 때, loop 분기 명령을 쓰는 횟수를 절약해 보면 어떨까?

...라는 발상으로 만들어진, 나비 모양(...)의 코드.

성능 테스트를 해봤습니다.

#테스트 시나리오

  1. linked list를 평벙한 방식으로 끝까지 순회하기.
  2. listed list를 10개마다 나비모양 코드로 분기명령 없이 순회하기.

CPU가 얼마나 스마트하게 처리하는지를 보기 위해, 컴파일러 최적화 끔. 당연히 릴리즈 모드.

#테스트 결과 후자가 약 1.5배정도 빠름.

#테스트 결과에 대한 생각 분기명령을 줄이는 것이 효과적이긴 하지만, 그래도 cache miss가 일어나는 메모리 액세스에 비할 바가 아니다.

#기타사항

대충 짠 코드이고, 정확한 결과라고 바로 장담 못합니다.

집단지성으로 정확화 기대해봅니다.

constantnodemultiiterationtest's People

Contributors

imays76 avatar

Watchers

James Cloos avatar  avatar

constantnodemultiiterationtest's Issues

실험

안녕하세요? 뭔가 흥미로워서 실험해봤습니다. 실행시 linux의 perf tool을 이용해서 microarchitecture 수준으로 분석해보려했습니다.

코드에 다음의 2가지 수정했습니다.

  1. linux에서 빌드되도록 헤더등 변경
  2. 코드 수준 시간 측정 제거. test대상 함수 (ConstantNodeMultiJump 및 NodeMultiJump) 부분을 각 100회씩 반복. code에서 실행시간을 안 재고 perf tool을 쓰기위해, 실행시간에서 memory할당에 들어가는 portion을 최소화 (초기화 약 4초 소모, 전체 실행시간 40초이므로 약 10%는 malloc에 걸리는 시간)

결론을 요약하면, ConstantNodeMultiJump 가 빠른데 (41초 vs 49초), 그 이유는 단순히 loop에 들어가는 branch 명령어 개수가 줄어들었기 때문인거 같네요. (branch가 전체 명령에서 20% 이상 차지). Cache miss는 NodeMultiJump쪽이 적은데 오히려 branch명령어가 너무 많아서 느려지네요.

테스트 환경입니다.

OS: Centos 6.4
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)

컴파일 및 실행옵션입니다. inlining 안되도록 했습니다. 저 옵션이 없으면 main함수에 두 함수 모두 inline이 됩니다.

g++ -Wall -O3 -fno-inline-small-functions test.cc -o test
perf stat --detailed ./test const  ;precomputed chasing (ConstantNodeMultiJump )
perf stat --detailed ./test dyn     ;loop chasing (NodeMultiJump)

프로세서 (2CPU SMP 서버인데 다른 부하 없는환경에서 테스트했습니다)

vendor_id   : GenuineIntel
cpu family  : 6
model       : 45
model name  : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
stepping    : 7
cpu MHz     : 2700.131
cache size  : 20480 KB
physical id : 1
siblings    : 16
core id     : 7
cpu cores   : 8
apicid      : 47
initial apicid  : 47
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5399.24
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

ConstantNodeMultiJump 실행결과

init..
Const iteration starts...
iter = 0xbfe21010

 Performance counter stats for './test const':
      41647.912130 task-clock                #    0.999 CPUs utilized          
                54 context-switches          #    0.001 K/sec                  
                 3 cpu-migrations            #    0.000 K/sec                  
           781,549 page-faults               #    0.019 M/sec                  
   111,772,982,470 cycles                    #    2.684 GHz                     [40.00%]
    97,093,930,469 stalled-cycles-frontend   #   86.87% frontend cycles idle    [40.00%]
    79,613,541,313 stalled-cycles-backend    #   71.23% backend  cycles idle    [40.00%]
    48,308,189,834 instructions              #    0.43  insns per cycle        
                                             #    2.01  stalled cycles per insn [49.99%]
    10,109,395,751 branches                  #  242.735 M/sec                   [49.99%]
           789,580 branch-misses             #    0.01% of all branches         [50.00%]
    16,462,138,947 L1-dcache-loads           #  395.269 M/sec                   [50.00%]
     5,068,265,572 L1-dcache-load-misses     #   30.79% of all L1-dcache hits   [50.00%]
     7,390,572,483 LLC-loads                 #  177.454 M/sec                   [40.01%]
     4,381,740,374 LLC-load-misses           #   59.29% of all LL-cache hits    [40.00%]

      41.677833250 seconds time elapsed

NodeMultiJump 실행결과

init..
Dynamic iteration starts...
iter = 0xc0903010

 Performance counter stats for './test dyn':
      49823.521151 task-clock                #    0.999 CPUs utilized          
                62 context-switches          #    0.001 K/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
           781,550 page-faults               #    0.016 M/sec                  
   133,739,155,166 cycles                    #    2.684 GHz                     [40.00%]
   114,271,910,122 stalled-cycles-frontend   #   85.44% frontend cycles idle    [39.99%]
    89,506,080,739 stalled-cycles-backend    #   66.93% backend  cycles idle    [39.99%]
    78,172,559,769 instructions              #    0.58  insns per cycle        
                                             #    1.46  stalled cycles per insn [49.99%]
    18,099,100,887 branches                  #  363.264 M/sec                   [49.99%]
         2,469,814 branch-misses             #    0.01% of all branches         [50.00%]
    17,410,143,512 L1-dcache-loads           #  349.436 M/sec                   [50.00%]
     5,072,684,004 L1-dcache-load-misses     #   29.14% of all L1-dcache hits   [50.01%]
     5,258,879,909 LLC-loads                 #  105.550 M/sec                   [40.01%]
     3,351,536,144 LLC-load-misses           #   63.73% of all LL-cache hits    [40.01%]

      49.859166477 seconds time elapsed

각 함수 Dump (코드 design대로 MOV 사용해서 branch code없이 pointer chasing이 unroll된것을 볼수있습니다.)

0000000000400a70 <_Z21ConstantNodeMultiJumpiP4Node>:
  400a70:   83 ff 0a                cmp    $0xa,%edi
  400a73:   77 2f                   ja     400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400a75:   89 ff                   mov    %edi,%edi
  400a77:   ff 24 fd d8 0f 40 00    jmpq   *0x400fd8(,%rdi,8)
  400a7e:   66 90                   xchg   %ax,%ax
  400a80:   48 8b 46 08             mov    0x8(%rsi),%rax
  400a84:   48 8b 40 08             mov    0x8(%rax),%rax
  400a88:   48 8b 40 08             mov    0x8(%rax),%rax
  400a8c:   48 8b 40 08             mov    0x8(%rax),%rax
  400a90:   48 8b 40 08             mov    0x8(%rax),%rax
  400a94:   48 8b 40 08             mov    0x8(%rax),%rax
  400a98:   48 8b 40 08             mov    0x8(%rax),%rax
  400a9c:   48 8b 40 08             mov    0x8(%rax),%rax
  400aa0:   48 8b 70 08             mov    0x8(%rax),%rsi
  400aa4:   48 89 f0                mov    %rsi,%rax
  400aa7:   c3                      retq   
  400aa8:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  400aaf:   00 
  400ab0:   48 8b 46 08             mov    0x8(%rsi),%rax
  400ab4:   48 8b 40 08             mov    0x8(%rax),%rax
  400ab8:   48 8b 40 08             mov    0x8(%rax),%rax
  400abc:   48 8b 40 08             mov    0x8(%rax),%rax
  400ac0:   48 8b 40 08             mov    0x8(%rax),%rax
  400ac4:   eb ca                   jmp    400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
  400ac6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400acd:   00 00 00 
  400ad0:   48 8b 76 08             mov    0x8(%rsi),%rsi
  400ad4:   eb ce                   jmp    400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400ad6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400add:   00 00 00 
  400ae0:   48 8b 46 08             mov    0x8(%rsi),%rax
  400ae4:   48 8b 70 08             mov    0x8(%rax),%rsi
  400ae8:   eb ba                   jmp    400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400aea:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400af0:   48 8b 46 08             mov    0x8(%rsi),%rax
  400af4:   48 8b 40 08             mov    0x8(%rax),%rax
  400af8:   48 8b 70 08             mov    0x8(%rax),%rsi
  400afc:   eb a6                   jmp    400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400afe:   66 90                   xchg   %ax,%ax
  400b00:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b04:   48 8b 40 08             mov    0x8(%rax),%rax
  400b08:   48 8b 40 08             mov    0x8(%rax),%rax
  400b0c:   48 8b 70 08             mov    0x8(%rax),%rsi
  400b10:   eb 92                   jmp    400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400b12:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400b18:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b1c:   48 8b 40 08             mov    0x8(%rax),%rax
  400b20:   48 8b 40 08             mov    0x8(%rax),%rax
  400b24:   48 8b 40 08             mov    0x8(%rax),%rax
  400b28:   48 8b 70 08             mov    0x8(%rax),%rsi
  400b2c:   e9 73 ff ff ff          jmpq   400aa4 <_Z21ConstantNodeMultiJumpiP4Node+0x34>
  400b31:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400b38:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b3c:   e9 4f ff ff ff          jmpq   400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
  400b41:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400b48:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b4c:   48 8b 40 08             mov    0x8(%rax),%rax
  400b50:   e9 3b ff ff ff          jmpq   400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
  400b55:   0f 1f 00                nopl   (%rax)
  400b58:   48 8b 46 08             mov    0x8(%rsi),%rax
  400b5c:   48 8b 40 08             mov    0x8(%rax),%rax
  400b60:   48 8b 40 08             mov    0x8(%rax),%rax
  400b64:   e9 27 ff ff ff          jmpq   400a90 <_Z21ConstantNodeMultiJumpiP4Node+0x20>
  400b69:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

0000000000400b70 <_Z13NodeMultiJumpiP4Node>:
  400b70:   85 ff                   test   %edi,%edi
  400b72:   48 89 f0                mov    %rsi,%rax
  400b75:   7e 14                   jle    400b8b <_Z13NodeMultiJumpiP4Node+0x1b>
  400b77:   31 d2                   xor    %edx,%edx
  400b79:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400b80:   83 c2 01                add    $0x1,%edx
  400b83:   48 8b 40 08             mov    0x8(%rax),%rax
  400b87:   39 fa                   cmp    %edi,%edx
  400b89:   75 f5                   jne    400b80 <_Z13NodeMultiJumpiP4Node+0x10>
  400b8b:   f3 c3                   repz retq 
  400b8d:   0f 1f 00                nopl   (%rax)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.