packtpublishing / the-art-of-writing-efficient-programs Goto Github PK

View Code? Open in Web Editor NEW

253.0 11.0 65.0 91 KB

The Art of Writing Efficient Programs, published by Packt

License: MIT License

C 99.15% C++ 0.85%

the-art-of-writing-efficient-programs's Introduction

The Art of Writing Efficient Programs

This is the code repository for The Art of Writing Efficient Programs, published by Packt.

An advanced programmer's guide to efficient hardware utilization and compiler optimizations using C++ examples

What is this book about?

The great free lunch of "performance taking care of itself" is over. Until recently, programs got faster by themselves as CPUs were upgraded, but that doesn't happen anymore. The clock frequency of new processors has almost peaked. New architectures provide small improvements to existing programs, but this only helps slightly. Processors do get larger and powerful, but most of this new power is consumed by the increased number of processing cores and other “extra” computing units. To write efficient software, you now have to know how to program by making good use of the available computing resources, and this book will teach you how to do that.

This book covers the following exciting features:

Discover how to use the hardware computing resources in your programs effectively
Understand the relationship between memory order and memory barriers
Familiarize yourself with the performance implications of different data structures and organizations
Assess the performance impact of concurrent memory accessed and how to minimize it
Discover when to use and when not to use lock-free programming techniques
Explore different ways to improve the effectiveness of compiler optimizations
Design APIs for concurrent data structures and high-performance data structures to avoid inefficiencies

If you feel this book is for you, get your copy today!

Instructions and Navigations

All of the code is organized into folders. For example, Chapter02.

The code will look like the following:

std::vector<double> v;
… add data to v …
std::for_each(v.begin(), v.end(),[](double& x){ ++x; });

Following is what you need for this book: This book is for experienced developers and programmers who work on performance-critical projects and want to learn different techniques to improve the performance of their code. Programmers who belong to algorithmic trading, gaming, bioinformatics, computational genomics, or computational fluid dynamics communities can learn various techniques from this book and apply them in their domain of work. Although this book uses the C++ language, the concepts demonstrated in the book can be easily transferred or applied to other compiled languages such as C, Java, Rust, Go, and more.

With the following software and hardware list you can run all code files present in the book (Chapter 1-12). For more instructions on how to run the code, click here.

Software and Hardware List

Chapter	Software required	OS required
1-12	C++ compiler (GCC, Clang, Visual Studio, etc.)	Windows, Mac OS X, and Linux (Any)
1-12	Profiler (VTune, Perf, GoogleProf, etc.)	Windows, Mac OS X, and Linux (Any)
1-12	Benchmark Library (Google Bench)	Windows, Mac OS X, and Linux (Any)

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. Click here to download it.

Errata

Page 25: The sentence "for example, if the actual string is generated from a simulation that takes ten hours, the one hundred seconds it takes to sort it is hardly worth noticing." must be read as "for example, if the actual string is generated from a simulation that takes ten hours, the one hundred milliseconds it takes to sort it is hardly worth noticing."
Page 45: The file name for the first code snippet must be read as 01a_compare_timer.C and the second code snippet must be read as 01a_compare_timer_a.C/01a_compare_timer_b.C.
Page 116: The sentence "As we have seen, under the right circumstances, the CPU can do several operations per second" must be read as "As we have seen, under the right circumstances, the CPU can do several operations per cycle".

Get to Know the Author

Fedor G. Pikus is a chief engineering scientist in the Mentor IC Segment of Siemens Digital Industries Software and is responsible for the long-term technical direction of Calibre products, the design and architecture of software, and research into new software technologies. His previous roles included senior software engineer at Google and chief software architect at Mentor Graphics. Fedor is a recognized expert in high-performance computing and C++. He has presented his works at CPPCon, SD West, DesignCon, and in software development journals, and is also an O'Reilly author. Fedor has over 25 patents and over 100 papers and conference presentations on physics, EDA, software design, and C++.

Download a free PDF

If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.

https://packt.link/free-ebook/9781800208117

the-art-of-writing-efficient-programs's People

Contributors

Stargazers

Watchers

Forkers

dusong7 bgkyer utjfritz aakash10292 nuzhny007 lee-s dieterblocher seatalk manu-xdev flexxxman ilovejs dunli sfrias sonicyouth98 kaifsm mycpuorg andrecurvello corsair-cxs hsaligan pradeep119 ajunlonglive jackiedinh8 iandmozart feiyunwill nguyenhopthanh devcrafter louishust cuiyixin555 rkterungwa16 dharmesh1979 zabrane ususnc kobedeshow allensmile dreamplayer-zhang chinajayyyy happytigger xuesongzh awfeequdng elonvampire frankzheng0422 acmilanwangchao123 bijoumd78 slw-git jfkonecn d3nd3r guillaumemilitello pangzhiwang poisonbox saorisakura tonyfu97 magneticmartian shiyi23 nscooling mykhani linusboehm flamederwolf jiangshifu kanonwy ar61 suhasghorp zhaoqinghai seeingall2022 hmthanh bssrdf

the-art-of-writing-efficient-programs's Issues

It is unclear why 04_substring_sort_a.C is slower than 03_substring_sort_a.C

The book said that by the following chapter (Chapter 3) the reason why would have been clear, however I think that an explicit explanation is missing from the chapter, remaining opaque the reason why there is such a gap between indexing using an unsigned int versus a signed int.

Links:
04_substring_sort_a.C

03_substring_sort_a.C

Content missing...

File 11_compare_mbm_c.C has no content. Like to add a pull request, but the repo does not allow that.

Question about Chapter07 /02b_stack_cas.C

For push and pop function.

void push(const T& v) {
    // Reserve the slot for the new object by advancing the producer count.
    counts_t n = n_.load(std::memory_order_relaxed);
    if (n.pc_ == cap_) abort();
    int i = 0;
    while (!n.equal(n_) || !n_.compare_exchange_weak(n, {n.pc_ + 1, n.cc_}, std::memory_order_acquire, std::memory_order_relaxed)) {
        if (n.pc_ == cap_) abort();
        nanosleep(i);
    };
    // Producer count advanced, slot n.pc_ + 1 is ours.
    ++n.pc_;
    new (&s_[n.pc_]) T(v);

    // Advance the consumer count to match.
    if (!n_.compare_exchange_strong(n, {n.pc_, n.cc_ + 1}, std::memory_order_release, std::memory_order_relaxed)) abort();
}
std::optional<T> pop() {
    // Decrement the consumer count.
    counts_t n = n_.load(std::memory_order_relaxed);
    if (n.cc_ == 0) return std::optional<T>(std::nullopt);
    int i = 0;
    while (!n.equal(n_) || !n_.compare_exchange_weak(n, {n.pc_, n.cc_ - 1}, std::memory_order_acquire, std::memory_order_relaxed)) { 
        if (n.cc_ == 0) return std::optional<T>(std::nullopt);
        nanosleep(i);
    };
    // Consumer count decremented, slot n.cc_ - 1 is ours.
    --n.cc_;
    std::optional<T> res(std::move(s_[n.pc_]));
    s_[n.pc_].~T();

    // Decrement the producer count to match.
    if (!n_.compare_exchange_strong(n, {n.pc_ - 1, n.cc_}, std::memory_order_release, std::memory_order_relaxed)) abort();
    return res;
}

in
while (!n.equal(n_) || !n_.compare_exchange_weak(n, {n.pc_ + 1, n.cc_}, std::memory_order_acquire, std::memory_order_relaxed)) {
n.pc_ is already updated to n.pc_ + 1.
while (!n.equal(n_) || !n_.compare_exchange_weak(n, {n.pc_, n.cc_ - 1}, std::memory_order_acquire, std::memory_order_relaxed)) {
n.cc_ is already updated to n.cc_ -1.
why n.pc_ and n.cc_ updated again in the following:
++n.pc_;
--n.cc_;
are these update duplicate?

Chapter03 01_superscalar.C Illegal instruction (core dumped)

When compiling 01_superscalar.C with the flags listed in the book the following output appears:

$ clang++ -g -O3 -mavx2 -Wall -pedantic -I$GBENCH_DIR/include 01_superscalar.cpp $GBENCH_DIR/lib/libbenchmark.a -pthread -lrt -lm -o benchmark
$ ./benchmark

2023-11-04T17:51:30-05:00
Running ./benchmark
Run on (12 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.00, 0.00, 0.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Illegal instruction (core dumped)

When analyzing in GDB this is the output:

Program received signal SIGILL, Illegal instruction.
BM_add (state=...) at 01_superscalar.cpp:19
19        a1 += p1[i] + p2[i];

Here is the relevant information about my system I am running this on:

$ uname -a
Linux arch-server 6.5.9-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 26 Oct 2023 00:52:20 +0000 x86_64 GNU/Linux
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
    CPU family:          6
    Model:               62
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            4
    CPU(s) scaling MHz:  40%
    CPU max MHz:         3900.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            6986.30
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss
                          ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_ts
                         c cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 s
                         se4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb s
                         tibp tpr_shadow flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts vnmi md_clear flush
                         _l1d
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    12 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Unknown: No mitigations
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

This is the code I am using:

 #include <stdlib.h>
 #include <string.h>
 #include <benchmark/benchmark.h>
 
 void BM_add(benchmark::State& state)
 {
     srand(1);
     const unsigned int N = state.range(0);
     std::vector<unsigned long> v1(N), v2(N);
     for(size_t i = 0; i < N; ++i){
         v1[i] = rand();
         v2[i] = rand();
     }
     unsigned long* p1 = v1.data();
     unsigned long* p2 = v2.data();
     for(auto _ : state){
         unsigned long a1 = 0;
         for(size_t i = 0; i < N; ++i){
             a1 += p1[i] + p2[i];
         }
         benchmark::DoNotOptimize(a1);
         benchmark::ClobberMemory();
     }
     state.SetItemsProcessed(N*state.iterations());
 }
 
 #define ARGS ->Arg(1 << 22)
 
 BENCHMARK(BM_add) ARGS;
 
 BENCHMARK_MAIN();