apress / pro-tbb

Source Code for 'Pro TBB: C++ Parallel Programming with Threading Building Blocks' by Michael Voss, Rafael Asenjo, and James Reinders

Home Page: https://www.apress.com/us/book/9781484243978

License: Other


pro-tbb's Introduction

SourceCode

Source code of the examples provided in each chapter of the TBB book (2019).

The Makefiles default to the Intel C++ compiler, but specifying a CXX definition on the command line switches the build to Microsoft (CXX=cl), Xcode (CXX=clang++), or GNU C++ (CXX=g++).

TBB itself is required; it can be installed as part of the Intel compiler installation or obtained separately.

The following commands worked for us prior to the release of the book:

- on macOS and Linux: `make`
- on macOS and Linux: `make CXX=g++`
- on Windows: `nmake /f Makefile.nmake`
- on Windows: `nmake /f Makefile.nmake CXX=cl`

The latter requires running vcvars64.bat (Visual Studio), setting the INCLUDE environment variable to include the windows\tbb\include and windows\pstl\include directories of your TBB installation, and setting the LIB environment variable to search the tbb\lib\ia32_win\vc_mt and tbb\lib\intel64_win\vc_mt directories.


Linux or Mac builds with: make (clean up with: make clean)

Windows build with: nmake /f Makefile.nmake (clean up with: nmake /f Makefile.nmake clean)


Chapters 18 and 19 require an OpenCL SDK to be installed (vendor of your choice). Khronos maintains a list at https://www.khronos.org/conformance/adopters/conformant-products/opencl. Intel's SDK is at https://software.intel.com/intel-opencl and NVIDIA's SDK is at https://developer.nvidia.com/opencl.

Chapter 20 requires the hwloc library to be installed: https://www.open-mpi.org/projects/hwloc/

Feel free to leave comments/suggestions/feedback on the git project website for the book.

Mike, Rafa, and James

pro-tbb's People

Contributors: jessicavakili, vossmjp

pro-tbb's Issues

improve graph examples by adding logging in all examples

I noticed that graph example 3_10 has no logging. The logs are really useful for understanding and validating the relationships. Also, the ability to have delays inserted in the nodes helps validate that many orders of signal arrivals are handled properly.

In 3_10, adding relatively long delays in the source node helps you understand the expected graph processing when working on a live stream, where frames arrive at a relatively long and fairly fixed interval.

I'm going to attach a fig_3_10 example with logging and delays added.

fig_3_10.zip

use of non-threadsafe std::cout in examples, such as fig_3_03

The examples often use std::cout inside tasks, and the output can get pretty jumbled.
For example, the fig_3_03 output of

```cpp
std::cout << "first node received: " << in << std::endl;
```

I've been changing these to, for example:

```cpp
std::stringstream ss;
ss << "first node received: " << in << std::endl;
std::cout << ss.str();
```
This seems to work, although it isn't clear to me that even this is guaranteed. So I'm adding a cout_locked function that should be guaranteed in my examples:

```cpp
#include <iostream>
#include <string>
#include <tbb/spin_mutex.h>

tbb::spin_mutex mylock;

// Serialize whole strings to std::cout so output from different
// threads cannot interleave.
void cout_locked(const std::string &ss) {
  tbb::spin_mutex::scoped_lock smut(mylock);
  std::cout << ss;
}
```
and so the code above becomes:

```cpp
std::stringstream ss;
ss << "first node received: " << in << std::endl;
//std::cout << ss.str();
cout_locked(ss.str());
```

The problem with the current output is easy to see if the fig_3_03 example is modified to send more messages in the test:

```cpp
// step 4: send messages
for (int i = 0; i < 100; i++)
  my_first_node.try_put(i);
```

example fig_3_16 creates too many nodes

The example fig_3_16 creates node pointers for the full NxN matrix, even though the Figure 3-15 example shows that nodes are not needed above the diagonal. I see that nodes aren't actually created for every pointer, but is the sparse vector necessary?

Also, the block_size parameter is not used in addEdges of fig_3_16. It is not clear whether this is an error, but it needs an explanation.

If the example is modified so that N is not evenly divisible by block_size, the results are incorrect versus the gold results; that restriction needs some explanation in the text.

In general, it would have been better for fig_3_16 to duplicate the prior example diagrams, with 8x8 triangle dependencies and block sizes of 1x1 and 2x2, so that it would be clear how the dependency graph maps onto them.
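On the sparse-vector question raised above, a minimal stand-alone sketch (not the book's code; the names are hypothetical) of a packed lower-triangular container that stores only the N*(N+1)/2 blocks on or below the diagonal:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: store only the blocks on or below the diagonal
// in a packed 1-D vector, instead of node pointers for the full NxN
// matrix. T would be the example's node-pointer type.
template <typename T>
class LowerTriangle {
  std::size_t n_;
  std::vector<T> data_;  // n*(n+1)/2 entries, packed row by row
public:
  explicit LowerTriangle(std::size_t n) : n_(n), data_(n * (n + 1) / 2) {}
  // valid only for c <= r, matching the Figure 3-15 dependency pattern
  T& at(std::size_t r, std::size_t c) {
    assert(c <= r && r < n_);
    return data_[r * (r + 1) / 2 + c];
  }
};
```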

parallel_scan example with double data is not deterministic

I modified the fig_2_14 example to test whether operations on double data are deterministic, since I didn't find anything analogous to parallel_deterministic_reduce for scans.

The test shows that repeated runs of the parallel_scan example, summing double data, return different values.

I'll attach my example.

fig_2_14_double.txt
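The behavior reported here is expected rather than a bug in the example: floating-point addition is not associative, and parallel_scan is free to re-associate partial sums differently from run to run. A tiny stand-alone demonstration of the underlying effect (not from the book):

```cpp
#include <cstdio>

int main() {
  // Floating-point addition is not associative, so any algorithm that
  // re-associates partial sums (as parallel_scan may, depending on how
  // the range is split) can produce run-to-run differences.
  double a = 1e16, b = -1e16, c = 1.0;
  std::printf("(a + b) + c = %.17g\n", (a + b) + c);  // prints 1
  std::printf("a + (b + c) = %.17g\n", a + (b + c));  // prints 0
  return 0;
}
```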

serialQuicksort in fig_2_03 is buggy

I noticed that by running sorted data back through the serialQuicksort in fig_2_03, I either get crashes or very long sort times.

I adapted an iterator-based quicksort from a solution on Stack Overflow, and it fixes the problems. There are discussions elsewhere about quicksort solutions for sorted data. I'll attach my adapted replacement: quicksort_fix.txt
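For readers who don't want the attachment, one common remedy is sketched below (a generic sketch, not the attached fix): a median-of-three pivot with a three-way partition. With a first-element pivot, already-sorted input makes every partition maximally unbalanced, which explains both the long sort times and the crashes (stack exhaustion from O(n) recursion depth).

```cpp
#include <algorithm>
#include <vector>

// Sketch: choose the median of the first, middle, and last elements as
// the pivot, then partition into <, ==, and > regions so runs of equal
// elements don't recurse forever.
template <typename It>
It medianOfThree(It first, It last) {
  It mid = first + (last - first) / 2, back = last - 1;
  if (*mid < *first) std::iter_swap(mid, first);
  if (*back < *first) std::iter_swap(back, first);
  if (*back < *mid) std::iter_swap(back, mid);
  return mid;  // *mid is now the median of the three samples
}

template <typename It>
void quicksort(It first, It last) {
  if (last - first < 2) return;
  auto pivot = *medianOfThree(first, last);
  It lo = std::partition(first, last,
                         [&](const auto& v) { return v < pivot; });
  It hi = std::partition(lo, last,
                         [&](const auto& v) { return !(pivot < v); });
  quicksort(first, lo);  // elements < pivot
  quicksort(hi, last);   // elements > pivot
}
```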

error: no matching function for call to ‘for_each’ in Chapter1/fig_1_05.cpp

I can't build the pstl library, so I use the prebuilt one from Intel: https://github.com/intel/tbb/releases

I replaced two lines, from:

```cpp
#include <pstl/algorithm>
#include <pstl/execution>
```

to:

```cpp
#include <pstl/experimental/algorithm>
#include <pstl/internal/algorithm_impl.h>
```

The error I encounter is:

```
figure_1_05.cpp: In function ‘int main()’:
figure_1_05.cpp:10:3: error: no matching function for call to
  ‘for_each(const pstl::execution::v1::parallel_policy&,
   std::vector<std::__cxx11::basic_string<char> >::iterator,
   std::vector<std::__cxx11::basic_string<char> >::iterator,
   main()::<lambda(std::__cxx11::string&)>)’
In file included from /usr/include/c++/5/algorithm:62:0,
  from /usr/local/include/pstl/experimental/internal/reduction_impl.h:20,
  from /usr/local/include/pstl/experimental/internal/reduction.h:23,
  from /usr/local/include/pstl/experimental/algorithm:21,
  from figure_1_05.cpp:1:
/usr/include/c++/5/bits/stl_algo.h:3761:5: note: candidate:
  template<class _IIter, class _Funct> _Funct std::for_each(_IIter, _IIter, _Funct)
/usr/include/c++/5/bits/stl_algo.h:3761:5: note: template argument
  deduction/substitution failed:
figure_1_05.cpp:10:3: note: deduced conflicting types for parameter ‘_IIter’
  (‘pstl::execution::v1::parallel_policy’ and
   ‘__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char>*,
    std::vector<std::__cxx11::basic_string<char> > >’)
```

I use Linux (Ubuntu 16.04) and g++ 5.5.0. I'm not a C++ expert. Does anyone have any idea what this error is and how I can fix it?
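The deduction failure suggests the policy-taking overload of for_each never came into scope: the replacement headers provide PSTL's experimental interfaces, not the parallel overloads of the standard algorithms, so only the plain two-iterator std::for_each is found. A minimal sketch of the intended fig_1_05 pattern with the original headers (this assumes a Parallel STL installation that ships <pstl/algorithm> and <pstl/execution> on the include path, plus TBB when linking):

```cpp
#include <pstl/algorithm>
#include <pstl/execution>
#include <iostream>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> v{"Hello", "Parallel", "STL"};
  // The policy-taking overload is only declared by the pstl headers;
  // pstl::execution::par requests parallel execution via TBB.
  std::for_each(pstl::execution::par, v.begin(), v.end(),
                [](std::string& s) { s += "!"; });
  for (const auto& s : v) std::cout << s << '\n';
  return 0;
}
```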

improve graph examples by adding random delays in task executions, example fig_3_05

The graph examples don't answer questions about what happens if inputs arrive out of order. The initial graph example logging, such as in fig_3_05, is also easier to understand if tbb::flow::serial is used instead of unlimited for the nodes. I'm adding random 1 to 10 msec work delays as the inputs are applied, to help my understanding of these issues, calling a spin_msecs function to insert the delays:

```cpp
void spin_msecs(int msecs) {
  tbb::tick_count t0 = tbb::tick_count::now();
  // busy-wait until the requested number of milliseconds has elapsed
  while ((tbb::tick_count::now() - t0).seconds() * 1000.0 < msecs);
}
```

I'm seeding a simple random generator:

```cpp
std::minstd_rand simple_rand;
simple_rand.seed(42);
```

then calling with:

```cpp
spin_msecs(simple_rand() % 10 + 1);
```
I'm attaching my modified fig_3_05.cpp code as an example.

fig_3_05.zip

Also, after uploading this, I added a log after step 4, prior to the wait_for_all, to confirm that the try_put calls are not blocking:

```cpp
{
  std::stringstream ss;
  ss << "try_puts done. proceeding to wait_for_all" << std::endl;
  cout_locked(ss.str());
}
```

fig_2_14 parallel_scan example code fails for non_zero v[0]

Using the book example code, I modified the initial values of input vector v to be i+2 instead of i. Running the serial sum returns 44 as last value, while the parallel_scan sum returns 42.

The fix for the fig_2_14 function containing the parallel_scan is to remove the rsum[0] = v[0] assignment and change the blocked_range to (0, N).

With those changes I get matching results for the serial_rsum and parallel_rsum vectors:

```
For N=8
input 2,3,4,5,6,7,8,9

combine(x=0,y=0)
sum[0..1), fs=True, 2
sum[2..3), fs=False, 4
sum[3..4), fs=False, 5
sum[6..7), fs=False, 8
combine(x=4,y=5)
sum[5..6), fs=False, 7
sum[1..2), fs=False, 3
sum[4..5), fs=False, 6
combine(x=2,y=3)
combine(x=6,y=7)
combine(x=5,y=9)
combine(x=14,y=13)
sum[1..2), fs=True, 5
combine(x=5,y=4)
sum[3..4), fs=True, 14
combine(x=14,y=6)
sum[2..3), fs=True, 9
combine(x=27,y=8)
sum[4..5), fs=True, 20
sum[7..8), fs=True, 44
sum[5..6), fs=True, 27
sum[6..7), fs=True, 35

parallel_sum = 2,5,9,14,20,27,35,44
```
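A minimal sketch of the shape of that fix, with stand-in names for the example's variables (v is the input, rsum the output; the scan body now covers the whole range [0, N) and nothing is pre-seeded):

```cpp
#include <tbb/tbb.h>
#include <vector>

// Sketch: running sum via the functional form of parallel_scan, with
// the blocked_range starting at 0 instead of seeding rsum[0] = v[0].
int parallelPrefixSum(const std::vector<int>& v, std::vector<int>& rsum) {
  int N = static_cast<int>(v.size());
  return tbb::parallel_scan(
      tbb::blocked_range<int>(0, N),  // was (1, N) with rsum[0] pre-seeded
      0,                              // identity for +
      [&](const tbb::blocked_range<int>& r, int sum, bool is_final_scan) {
        for (int i = r.begin(); i < r.end(); ++i) {
          sum += v[i];
          if (is_final_scan) rsum[i] = sum;
        }
        return sum;
      },
      [](int x, int y) { return x + y; });
}
```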

makePrimesTree in fig_2_18 and fig_2_19 fails to populate tree with all values

makePrimesTreeElem needs to take IntVector::iterator &i as its last parameter, and makePrimesTree needs to pass in a modifiable copy of vec.begin():

```cpp
IntVector::iterator it = vec.begin();
```

Currently the iterator parameter cannot be advanced across calls, so the tree gets filled with duplicates of some of the data.

I'll attach a copy of fig_2_19 that has this fixed. The same fix needs to go into fig_2_18.

fig_2_19b.txt
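The underlying problem generalizes: if a recursive tree builder takes its iterator by value, every sibling call re-reads the same elements. A generic illustration with hypothetical names (not the book's types):

```cpp
#include <vector>

// Hypothetical illustration: passing the cursor by reference makes the
// recursive calls share and advance a single position, so each tree
// node consumes a fresh value instead of duplicating earlier ones.
struct Node { int value; Node* left = nullptr; Node* right = nullptr; };

Node* build(int depth, std::vector<int>::iterator& it) {
  if (depth == 0) return nullptr;
  Node* n = new Node{*it++};        // consume exactly one value
  n->left = build(depth - 1, it);   // `it` has moved past this node...
  n->right = build(depth - 1, it);  // ...and past the whole left subtree
  return n;
}

// Caller must pass an lvalue cursor, not vec.begin() directly:
//   std::vector<int>::iterator it = vec.begin();
//   Node* root = build(3, it);     // vec needs at least 2^3 - 1 elements
```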

add comparison of serial and parallel speedup for fig_2_19

fig_2_19 provided the parallel_do_feeder for the tree handling, but didn't have a comparison with serial time, as in other examples.

I added the speed-up comparison versus serial time, and balanced the data initialization so that half the values are primes. I'm attaching the modified code.

serial_time == 1.72244 seconds parallel_time == 0.374743 seconds speedup == 4.59634
fig_2_19.txt

parallel_invoke is not using variadic templates in the example builds

The book indicates that parallel_invoke is unlimited in the number of functors thanks to variadic templates; however, these appear not to be enabled by default in the example builds: going above 10 lambdas produces an error. By the way, it might be worth updating the fig_1_04.cpp example, as I did in the attached file, to make it clear that the lambdas don't necessarily execute in sequential order within parallel_invoke.

I'm attaching a modified fig_1_04 example with 10 lambdas. If I add one more lambda, it gets a build error.

fig_1_04.zip
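For reference, a minimal stand-alone parallel_invoke call, well within the 10-functor limit noted above. The completion order of the tasks is unspecified, which is the point the modified fig_1_04 makes; printf is used so each line stays intact (each call locks the stream on common implementations):

```cpp
#include <cstdio>
#include <tbb/tbb.h>

int main() {
  // The three functors run in parallel; their completion order is
  // unspecified, so these lines may print in any order.
  tbb::parallel_invoke([] { std::printf("task A\n"); },
                       [] { std::printf("task B\n"); },
                       [] { std::printf("task C\n"); });
  return 0;
}
```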

parallel_deterministic_reduce example from fig_2_12 modification

Chapter 2 mentioned parallel_deterministic_reduce but didn't include an example or discuss the execution-time trade-off. I modified the pi estimation example to demonstrate the variable results of parallel_reduce and to measure the execution-time penalty of substituting parallel_deterministic_reduce.

parallel_deterministic_reduce was too slow for the original example's interval count. Output for 10,000,000 intervals is:

```
parallel pi == 3.1415926536006751
parallel pi == 3.1415926536006289
parallel pi == 3.1415926536006373
parallel pi == 3.1415926536006298
parallel_time == 0.015055519000000002 seconds
parallel pi == 3.1415926536006848
parallel pi == 3.1415926536006848
parallel pi == 3.1415926536006848
parallel pi == 3.1415926536006848
parallel_time_d == 1.854477962 seconds
```

So it does demonstrate varying results for the first four runs, which use parallel_reduce, and it shows the parallel_deterministic_reduce time being about 100x slower.
fig_2_12.txt
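For context, a sketch of the substitution (assuming the midpoint-rule pi loop of fig_2_12; the grainsize value is illustrative). parallel_deterministic_reduce always splits work the way a simple_partitioner would, so the default grainsize of 1 creates far more tasks than parallel_reduce under an auto_partitioner, which is consistent with the large slowdown measured above; a larger grainsize usually recovers much of the time.

```cpp
#include <tbb/tbb.h>

// Sketch: pi by the midpoint rule, with parallel_deterministic_reduce
// substituted for parallel_reduce. The explicit grainsize (100000 is an
// illustrative value) matters here because this algorithm always uses
// simple_partitioner-style splitting.
double calcPiDeterministic(long num_intervals) {
  const double dx = 1.0 / num_intervals;
  double sum = tbb::parallel_deterministic_reduce(
      tbb::blocked_range<long>(0, num_intervals, /*grainsize=*/100000),
      0.0,
      [=](const tbb::blocked_range<long>& r, double init) {
        for (long i = r.begin(); i != r.end(); ++i) {
          double x = (i + 0.5) * dx;    // midpoint of interval i
          init += 4.0 / (1.0 + x * x);  // integrand whose integral is pi
        }
        return init;
      },
      [](double a, double b) { return a + b; });
  return dx * sum;
}
```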

fig_2_14 parallel_scan example N size too small to show speed-up

The parallel_scan example fig_2_14 has N=1e4, which is too small to show a speed-up on my computer (4 cores). The code is written such that changing N to 1e6 fails to compile, due to a compile-time expression failure.

I rewrote the code to use std::size_t instead of int for all the parameters and was then able to compile and get a result that shows a speed-up. I'm attaching the code change.

serial_time == 0.00814126 seconds parallel_time == 0.00304168 seconds speedup == 2.67657
fig_2_14.txt

several build errors

I cloned these examples and attempted to build with g++ on Ubuntu 18.04 LTS. There were a number of errors; I think I resolved the path issues. I sourced the attached doit.txt script to set up the environment and build all the examples. I'm attaching doit.log, which shows the successes and errors.

doit.txt

doit.log

parallel_scan, number of calls to combine_body not deterministic for N=800

I was hoping the parallel_scan operation would be deterministic, or at least have a deterministic version option similar to parallel_deterministic_reduce.

I initially saw the number of calls to combine_body in the fig_2_14 example vary between 1 and 16 when N=800 on my 4-core, 8-thread x86, but this was my misinterpretation: apparently the vscode debugger resumes all threads with a single resume.

I can add a scoped_lock at the start of the scan-body and combine-body scopes to see that all blocks are called. An example of the calls with N=8 is a bit more interesting than the fragment shown in the book:

```
For N=8
combine(x=0,y=0)
sum[1..2), fs=True, 1
sum[3..4), fs=False, 3
sum[6..7), fs=False, 6
sum[5..6), fs=False, 5
sum[2..3), fs=False, 2
sum[4..5), fs=False, 4
combine(x=2,y=3)
combine(x=1,y=5)
combine(x=4,y=5)
combine(x=1,y=2)
combine(x=6,y=9)
sum[2..3), fs=True, 3
combine(x=6,y=4)
sum[5..6), fs=True, 15
combine(x=15,y=6)
sum[3..4), fs=True, 6
sum[7..8), fs=True, 28
sum[6..7), fs=True, 21
sum[4..5), fs=True, 10

parallel_sum = 0,1,3,6,10,15,21,28
```
However, the above sequence is incorrect, since the interval [1..2) is called with final_sum=True on its first execution. This produces a wrong result if v[0] != 0; the interval [0..1) is the one that needs to be called first with final_sum=True. I'm adding a separate issue to make that explicit.

fig_3_16 dependency graph processing is doing x[i] calculations twice

I added logging to the fig_3_16 dependency graph example and exercised it with several different block sizes. The block that computes the x[i] values is sometimes executed twice. Below is a log for block_size=1 and N=8 showing the "calc" executions occurring twice. Multiple calc executions also occur with block_size=2, so it isn't a block_size=1 issue.

```
r:c= calc x[0]=0,
r:c=0:1, calc x[1]=0.5,
r:c= calc x[1]=0.5,
r:c=0:2,
r:c=0:3,
r:c=0:4,
r:c=1:2, calc x[2]=0.5,
r:c=1:3,
r:c=0:5,
r:c=1:4,
r:c= calc x[2]=0.5,
r:c=0:6,
r:c=1:5,
r:c=2:3, calc x[3]=0.35,
r:c=1:6,
r:c=0:7,
r:c=2:4,
r:c=2:5,
r:c=2:6,
r:c= calc x[3]=0.35,
r:c=1:7,
r:c=2:7,
r:c=3:4, calc x[4]=0.261765,
r:c=3:5,
r:c=3:6,
r:c= calc x[4]=0.261765,
r:c=3:7,
r:c=4:5, calc x[5]=0.207805,
r:c= calc x[5]=0.207805,
r:c=4:6,
r:c=4:7,
r:c=5:6, calc x[6]=0.171998,
r:c=5:7,
r:c= calc x[6]=0.171998,
r:c=6:7, calc x[7]=0.146639,
r:c= calc x[7]=0.146639,
```

I'm attaching the modified fig_3_16 example with logging and random delays inserted.

fig_3_16.zip

parallel_for ch02, fig_2_06 is calling copy constructor on vector

Stepping through the fig_2_06 example code in the debugger, I see it calling the copy constructor on the vector.

The example would make more sense if it modified the contents of the vector, which would then also let you verify the result of the parallel operation.

With the code changed to modify the vector contents and avoid the copy constructor, it then uses the template below, which makes more sense for the chapter example:

```cpp
//! Parallel iteration over a range of integers with a step provided and default partitioner
template <typename Index, typename Function>
void parallel_for(Index first, Index last, Index step, const Function& f) {
    parallel_for_impl<Index, Function, const auto_partitioner>(
        first, last, step, f, auto_partitioner());
}
```
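A minimal sketch of that suggestion (not the book's code): the index-based overload with the lambda capturing the vector by reference, so nothing is copied and the result is trivially checked:

```cpp
#include <cstddef>
#include <tbb/tbb.h>
#include <vector>

int main() {
  std::vector<int> v(1000);
  // Index-based parallel_for: v is captured by reference, so no copy
  // constructor runs, and the loop writes results we can verify.
  tbb::parallel_for(std::size_t(0), v.size(), std::size_t(1),
                    [&v](std::size_t i) { v[i] = static_cast<int>(2 * i); });
  for (std::size_t i = 0; i < v.size(); ++i)
    if (v[i] != static_cast<int>(2 * i)) return 1;  // verification failed
  return 0;
}
```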

mistake in text on p60 of 2019_Book_ProTBB.pdf

I believe
"avoiding the sequential task spawning limitations of parallel_do."

on p60 of 2019_Book_ProTBB.pdf should be

"avoiding the sequential task spawning limitations of parallel_for."
