{"id":5195,"date":"2017-02-19T09:13:48","date_gmt":"2017-02-19T09:13:48","guid":{"rendered":"https:\/\/www.modernescpp.com\/index.php\/parallel-algorithm-of-the-standard-template-library\/"},"modified":"2017-02-19T09:13:48","modified_gmt":"2017-02-19T09:13:48","slug":"parallel-algorithm-of-the-standard-template-library","status":"publish","type":"post","link":"https:\/\/www.modernescpp.com\/index.php\/parallel-algorithm-of-the-standard-template-library\/","title":{"rendered":"Parallel Algorithms of the Standard Template Library"},"content":{"rendered":"<p>The idea is quite simple. The Standard Template Library has more than 100 algorithms for searching, counting, and manipulating ranges and their elements. With C++17, 69 of them are overloaded, and a few new ones are added. The overloaded and new algorithms can be invoked with a so-called execution policy. Using the execution policy, you can specify whether the algorithm should run sequentially, in parallel, or in parallel and vectorized.<\/p>\n<p>&nbsp;<\/p>\n<p><!--more--><\/p>\n<h2>A first example<\/h2>\n<p>Vectorization stands for the <a href=\"https:\/\/en.wikipedia.org\/wiki\/SIMD\">SIMD<\/a> (<strong>S<\/strong>ingle <strong>I<\/strong>nstruction, <strong>M<\/strong>ultiple <strong>D<\/strong>ata) extensions of the instruction set of a modern processor. SIMD enables your processor to execute one operation in parallel on several data elements.<\/p>\n<p>The policy tag chooses which overloaded variant of an algorithm is used. 
How does that work?<\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\">vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; v = ...\r\n\r\n<span style=\"color: #008000;\">\/\/ standard sequential sort<\/span>\r\nstd::sort(v.begin(), v.end());\r\n\r\n<span style=\"color: #008000;\">\/\/ sequential execution<\/span>\r\nstd::sort(std::parallel::seq, v.begin(), v.end());\r\n\r\n<span style=\"color: #008000;\">\/\/ permitting parallel execution<\/span>\r\nstd::sort(std::parallel::par, v.begin(), v.end());\r\n\r\n<span style=\"color: #008000;\">\/\/ permitting parallel and vectorized execution<\/span>\r\nstd::sort(std::parallel::par_unseq, v.begin(), v.end());\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The example shows that you can still use the classic variant of <span style=\"font-family: courier new,courier;\">std::sort<\/span> (line 4). In contrast, in C++17 you explicitly specify whether the sequential (line 7), parallel (line 10), or parallel and vectorized (line 13) version is used.<\/p>\n<p>You have to keep two points in mind.<\/p>\n<h2>Two special points<\/h2>\n<p>On the one hand, an algorithm will not always be performed in a parallel and vectorized way if you use the execution policy <span style=\"font-family: courier new,courier;\">std::parallel::par_unseq<\/span>. On the other hand, you, as the user, are responsible for correctly using the algorithm.<\/p>\n<h3>Parallel and vectorized execution<\/h3>\n<p>Whether an algorithm runs in a parallel and vectorized way depends on many points. It depends on whether the CPU and the operating system support SIMD instructions. 
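A note on naming: the listing above uses the pre-standard spelling of the Parallelism TS. In the final C++17 standard, the execution policies live in the <span style="font-family: courier new,courier;">std::execution</span> namespace of the <span style="font-family: courier new,courier;">&lt;execution&gt;</span> header. A minimal sketch (the helper name <span style="font-family: courier new,courier;">sortedCopy</span> is mine):

```cpp
#include <algorithm>
#include <vector>

// The post uses the Parallelism TS naming (std::parallel::seq, par,
// par_unseq). In the final C++17 standard the policies are spelled:
//   std::sort(std::execution::seq,       v.begin(), v.end());
//   std::sort(std::execution::par,       v.begin(), v.end());
//   std::sort(std::execution::par_unseq, v.begin(), v.end());
// The classic overload without a policy stays valid:
std::vector<int> sortedCopy(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```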
Additionally, it&#8217;s a question of the compiler and the optimization level you used to compile your code.<\/p>\n<p>&nbsp;<\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #0000ff;\">const<\/span> <span style=\"color: #2b91af;\">int<\/span> SIZE= 8;\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> vec[]={1,2,3,4,5,6,7,8};\r\n<span style=\"color: #2b91af;\">int<\/span> res[SIZE]={0,};\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n  <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #2b91af;\">int<\/span> i= 0; i &lt; SIZE; ++i) {\r\n    res[i]= vec[i]+5;\r\n  }\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Line 8 is the key line in the small program. 
Thanks to the compiler explorer <a href=\"https:\/\/godbolt.org\/\">https:\/\/godbolt.org\/<\/a>, it is quite easy to generate the assembler instructions for clang 3.6 with and without maximum optimization (-O3).<\/p>\n<h4>Without optimization<\/h4>\n<p>Although my time fiddling with assembler instructions is long gone, it&#8217;s evident that everything is done sequentially.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5192\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2017\/02\/seq.png\" alt=\"seq\" style=\"margin: 15px;\" width=\"282\" height=\"93\" \/><\/p>\n<h4>With maximum optimization<\/h4>\n<p>By using maximum optimization, I get instructions that run in parallel on several data elements.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5193\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2017\/02\/vec.png\" alt=\"vec\" style=\"margin: 15px;\" width=\"468\" height=\"138\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2017\/02\/vec.png 468w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2017\/02\/vec-300x88.png 300w\" sizes=\"auto, (max-width: 468px) 100vw, 468px\" \/><\/p>\n<p>The move operation <span style=\"font-family: courier new,courier;\">(movdqa)<\/span> and the add operation <span style=\"font-family: courier new,courier;\">(paddd)<\/span> use the special registers <span style=\"font-family: courier new,courier;\">xmm0<\/span> and <span style=\"font-family: courier new,courier;\">xmm1<\/span>. Both registers are so-called SSE registers and have 128 bits. 
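The same element-wise work can be expressed with <span style="font-family: courier new,courier;">std::transform</span>; with -O3, compilers typically vectorize it just like the raw loop shown above. A small sketch (the helper name <span style="font-family: courier new,courier;">addFive</span> is mine):

```cpp
#include <algorithm>
#include <array>

// Add 5 to every element -- the same work as line 8 of the small program.
// With -O3, compilers typically emit SIMD instructions such as
// movdqa/paddd for this loop body as well.
std::array<int, 8> addFive(std::array<int, 8> in) {
    std::transform(in.begin(), in.end(), in.begin(),
                   [](int e) { return e + 5; });
    return in;
}
```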
<a href=\"https:\/\/en.wikibooks.org\/wiki\/X86_Assembly\/SSE\">SSE<\/a> stands for <strong>S<\/strong>treaming <strong>S<\/strong>IMD <strong>E<\/strong>xtensions.&nbsp;<\/p>\n<h3>Hazards of data races and deadlocks<\/h3>\n<p>The algorithms do not automatically protect you from <a href=\"https:\/\/www.modernescpp.com\/index.php\/threads-sharing-data\">data races<\/a> and <a href=\"https:\/\/www.modernescpp.com\/index.php\/the-risk-of-mutexes\">deadlocks<\/a>. An example?<\/p>\n<p>&nbsp;<\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em;\">\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #2b91af;\">int<\/span> numComp= 0;\r\n\r\nstd::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; vec={1,3,8,9,10};\r\n\r\nstd::sort(std::parallel::par, vec.begin(), vec.end(), \r\n         [&amp;numComp](<span style=\"color: #2b91af;\">int<\/span> fir, <span style=\"color: #2b91af;\">int<\/span> sec){ numComp++; <span style=\"color: #0000ff;\">return<\/span> fir &lt; sec; });\r\n         \r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The small code snippet has a data race: <span style=\"font-family: courier new,courier;\">numComp<\/span> counts the number of comparisons and is, therefore, modified concurrently in the lambda function. To be well-defined, <span style=\"font-family: courier new,courier;\">numComp<\/span> has to be an atomic variable.<\/p>\n<h2>The static versus dynamic execution policy<\/h2>\n<p><strong><em>Sorry to say, but the dynamic execution policy did not make it into the C++17 standard. 
We have to wait for C++20.<\/em><\/strong><\/p>\n<p>Creating a thread is expensive; therefore, it makes no sense to sort a small container in a parallel (<span style=\"font-family: courier new,courier;\">std::parallel::par<\/span>) or parallel and vectorized (<span style=\"font-family: courier new,courier;\">std::parallel::par_unseq<\/span>) fashion. The administrative overhead of dealing with threads will outweigh the benefit of parallelization. It gets even worse if you use a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Divide_and_conquer_algorithm\">divide and conquer algorithm<\/a> such as quicksort.<\/p>\n<p>&nbsp;<\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #0000ff;\">template<\/span> &lt;<span style=\"color: #0000ff;\">class<\/span> <span style=\"color: #2b91af;\">ForwardIt<\/span>&gt;\r\n<span style=\"color: #2b91af;\">void<\/span> quicksort(ForwardIt first, ForwardIt last){\r\n  <span style=\"color: #0000ff;\">if<\/span>(first == last) <span style=\"color: #0000ff;\">return<\/span>;\r\n  <span style=\"color: #0000ff;\">auto<\/span> pivot = *std::next(first, std::distance(first,last)\/2);\r\n  ForwardIt middle1 = std::partition(std::parallel::par, first, last, \r\n                      [pivot](<span style=\"color: #0000ff;\">const<\/span> <span style=\"color: #0000ff;\">auto<\/span>&amp; em){ <span style=\"color: #0000ff;\">return<\/span> em &lt; pivot; });\r\n  ForwardIt middle2 = std::partition(std::parallel::par, middle1, last, \r\n                      [pivot](<span style=\"color: #0000ff;\">const<\/span> <span style=\"color: #0000ff;\">auto<\/span>&amp; em){ <span style=\"color: #0000ff;\">return<\/span> !(pivot &lt; em); });\r\n  quicksort(first, middle1);\r\n  
quicksort(middle2, last);\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The issue is that the number of threads quickly becomes too big for your system. To solve this issue, C++17 supports a dynamic execution policy.<\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\">std::<span style=\"color: #2b91af;\">size_t<\/span> threshold= ...;  <span style=\"color: #008000;\">\/\/ some value <\/span>\r\n\r\n<span style=\"color: #0000ff;\">template<\/span> &lt;<span style=\"color: #0000ff;\">class<\/span> <span style=\"color: #2b91af;\">ForwardIt<\/span>&gt;\r\n<span style=\"color: #2b91af;\">void<\/span> quicksort(ForwardIt first, ForwardIt last){\r\n  <span style=\"color: #0000ff;\">if<\/span>(first == last) <span style=\"color: #0000ff;\">return<\/span>;\r\n  std::<span style=\"color: #2b91af;\">size_t<\/span> distance= std::distance(first,last);\r\n  <span style=\"color: #0000ff;\">auto<\/span> pivot = *std::next(first, distance\/2);\r\n\r\n  std::parallel::execution_policy exec_pol= std::parallel::par;\r\n  <span style=\"color: #0000ff;\">if<\/span> ( distance &lt; threshold ) exec_pol= std::parallel::seq;\r\n\r\n  ForwardIt middle1 = std::partition(exec_pol, first, last, \r\n                      [pivot](<span style=\"color: #0000ff;\">const<\/span> <span style=\"color: #0000ff;\">auto<\/span>&amp; em){ <span style=\"color: #0000ff;\">return<\/span> em &lt; pivot; });\r\n  ForwardIt middle2 = std::partition(exec_pol, middle1, last, \r\n                      [pivot](<span style=\"color: #0000ff;\">const<\/span> <span style=\"color: #0000ff;\">auto<\/span>&amp; em){ <span style=\"color: #0000ff;\">return<\/span> !(pivot &lt; em); 
});\r\n  quicksort(first, middle1);\r\n  quicksort(middle2, last);\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>In lines 9 and 10, I use the dynamic execution policy. By default, quicksort will run in parallel (line 9). If the length of the range is smaller than a given <span style=\"font-family: courier new,courier;\">threshold<\/span> (line 1), quicksort will run sequentially (line 10).<\/p>\n<h2>All algorithms<\/h2>\n<p>69 of the STL algorithms support a parallel or a parallel and vectorized execution. Here they are.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-5194\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2017\/02\/ParallelAlgorithmn.png\" alt=\"ParallelAlgorithmn\" width=\"959\" height=\"626\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2017\/02\/ParallelAlgorithmn.png 959w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2017\/02\/ParallelAlgorithmn-300x196.png 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2017\/02\/ParallelAlgorithmn-768x501.png 768w\" sizes=\"auto, (max-width: 959px) 100vw, 959px\" \/><\/p>\n<p>In addition, we get eight new algorithms.<\/p>\n<h3>New algorithms<\/h3>\n<p>The new variation of <span style=\"font-family: courier new,courier;\">std::for_each<\/span> and the new algorithms <span style=\"font-family: courier new,courier;\">std::for_each_n, std::exclusive_scan, std::inclusive_scan, std::transform_exclusive_scan, std::transform_inclusive_scan, std::reduce,<\/span> and <span style=\"font-family: courier new,courier;\">std::transform_reduce<\/span> are in the <span style=\"font-family: courier new,courier;\">std<\/span> namespace. 
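One of these additions, <span style="font-family: courier new,courier;">std::for_each_n</span>, applies a callable to the first n elements of a range and can also be used without an execution policy. A minimal sketch (the helper name <span style="font-family: courier new,courier;">addFiveToFirstN</span> is mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// std::for_each_n (C++17) visits only the first n elements of the range.
std::vector<int> addFiveToFirstN(std::vector<int> v, std::size_t n) {
    std::for_each_n(v.begin(), n, [](int& e) { e += 5; });
    return v;
}
```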
<\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em;\">\n<pre style=\"margin: 0; line-height: 125%;\">std::for_each\r\nstd::for_each_n\r\nstd::exclusive_scan\r\nstd::inclusive_scan\r\nstd::transform_exclusive_scan\r\nstd::transform_inclusive_scan\r\nstd::reduce\r\nstd::transform_reduce\r\n<\/pre>\n<\/div>\n<p>Let&#8217;s have a closer look at <span style=\"font-family: courier new,courier;\">std::transform_reduce<\/span>.<\/p>\n<h3>transform becomes map<\/h3>\n<p>The function <span style=\"font-family: courier new,courier;\">map<\/span>, well known from Haskell, is called <span style=\"font-family: courier new,courier;\">std::transform<\/span> in C++. If that is not a broad hint: when I substitute transform with map in the name <span style=\"font-family: courier new,courier;\">std::transform_reduce<\/span>, I get <span style=\"font-family: courier new,courier;\">std::map_reduce<\/span>. <a href=\"https:\/\/en.wikipedia.org\/wiki\/MapReduce\">MapReduce<\/a> is the well-known parallel framework that maps each value to a new value in the first phase and reduces all values to the result in the second phase.<\/p>\n<p>The strategy is directly applicable in C++17. Of course, my algorithm will not run on a big, distributed system, but the strategy is the same: in the map phase, I map each word to its length; in the reduce phase, I reduce the lengths of all words to their sum. 
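A caveat on the snippet below: it uses the parameter order of the Parallelism TS. In the overload set that finally landed in C++17 (header <span style="font-family: courier new,courier;">&lt;numeric&gt;</span>), the initial value and the reduction operation come before the transformation. A sketch without an execution policy (the helper name <span style="font-family: courier new,courier;">sumOfLengths</span> is mine):

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Map phase: word -> length; reduce phase: sum of the lengths.
// Final C++17 order: init, then the reduction, then the transformation.
std::size_t sumOfLengths(const std::vector<std::string>& words) {
    return std::transform_reduce(
        words.begin(), words.end(),
        std::size_t{0},                                   // initial value
        std::plus<>{},                                    // reduce
        [](const std::string& s) { return s.size(); });   // map
}
```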
<\/p>\n<p>&nbsp;<\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; border: solid gray; border-width: .1em .1em .1em .8em;\">\n<pre style=\"margin: 0; line-height: 125%;\">std::vector&lt;std::string&gt; str{<span style=\"color: #a31515;\">\"Only\"<\/span>,<span style=\"color: #a31515;\">\"for\"<\/span>,<span style=\"color: #a31515;\">\"testing\"<\/span>,<span style=\"color: #a31515;\">\"purpose\"<\/span>};\r\n\r\n\r\nstd::<span style=\"color: #2b91af;\">size_t<\/span> result= std::transform_reduce(std::parallel::par, \r\n                         str.begin(), str.end(), \r\n                         [](<span style=\"color: #0000ff;\">const<\/span> std::string&amp; s){ <span style=\"color: #0000ff;\">return<\/span> s.length(); }, \r\n                         0, [](std::<span style=\"color: #2b91af;\">size_t<\/span> a, std::<span style=\"color: #2b91af;\">size_t<\/span> b){ <span style=\"color: #0000ff;\">return<\/span> a + b; });\r\n\r\nstd::cout &lt;&lt; result &lt;&lt; std::endl;      <span style=\"color: #008000;\">\/\/   21<\/span>\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The 0 is the initial value for the reduction.<\/p>\n<h2>What&#8217;s next?<\/h2>\n<p>With the <a href=\"https:\/\/www.modernescpp.com\/index.php\/atomic-smart-pointers\">next post<\/a>, I will go three years further. In C++20, we get atomic smart pointers. In line with their C++11 counterparts, they are called <span style=\"font-family: courier new,courier;\">std::atomic_shared_ptr<\/span> and <span style=\"font-family: courier new,courier;\">std::atomic_weak_ptr<\/span>.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The idea is quite simple. The Standard Template Library has more than 100 algorithms for searching, counting, and manipulating ranges and their elements. With C++17, 69 of them are overloaded, and a few new ones are added. The overloaded and new algorithms can be invoked with a so-called execution policy. 
Using the execution policy, you can specify whether [&hellip;]<\/p>\n","protected":false},"author":21,"featured_media":5192,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[367],"tags":[],"class_list":["post-5195","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-multithreading-c-17-and-c-20"],"_links":{"self":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/5195","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/comments?post=5195"}],"version-history":[{"count":0,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/5195\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media\/5192"}],"wp:attachment":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media?parent=5195"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/categories?post=5195"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/tags?post=5195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}