{"id":6184,"date":"2021-07-22T20:48:10","date_gmt":"2021-07-22T20:48:10","guid":{"rendered":"https:\/\/www.modernescpp.com\/index.php\/performance-of-the-parallel-stl-algorithmn\/"},"modified":"2023-06-26T09:28:41","modified_gmt":"2023-06-26T09:28:41","slug":"performance-of-the-parallel-stl-algorithmn","status":"publish","type":"post","link":"https:\/\/www.modernescpp.com\/index.php\/performance-of-the-parallel-stl-algorithmn\/","title":{"rendered":"Performance of the Parallel STL Algorithms"},"content":{"rendered":"<p>In my last post, &#8220;<a href=\"https:\/\/www.modernescpp.com\/index.php\/parallel-algorithms-of-the-stl-with-gcc\">Parallel Algorithms of the STL with the GCC Compiler<\/a>&#8220;, I presented the necessary theory about the C++17 algorithm. Today, I made a performance test using the Microsoft and GCC compiler to answer the simple question: Does the execution policy pay off?<\/p>\n<p><!--more--><\/p>\n<p>&nbsp;<img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-6176\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/timelineParallelSTL.png\" alt=\"\" width=\"650\" height=\"249\" style=\"display: block; margin-left: auto; margin-right: auto;\" data-alt=\"timelineParallelSTL\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/timelineParallelSTL.png 927w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/timelineParallelSTL-300x115.png 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/timelineParallelSTL-768x294.png 768w\" sizes=\"auto, (max-width: 650px) 100vw, 650px\" \/><\/p>\n<p>The reason for the short detour about my template post is the following. I recognized that GCC supports my favorite C++17 feature: the parallel algorithms of the Standard Template Library. I present in this post the brand-new GCC 11.1, but a GCC 9 should also be acceptable. You have to install a few additional libraries to use the parallel STL algorithms with the GCC.<\/p>\n<h2>Threading Building Blocks<\/h2>\n<p>The GCC uses the Intel Thread Building blocks (TBB) under the hood. The TBB&nbsp;is a C++ template library developed by Intel for parallel programming on multi-core processors.<\/p>\n<p>To be precise, you need TBB 2018 version or higher. When I installed the developer package of the TBB on my Linux desktop (<a href=\"https:\/\/www.suse.com\/\">Suse)<\/a>, the package manager also chose the TBB memory allocator. Using the TBB is easy. You have to link against the TBB using the flag <code>-ltbb.<\/code><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-6181\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/buildGcc.png\" alt=\"buildGcc\" width=\"600\" height=\"108\" style=\"display: block; margin-left: auto; margin-right: auto;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/buildGcc.png 997w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/buildGcc-300x54.png 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/buildGcc-768x138.png 768w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Now, I&#8217;m prepared to take my next steps with parallel algorithms. Here are the first numbers using the Microsoft Compiler 19.16 and GCC 11.1.<\/p>\n<\/p>\n<h2>Performance Numbers with the Microsoft Compiler and the GCC Compiler<\/h2>\n<p>The following program <code>parallelSTLPerformance.cpp<\/code> calculates the tangents with the sequential (1), parallel (2), and parallel and vectorized (3) execution policy.<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #f0f3f3; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<pre style=\"margin: 0px; line-height: 125%;\"><span style=\"color: #0099ff; font-style: italic;\">\/\/ parallelSTLPerformance.cpp<\/span>\r\n\r\n<span style=\"color: #009999;\">#include &lt;algorithm&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;cmath&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;execution&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;string&gt;<\/span>\r\n<span style=\"color: #009999;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #007788; font-weight: bold;\">long<\/span> <span style=\"color: #007788; font-weight: bold;\">long<\/span> size <span style=\"color: #555555;\">=<\/span> <span style=\"color: #ff6600;\">500<\/span><span style=\"color: #aa0000; background-color: #ffaaaa;\">'<\/span><span style=\"color: #ff6600;\">000<\/span><span style=\"color: #aa0000; background-color: #ffaaaa;\">'<\/span><span style=\"color: #ff6600;\">000<\/span>;  \r\n  \r\n<span style=\"color: #006699; font-weight: bold;\">const<\/span> <span style=\"color: #007788; font-weight: bold;\">double<\/span> pi <span style=\"color: #555555;\">=<\/span> std<span style=\"color: #555555;\">::<\/span>acos(<span style=\"color: #555555;\">-<\/span><span style=\"color: #ff6600;\">1<\/span>);\r\n\r\n<span style=\"color: #006699; font-weight: bold;\">template<\/span> <span style=\"color: #555555;\">&lt;<\/span><span style=\"color: #006699; font-weight: bold;\">typename<\/span> Func<span style=\"color: #555555;\">&gt;<\/span>\r\n<span style=\"color: #007788; font-weight: bold;\">void<\/span> getExecutionTime(<span style=\"color: #006699; font-weight: bold;\">const<\/span> std<span style=\"color: #555555;\">::<\/span>string<span style=\"color: #555555;\">&amp;<\/span> title, Func func){                   <span style=\"color: #0099ff;\">\/\/ (4)<\/span>\r\n    \r\n  <span style=\"color: #006699; font-weight: bold;\">const<\/span> <span style=\"color: #006699; font-weight: bold;\">auto<\/span> sta <span style=\"color: #555555;\">=<\/span> std<span style=\"color: #555555;\">::<\/span>chrono<span style=\"color: #555555;\">::<\/span>steady_clock<span style=\"color: #555555;\">::<\/span>now();\r\n  func();                                                                    <span style=\"color: #0099ff;\"> \/\/ (5)<\/span>\r\n  <span style=\"color: #006699; font-weight: bold;\">const<\/span> std<span style=\"color: #555555;\">::<\/span>chrono<span style=\"color: #555555;\">::<\/span>duration<span style=\"color: #555555;\">&lt;<\/span><span style=\"color: #007788; font-weight: bold;\">double<\/span><span style=\"color: #555555;\">&gt;<\/span> dur <span style=\"color: #555555;\">=<\/span> std<span style=\"color: #555555;\">::<\/span>chrono<span style=\"color: #555555;\">::<\/span>steady_clock<span style=\"color: #555555;\">::<\/span>now() <span style=\"color: #555555;\">-<\/span> sta;\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> title <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\": \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> dur.count() <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">\" sec. \"<\/span> <span style=\"color: #555555;\">&lt;&lt;<\/span> std<span style=\"color: #555555;\">::<\/span>endl;\r\n     \r\n}\r\n    \r\n<span style=\"color: #007788; font-weight: bold;\">int<\/span> main(){\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">'\\n'<\/span>;\r\n    \r\n  std<span style=\"color: #555555;\">::<\/span>vector<span style=\"color: #555555;\">&lt;<\/span><span style=\"color: #007788; font-weight: bold;\">double<\/span><span style=\"color: #555555;\">&gt;<\/span> randValues;\r\n  randValues.reserve(size);\r\n   \r\n  std<span style=\"color: #555555;\">::<\/span>mt19937 engine;\r\n  std<span style=\"color: #555555;\">::<\/span>uniform_real_distribution<span style=\"color: #555555;\">&lt;&gt;<\/span> uniformDist(<span style=\"color: #ff6600;\">0<\/span>, pi <span style=\"color: #555555;\">\/<\/span> <span style=\"color: #ff6600;\">2<\/span>);\r\n  <span style=\"color: #006699; font-weight: bold;\">for<\/span> (<span style=\"color: #007788; font-weight: bold;\">long<\/span> <span style=\"color: #007788; font-weight: bold;\">long<\/span> i <span style=\"color: #555555;\">=<\/span> <span style=\"color: #ff6600;\">0<\/span> ; i <span style=\"color: #555555;\">&lt;<\/span> size ; <span style=\"color: #555555;\">++<\/span>i) randValues.push_back(uniformDist(engine));\r\n    \r\n  std<span style=\"color: #555555;\">::<\/span>vector<span style=\"color: #555555;\">&lt;<\/span><span style=\"color: #007788; font-weight: bold;\">double<\/span><span style=\"color: #555555;\">&gt;<\/span> workVec(randValues);\r\n    \r\n  getExecutionTime(<span style=\"color: #cc3300;\">\"std::execution::seq\"<\/span>, [workVec]() <span style=\"color: #006699; font-weight: bold;\">mutable<\/span> {              <span style=\"color: #0099ff;\">  \/\/ (6)<\/span>\r\n    std<span style=\"color: #555555;\">::<\/span>transform(std<span style=\"color: #555555;\">::<\/span>execution<span style=\"color: #555555;\">::<\/span>seq, workVec.begin(), workVec.end(),        <span style=\"color: #0099ff;\">\/\/ (1)<\/span>\r\n\t\t   workVec.begin(), \r\n                   [](<span style=\"color: #007788; font-weight: bold;\">double<\/span> arg){ <span style=\"color: #006699; font-weight: bold;\">return<\/span> std<span style=\"color: #555555;\">::<\/span>tan(arg); }              \r\n                  );\r\n    });\r\n        \r\n  getExecutionTime(<span style=\"color: #cc3300;\">\"std::execution::par\"<\/span>, [workVec]() <span style=\"color: #006699; font-weight: bold;\">mutable<\/span> {                <span style=\"color: #0099ff;\">\/\/ (7)<\/span>\r\n    std<span style=\"color: #555555;\">::<\/span>transform(std<span style=\"color: #555555;\">::<\/span>execution<span style=\"color: #555555;\">::<\/span>par, workVec.begin(), workVec.end(),        <span style=\"color: #0099ff;\">\/\/ (2)<\/span>\r\n\t\t   workVec.begin(), \r\n                   [](<span style=\"color: #007788; font-weight: bold;\">double<\/span> arg){ <span style=\"color: #006699; font-weight: bold;\">return<\/span> std<span style=\"color: #555555;\">::<\/span>tan(arg); }\r\n                  );\r\n  });\r\n     \r\n  getExecutionTime(<span style=\"color: #cc3300;\">\"std::execution::par_unseq\"<\/span>, [workVec]() <span style=\"color: #006699; font-weight: bold;\">mutable<\/span> {         <span style=\"color: #0099ff;\"> \/\/ (8)<\/span>\r\n    std<span style=\"color: #555555;\">::<\/span>transform(std<span style=\"color: #555555;\">::<\/span>execution<span style=\"color: #555555;\">::<\/span>par_unseq, workVec.begin(), workVec.end(),  <span style=\"color: #0099ff;\">\/\/ (3)<\/span>\r\n\t\t   workVec.begin(), \r\n                   [](<span style=\"color: #007788; font-weight: bold;\">double<\/span> arg){ <span style=\"color: #006699; font-weight: bold;\">return<\/span> std<span style=\"color: #555555;\">::<\/span>tan(arg); }\r\n                  );\r\n  });\r\n\r\n  std<span style=\"color: #555555;\">::<\/span>cout <span style=\"color: #555555;\">&lt;&lt;<\/span> <span style=\"color: #cc3300;\">'\\n'<\/span>;\r\n    \r\n}\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>First, the vector <code>randValues<\/code> is filled with 500 million numbers from the half-open interval [0, pi \/ 2 [. The function template <code>getExecutionTime<\/code> (4) gets the name of the execution policy, and the lambda function executes the lambda function (5) and shows the execution time. There is one particular point about the three lambda functions ((6), (7), and (8)) used in this program. They are declared mutable. This is necessary because the lambda functions modify their argument <code>workVec<\/code>. Lambda functions are, per default, constant. If a lambda function wants to change its values, it has to be declared mutable.<\/p>\n<p>Let me start with the windows performance numbers. But before I do that, I have to make a short disclaimer.<\/p>\n<h3>Disclaimer<\/h3>\n<p>I explicitly want to emphasize this. I don&#8217;t want to compare Windows and Linux because both computers run Windows, and Linux has different capabilities. These performance numbers should only give you a gut feeling. If you want the numbers for your system, you must repeat the test.<\/p>\n<p>I use maximum optimization on Windows and Linux. This means for Windows, the flag<code> \/O2<\/code> and on Linux, the flag<code> -O3<\/code>.<\/p>\n<p>To make it short. I&#8217;m keen to know if the parallel execution of the STL algorithms pays and to what extent. My main focus is the relative performance of sequential and parallel execution.<\/p>\n<h3>Windows<\/h3>\n<p>My windows laptop has eight logical cores, but the parallel execution is more than ten times faster.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-6182\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelSTLPerformance.PNG\" alt=\"parallelSTLPerformance\" width=\"600\" height=\"252\" style=\"display: block; margin-left: auto; margin-right: auto;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelSTLPerformance.PNG 2321w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelSTLPerformance-300x126.png 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelSTLPerformance-1024x430.png 1024w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelSTLPerformance-768x322.png 768w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelSTLPerformance-1536x645.png 1536w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelSTLPerformance-2048x859.png 2048w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p>The numbers for the parallel and the parallel and vectorized execution are in the same ballpark. Here is the explanation for the Visual C++ Team Blog: <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/using-c17-parallel-algorithms-for-better-performance\/\">Using C++17 Parallel Algorithms for Better Performance<\/a>: <em>Note that the Visual C++ implementation implements the parallel and parallel unsequenced policies the same way, so you should not expect better performance for using par_unseq on our implementation, but implementations may exist that can use that additional freedom someday.<\/em><\/p>\n<h3>Linux<\/h3>\n<p>My Linux computer has only four cores. Here are the numbers.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-6183\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelPerformanceSTL.png\" alt=\"parallelPerformanceSTL\" width=\"483\" height=\"250\" style=\"display: block; margin-left: auto; margin-right: auto;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelPerformanceSTL.png 483w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2021\/07\/parallelPerformanceSTL-300x155.png 300w\" sizes=\"auto, (max-width: 483px) 100vw, 483px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>The numbers are as expected. I have four cores, and the parallel execution is about four times faster than the sequential execution. The performance numbers of the parallel and vectorized version and the parallel version are in the same ballpark. Therefore, I assume that the GCC compiler uses the same strategy as the Windows compiler. When I ask for parallel and vectorized execution by using the execution policy <code>std::execute::par_unseq<\/code>I get the parallel execution policy (<code>std::execute::par<\/code>). This behavior is according to the C++17 standard because the execution policies are only a hint for the compiler.<\/p>\n<p>To my knowledge, neither the Windows compiler nor the GCC compiler supports the parallel and vectorized execution of parallel STL algorithms. When you want to see the parallel and vectorized algorithms in action, Nvidias STL implementation <a href=\"https:\/\/docs.nvidia.com\/cuda\/thrust\/index.html\">Thrust <\/a>may be an ideal candidate. For further information, read the following Nvidia post: <a href=\"https:\/\/developer.nvidia.com\/blog\/accelerating-standard-c-with-gpus-using-stdpar\/\" target=\"_blank\" rel=\"external noopener\">&#8220;C++ Standard Parallelism<\/a>&#8220;.<\/p>\n<h2>What&#8217;s next?<\/h2>\n<p>&nbsp;After this C++17 detour, I return to my original path: templates. In my<a href=\"https:\/\/www.modernescpp.com\/index.php\/template-instantiation\"> next post,<\/a> I will dive deeper into templates and write about template instantiation.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my last post, &#8220;Parallel Algorithms of the STL with the GCC Compiler&#8220;, I presented the necessary theory about the C++17 algorithm. Today, I made a performance test using the Microsoft and GCC compiler to answer the simple question: Does the execution policy pay off?<\/p>\n","protected":false},"author":21,"featured_media":6176,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[370],"tags":[444],"class_list":["post-6184","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-c-17","tag-parallel-stl"],"_links":{"self":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/6184","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/comments?post=6184"}],"version-history":[{"count":1,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/6184\/revisions"}],"predecessor-version":[{"id":6704,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/6184\/revisions\/6704"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media\/6176"}],"wp:attachment":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media?parent=6184"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/categories?post=6184"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/tags?post=6184"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}