{"id":4921,"date":"2016-09-05T05:17:33","date_gmt":"2016-09-05T05:17:33","guid":{"rendered":"https:\/\/www.modernescpp.com\/index.php\/multithreaded-summation-of-a-vector\/"},"modified":"2023-06-26T12:43:56","modified_gmt":"2023-06-26T12:43:56","slug":"multithreaded-summation-of-a-vector","status":"publish","type":"post","link":"https:\/\/www.modernescpp.com\/index.php\/multithreaded-summation-of-a-vector\/","title":{"rendered":"Multithreaded: Summation of a Vector"},"content":{"rendered":"<p>My goal is to sum up all elements of a vector. I used in the<a href=\"https:\/\/www.modernescpp.com\/index.php\/single-threaded-sum-of-the-elements-of-a-vector\"> last post<\/a> a single thread. In this post, I use multiple threads and, therefore, the full power of my PC. The addition will be done on a shared variable. What, at first glance, seems like a good idea is a very naive strategy. The synchronization overhead of the summation variable is&nbsp;higher than the performance benefit of my four or two cores.<\/p>\n<p><!--more--><\/p>\n<p>&nbsp;<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_strategy\"><\/span>The strategy<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Following my last post, I sum up 100 000 000 million random numbers between 1 and 10. To be sure that my calculation is right I reduce the randomness. So I use no seed, and I get each time the same random numbers on my two architectures each. Therefore it&#8217;s easy to verify my total result. Both calculations will run on my 4-CPU Linux and my 2-CPU Windows PC, as ever, with maximum and without optimization. On Windows, I was very puzzled.&nbsp;<\/p>\n<p>What are the exciting questions?<\/p>\n<ol>\n<li>How differs locks and atomics work?<\/li>\n<li>What&#8217;s the difference between the <a href=\"https:\/\/www.modernescpp.com\/index.php\/single-threaded-sum-of-the-elements-of-a-vector\">single threaded <\/a>and the multithreading execution of <span style=\"font-family: courier new,courier;\">std::accumulate<\/span>?<\/li>\n<\/ol>\n<h2><span class=\"ez-toc-section\" id=\"Protection_of_the_shared_variable_with_the_std_lock_guard\"><\/span>Protection of the shared variable with the std::lock_guard<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The simplest way to protect a shared variable is to wrap a mutex in a lock.<\/p>\n<p>&nbsp;<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34\r\n35\r\n36\r\n37\r\n38\r\n39\r\n40\r\n41\r\n42\r\n43\r\n44\r\n45\r\n46\r\n47\r\n48\r\n49\r\n50\r\n51\r\n52\r\n53\r\n54\r\n55\r\n56\r\n57<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008000;\">\/\/ synchronizationWithLock.cpp<\/span>\r\n\r\n<span style=\"color: #0000ff;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;mutex&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;thread&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;utility&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> size= 100000000;   \r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> firBound=  25000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> secBound=  50000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> thiBound=  75000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> fouBound= 100000000;\r\n\r\nstd::mutex myMutex;\r\n\r\n<span style=\"color: #2b91af;\">void<\/span> sumUp(<span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span>&amp; sum, <span style=\"color: #0000ff;\">const<\/span> std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt;&amp; val, <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> beg, <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> end){\r\n    <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #0000ff;\">auto<\/span> it= beg; it &lt; end; ++it){\r\n        std::lock_guard&lt;std::mutex&gt; myLock(myMutex);\r\n        sum+= val[it];\r\n    }\r\n}\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; randValues;\r\n  randValues.reserve(size);\r\n\r\n  std::mt19937 engine;\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(1,10);\r\n  <span style=\"color: #0000ff;\">for<\/span> ( <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> i=0 ; i&lt; size ; ++i) randValues.push_back(uniformDist(engine));\r\n \r\n  <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> sum= 0;\r\n  <span style=\"color: #0000ff;\">auto<\/span> start = std::chrono::system_clock::now();\r\n  \r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t1(sumUp,std::ref(sum),std::ref(randValues),0,firBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t2(sumUp,std::ref(sum),std::ref(randValues),firBound,secBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t3(sumUp,std::ref(sum),std::ref(randValues),secBound,thiBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t4(sumUp,std::ref(sum),std::ref(randValues),thiBound,fouBound);   \r\n  \r\n \r\n  t1.join();\r\n  t2.join();\r\n  t3.join();\r\n  t4.join();\r\n  std::chrono::duration&lt;<span style=\"color: #2b91af;\">double<\/span>&gt; dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; sum &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The program is easy to explain. The function <span style=\"font-family: courier new,courier;\">sumUp<\/span> (lines 20 &#8211; 25) is each thread&#8217;s work package. This work package consists of the summation variable <span style=\"font-family: courier new,courier;\">sum<\/span> and the <span style=\"font-family: courier new,courier;\">std::vector val<\/span>, both getting by reference. beg and end limit the range on which the summation takes place. As mentioned, I use a <span style=\"font-family: courier new,courier;\">std::lock_guard <\/span>(line 22) to protect the shared variable. Each thread line 41 &#8211; 44 does a quarter of the work.&nbsp;<\/p>\n<p>Here are the numbers of the program.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Without_optimization\"><\/span>Without optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4904\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLock.png\" alt=\"synchronizeWithLock\" width=\"400\" height=\"175\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLock.png 458w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLock-300x131.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4905\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLock_win.png\" alt=\"synchronizeWithLock win\" width=\"450\" height=\"158\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLock_win.png 769w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLock_win-300x105.png 300w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Maximum_optimization\"><\/span>Maximum optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4906\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLockOpt.png\" alt=\"synchronizeWithLockOpt\" width=\"400\" height=\"175\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLockOpt.png 458w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLockOpt-300x131.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4907\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLockOpt_win.png\" alt=\"synchronizeWithLockOpt win\" width=\"450\" height=\"158\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLockOpt_win.png 769w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithLockOpt_win-300x105.png 300w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/p>\n<p>The bottleneck of the program is the shared variable, which is expensively protected by a <span style=\"font-family: courier new,courier;\">std::lock_guard. <\/span>Therefore the obvious solution is to replace the heavyweight lock with a lightweight atomic.<span style=\"font-family: courier new,courier;\"> <\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Addition_with_an_atomic\"><\/span>Addition with an atomic<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The variable <span style=\"font-family: courier new,courier;\">sum<\/span> is atomic. So I can skip the <span style=\"font-family: courier new,courier;\">std::lock_guard<\/span> in the function <span style=\"font-family: courier new,courier;\">sumUp<\/span> (lines 18 &#8211; 22). That was all.<\/p>\n<p>&nbsp;<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34\r\n35\r\n36\r\n37\r\n38\r\n39\r\n40\r\n41\r\n42\r\n43\r\n44\r\n45\r\n46\r\n47\r\n48\r\n49\r\n50\r\n51\r\n52\r\n53\r\n54<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008000;\">\/\/ synchronizationWithAtomic.cpp<\/span>\r\n\r\n<span style=\"color: #0000ff;\">#include &lt;atomic&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;thread&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;utility&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> size= 100000000;   \r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> firBound=  25000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> secBound=  50000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> thiBound=  75000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> fouBound= 100000000;\r\n\r\n<span style=\"color: #2b91af;\">void<\/span> sumUp(std::atomic&lt;<span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span>&gt;&amp; sum, <span style=\"color: #0000ff;\">const<\/span> std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt;&amp; val, <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> beg, <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> end){\r\n    <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #0000ff;\">auto<\/span> it= beg; it &lt; end; ++it){\r\n        sum+= val[it];\r\n    }\r\n}\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; randValues;\r\n  randValues.reserve(size);\r\n\r\n  std::mt19937 engine;\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(1,10);\r\n  <span style=\"color: #0000ff;\">for<\/span> ( <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> i=0 ; i&lt; size ; ++i) randValues.push_back(uniformDist(engine));\r\n \r\n  std::atomic&lt;<span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span>&gt; sum(0);\r\n  <span style=\"color: #0000ff;\">auto<\/span> start = std::chrono::system_clock::now();\r\n  \r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t1(sumUp,std::ref(sum),std::ref(randValues),0,firBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t2(sumUp,std::ref(sum),std::ref(randValues),firBound,secBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t3(sumUp,std::ref(sum),std::ref(randValues),secBound,thiBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t4(sumUp,std::ref(sum),std::ref(randValues),thiBound,fouBound);   \r\n  \r\n \r\n  t1.join();\r\n  t2.join();\r\n  t3.join();\r\n  t4.join();\r\n  std::chrono::duration&lt;<span style=\"color: #2b91af;\">double<\/span>&gt; dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; sum &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Without_optimization-2\"><\/span>Without optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4908\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomic.png\" alt=\"synchronizeWithAtomic\" width=\"400\" height=\"175\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomic.png 458w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomic-300x131.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4909\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomic_win.png\" alt=\"synchronizeWithAtomic win\" width=\"450\" height=\"158\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomic_win.png 769w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomic_win-300x105.png 300w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Maximum_optimization-2\"><\/span>Maximum optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4910\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomicOpt.png\" alt=\"synchronizeWithAtomicOpt\" width=\"400\" height=\"175\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomicOpt.png 458w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomicOpt-300x131.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4911\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomicOpt_win.png\" alt=\"synchronizeWithAtomicOpt win\" width=\"450\" height=\"158\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomicOpt_win.png 769w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithAtomicOpt_win-300x105.png 300w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/p>\n<h2><span class=\"ez-toc-section\" id=\"A_strange_phenomenon\"><\/span>A strange phenomenon<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If you study the numbers carefully, you will notice a strange phenomenon on Windows. The maximum optimized program is slower than the non-optimized one. That observation will also hold for the next two variations. This puzzled me.&nbsp;I executed the program in addition to a virtualized Windows 8 PC with only one core. Here the optimized version was faster. Something strange is going on with my Windows 10 PC and atomics.<\/p>\n<p>Besides +=, there is a further way to calculate the sum of an atomic with fetch_add. Let&#8217;s try it out. The numbers should be similar.<\/p>\n<p>&nbsp;<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Addition_with_fetch_add\"><\/span>Addition with fetch_add<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The change in the source code is minimal. I have only to touch line 20.<\/p>\n<p>&nbsp;<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34\r\n35\r\n36\r\n37\r\n38\r\n39\r\n40\r\n41\r\n42\r\n43\r\n44\r\n45\r\n46\r\n47\r\n48\r\n49\r\n50\r\n51\r\n52\r\n53\r\n54<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008000;\">\/\/ synchronizationWithFetchAdd.cpp<\/span>\r\n\r\n<span style=\"color: #0000ff;\">#include &lt;atomic&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;thread&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;utility&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> size= 100000000;   \r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> firBound=  25000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> secBound=  50000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> thiBound=  75000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> fouBound= 100000000;\r\n\r\n<span style=\"color: #2b91af;\">void<\/span> sumUp(std::atomic&lt;<span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span>&gt;&amp; sum, <span style=\"color: #0000ff;\">const<\/span> std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt;&amp; val, <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> beg, <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> end){\r\n    <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #0000ff;\">auto<\/span> it= beg; it &lt; end; ++it){\r\n\t\tsum.fetch_add(val[it]);\r\n    }\r\n}\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; randValues;\r\n  randValues.reserve(size);\r\n\r\n  std::mt19937 engine;\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(1,10);\r\n  <span style=\"color: #0000ff;\">for<\/span> ( <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> i=0 ; i&lt; size ; ++i) randValues.push_back(uniformDist(engine));\r\n \r\n  std::atomic&lt;<span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span>&gt; sum(0);\r\n  <span style=\"color: #0000ff;\">auto<\/span> start = std::chrono::system_clock::now();\r\n  \r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t1(sumUp,std::ref(sum),std::ref(randValues),0,firBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t2(sumUp,std::ref(sum),std::ref(randValues),firBound,secBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t3(sumUp,std::ref(sum),std::ref(randValues),secBound,thiBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t4(sumUp,std::ref(sum),std::ref(randValues),thiBound,fouBound);   \r\n  \r\n \r\n  t1.join();\r\n  t2.join();\r\n  t3.join();\r\n  t4.join();\r\n  std::chrono::duration&lt;<span style=\"color: #2b91af;\">double<\/span>&gt; dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; sum &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<h3><span class=\"ez-toc-section\" id=\"i\"><\/span>&nbsp;<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<h3><span class=\"ez-toc-section\" id=\"Without_optimization-3\"><\/span>Without optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4912\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAdd.png\" alt=\"synchronizeWithFetchAdd\" width=\"400\" height=\"175\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAdd.png 458w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAdd-300x131.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><strong><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4913\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAdd_win.png\" alt=\"synchronizeWithFetchAdd win\" width=\"450\" height=\"158\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAdd_win.png 769w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAdd_win-300x105.png 300w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/strong><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Maximum_optimization-3\"><\/span>Maximum optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4914\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddOpt.png\" alt=\"synchronizeWithFetchAddOpt\" width=\"400\" height=\"175\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddOpt.png 458w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddOpt-300x131.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4915\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddOpt_win.png\" alt=\"synchronizeWithFetchAddOpt win\" width=\"450\" height=\"158\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddOpt_win.png 769w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddOpt_win-300x105.png 300w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/p>\n<p>Strictly speaking, is the <span style=\"font-family: courier new,courier;\">fetch_add<\/span> variation no improvement to the += variation but the contrary? The += variation is more intuitive. But wait, there is a small difference.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"In_addition_with_fetch_add_and_relaxed_semantic\"><\/span>In addition, with fetch_add and relaxed semantic<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The default behavior for atomics is <a href=\"https:\/\/www.modernescpp.com\/index.php\/sequential-consistency\">sequential consistency<\/a>. This statement is true for adding and assigning an atomic and, of course, for the fetch_add variant. But we can do better. Let&#8217;s adjust the memory model with the<a href=\"https:\/\/www.modernescpp.com\/index.php\/atomics\"> fetch variations<\/a>. That&#8217;s the final step in my optimization. You see it in line 20.<\/p>\n<p>&nbsp;<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34\r\n35\r\n36\r\n37\r\n38\r\n39\r\n40\r\n41\r\n42\r\n43\r\n44\r\n45\r\n46\r\n47\r\n48\r\n49\r\n50\r\n51\r\n52\r\n53\r\n54<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008000;\">\/\/ synchronizationWithFetchAddRelaxed.cpp<\/span>\r\n\r\n<span style=\"color: #0000ff;\">#include &lt;atomic&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;thread&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;utility&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> size= 100000000;   \r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> firBound=  25000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> secBound=  50000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> thiBound=  75000000;\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> fouBound= 100000000;\r\n\r\n<span style=\"color: #2b91af;\">void<\/span> sumUp(std::atomic&lt;<span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span>&gt;&amp; sum, <span style=\"color: #0000ff;\">const<\/span> std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt;&amp; val, <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> beg, <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> end){\r\n    <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #0000ff;\">auto<\/span> it= beg; it &lt; end; ++it){\r\n\t\tsum.fetch_add(val[it],std::memory_order_relaxed);\r\n    }\r\n}\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; randValues;\r\n  randValues.reserve(size);\r\n\r\n  std::mt19937 engine;\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(1,10);\r\n  <span style=\"color: #0000ff;\">for<\/span> ( <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> i=0 ; i&lt; size ; ++i) randValues.push_back(uniformDist(engine));\r\n \r\n  std::atomic&lt;<span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span>&gt; sum(0);\r\n  <span style=\"color: #0000ff;\">auto<\/span> start = std::chrono::system_clock::now();\r\n  \r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t1(sumUp,std::ref(sum),std::ref(randValues),0,firBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t2(sumUp,std::ref(sum),std::ref(randValues),firBound,secBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t3(sumUp,std::ref(sum),std::ref(randValues),secBound,thiBound);\r\n  std::<span style=\"color: #0000ff;\">thread<\/span> t4(sumUp,std::ref(sum),std::ref(randValues),thiBound,fouBound);   \r\n  \r\n \r\n  t1.join();\r\n  t2.join();\r\n  t3.join();\r\n  t4.join();\r\n  std::chrono::duration&lt;<span style=\"color: #2b91af;\">double<\/span>&gt; dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; sum &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The question is. Why is it ok to use the <a href=\"https:\/\/www.modernescpp.com\/index.php\/relaxed-semantic\">relaxed-semantics<\/a> in line 20? Relaxed-semantics will not guarantee that one thread sees the operation in another in the same order. But this is not necessary. Necessary is only that each addition is atomically performed.&nbsp;<\/p>\n<p>Does the optimization pay off?<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Without_optimization-4\"><\/span>Without optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4916\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxed.png\" alt=\"synchronizeWithFetchAddRelaxed\" width=\"400\" height=\"175\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxed.png 458w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxed-300x131.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4917\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxed_win.png\" alt=\"synchronizeWithFetchAddRelaxed win\" width=\"450\" height=\"143\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxed_win.png 847w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxed_win-300x96.png 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxed_win-768x245.png 768w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Maximum_optimization-4\"><\/span>Maximum optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4918\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxedOpt.png\" alt=\"synchronizeWithFetchAddRelaxedOpt\" width=\"400\" height=\"158\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxedOpt.png 506w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxedOpt-300x119.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4919\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxedOpt_win.png\" alt=\"synchronizeWithFetchAddRelaxedOpt win\" width=\"450\" height=\"143\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxedOpt_win.png 847w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxedOpt_win-300x96.png 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/synchronizeWithFetchAddRelaxedOpt_win-768x245.png 768w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/p>\n<p>As expected, for Linux and GCC is, the <span style=\"font-family: courier new,courier;\">fetch_add<\/span> variant with relaxed-semantic the fastest one. I&#8217;m still puzzled with Windows.<\/p>\n<p>In the end, all numbers are together in a table.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_overview\"><\/span>The overview<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Although I have successively optimized the access to the shared variable and improved the performance accordingly, the results are not very promising. The addition in the <a href=\"https:\/\/www.modernescpp.com\/index.php\/single-threaded-sum-of-the-elements-of-a-vector\">single threaded <\/a>case with <span style=\"font-family: courier new,courier;\">std::accumulate<\/span> is far faster. To say it precisely <strong>40 times.<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4920\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/MultithraedeSharedVariableEng.png\" alt=\"MultithraedeSharedVariableEng\" width=\"700\" height=\"186\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/MultithraedeSharedVariableEng.png 881w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/MultithraedeSharedVariableEng-300x80.png 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/MultithraedeSharedVariableEng-768x204.png 768w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Whats_next\"><\/span>What&#8217;s next?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>I will combine in the <a href=\"https:\/\/www.modernescpp.com\/index.php\/multithreaded-summation-with-minimal-synchronization\">next post <\/a>the best of the two worlds. I combine the non-synchronized summation in one thread with the power of many threads. Let&#8217;s see if I beat the performance of the single thread variant of <span style=\"font-family: courier new,courier;\">std::accumulate<\/span>.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My goal is to sum up all elements of a vector. I used in the last post a single thread. In this post, I use multiple threads and, therefore, the full power of my PC. The addition will be done on a shared variable. What, at first glance, seems like a good idea is a [&hellip;]<\/p>\n","protected":false},"author":21,"featured_media":4904,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[369],"tags":[470],"class_list":["post-4921","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-multithreading-application","tag-performance"],"_links":{"self":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/4921","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/comments?post=4921"}],"version-history":[{"count":1,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/4921\/revisions"}],"predecessor-version":[{"id":6950,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/4921\/revisions\/6950"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media\/4904"}],"wp:attachment":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media?parent=4921"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/categories?post=4921"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/tags?post=4921"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}