{"id":4903,"date":"2016-09-03T15:55:43","date_gmt":"2016-09-03T15:55:43","guid":{"rendered":"https:\/\/www.modernescpp.com\/index.php\/single-threaded-sum-of-the-elements-of-a-vector\/"},"modified":"2023-06-26T12:44:22","modified_gmt":"2023-06-26T12:44:22","slug":"single-threaded-sum-of-the-elements-of-a-vector","status":"publish","type":"post","link":"https:\/\/www.modernescpp.com\/index.php\/single-threaded-sum-of-the-elements-of-a-vector\/","title":{"rendered":"Single Threaded: Summation of a Vector"},"content":{"rendered":"<p>What is the fastest way to add the elements of a <span style=\"font-family: courier new,courier;\">std::vector<\/span>? This a question that I will pursue in the following posts. I use the single-threaded addition as the reference number. In further posts, I discuss atomics, locks, and thread-local data.<\/p>\n<p><!--more--><\/p>\n<p>&nbsp;<\/p>\n<h2><span class=\"ez-toc-section\" id=\"My_strategy\"><\/span>My strategy<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>My plan is to fill a std::vector with one hundred million arbitrary numbers between 1 and 10. I apply the uniform distribution to get the numbers. The task is to calculate the sum of all values.&nbsp;<\/p>\n<p>I usually use my Linux desktop and Windows laptop to get the numbers. The Linux PC has four, and my Windows PC has two cores. Here are details of my compilers: <a href=\"https:\/\/www.modernescpp.com\/index.php\/thread-safe-initialization-of-a-singleton\">Thread-safe initialization of a singleton<\/a>. I will measure the performance with and without maximum optimization.<\/p>\n<\/p>\n<h2><span class=\"ez-toc-section\" id=\"A_simple_loop\"><\/span>A simple loop<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The obvious strategy is to add the numbers in a range-based for loop.<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008000;\">\/\/ calculateWithLoop.cpp<\/span>\r\n\r\n<span style=\"color: #0000ff;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> size= 100000000;\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; randValues;\r\n  randValues.reserve(size);\r\n\r\n  <span style=\"color: #008000;\">\/\/ random values<\/span>\r\n  std::random_device seed;\r\n  std::mt19937 engine(seed());\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(1,10);\r\n  <span style=\"color: #0000ff;\">for<\/span> ( <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> i=0 ; i&lt; size ; ++i) randValues.push_back(uniformDist(engine));\r\n \r\n  <span style=\"color: #0000ff;\">auto<\/span> start = std::chrono::system_clock::now();\r\n  \r\n  <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> add= {};\r\n  <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #0000ff;\">auto<\/span> n: randValues) add+= n;\r\n \r\n  std::chrono::duration&lt;<span style=\"color: #2b91af;\">double<\/span>&gt; dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; add &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The&nbsp;calculation takes place in line 26. How fast are my computers?&nbsp;<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Without_optimization\"><\/span>Without optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>&nbsp;<img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4886\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoop.png\" alt=\"CalculateWithLoop\" width=\"350\" height=\"174\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoop.png 399w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoop-300x149.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4887\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoop_win.PNG\" alt=\"CalculateWithLoop win\" width=\"350\" height=\"145\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoop_win.PNG 652w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoop_win-300x124.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Maximum_optimization\"><\/span>Maximum optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>&nbsp;<img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4888\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoopOpt.png\" alt=\"CalculateWithLoopOpt\" width=\"350\" height=\"174\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoopOpt.png 399w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoopOpt-300x149.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4889\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoopOpt_win.PNG\" alt=\"CalculateWithLoopOpt win\" width=\"350\" height=\"145\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoopOpt_win.PNG 652w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLoopOpt_win-300x124.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/p>\n<p>You should not use explicit loops. A rule which, in particular, holds for Windows. That&#8217;s simple therefore, I have to look in the Standard Template Library (STL).&nbsp;<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The_STL_with_std_accumulate\"><\/span>The STL with std::accumulate<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-family: courier new,courier;\">std::accumulate<\/span> is the standard way to calculate the sum.<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34\r\n35<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008000;\">\/\/ calculateWithStd.cpp<\/span>\r\n\r\n<span style=\"color: #0000ff;\">#include &lt;algorithm&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;numeric&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> size= 100000000;\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; randValues;\r\n  randValues.reserve(size);\r\n\r\n  <span style=\"color: #008000;\">\/\/ random values<\/span>\r\n  std::random_device seed;\r\n  std::mt19937 engine(seed());\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(1,10);\r\n  <span style=\"color: #0000ff;\">for<\/span> ( <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> i=0 ; i&lt; size ; ++i) randValues.push_back(uniformDist(engine));\r\n \r\n  <span style=\"color: #0000ff;\">auto<\/span> start = std::chrono::system_clock::now();\r\n  \r\n  <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> add= std::accumulate(randValues.begin(),randValues.end(),0);\r\n \r\n  std::chrono::duration&lt;<span style=\"color: #2b91af;\">double<\/span>&gt; dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; add &lt;&lt; std::endl;\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The key lines are line 27. The performance of<span style=\"font-family: courier new,courier;\"> std::acummulate<\/span> corresponds to the performance of the range-based for loop. But not for Windows.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Without_optimization-2\"><\/span>Without optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<h3><span class=\"ez-toc-section\" id=\"i\"><\/span>&nbsp;<img decoding=\"async\" class=\" size-full wp-image-4890\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStd.png\" alt=\"CalculateWithStd\" width=\"350\" height=\"NaN\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStd.png 399w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStd-300x149.png 300w\" sizes=\"(max-width: 399px) 100vw, 399px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4891\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStd_win.PNG\" alt=\"CalculateWithStd win\" width=\"350\" height=\"145\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStd_win.PNG 652w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStd_win-300x124.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<h3><span class=\"ez-toc-section\" id=\"Maximum_optimization-2\"><\/span>Maximum optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>&nbsp;<img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4892\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStdOpt.png\" alt=\"CalculateWithStdOpt\" width=\"350\" height=\"174\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStdOpt.png 399w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStdOpt-300x149.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4893\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStdOpt_win.PNG\" alt=\"CalculateWithStdOpt win\" width=\"350\" height=\"145\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStdOpt_win.PNG 652w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithStdOpt_win-300x124.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/p>\n<p>That was all. I have my numbers to compare the single-threaded with the multithreading program. Really? I&#8217;m very curious about protecting the summation with a lock or using an atomic. So we get the overhead of protection.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Protection_by_a_lock\"><\/span>Protection by a lock<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If I protect the access to the summation variable with a lock, I get the answers to my two questions.<span style=\"font-family: courier new,courier;\"><\/span><\/p>\n<ol>\n<li>How expensive is the synchronization of a lock?<\/li>\n<li>How fast can a lock be if no concurrent access to a variable takes place?<\/li>\n<\/ol>\n<p>Of course, I can rephrase point 2. If more the one thread accesses the shared variable, the access time decreases.<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34\r\n35\r\n36\r\n37\r\n38\r\n39\r\n40\r\n41<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008000;\">\/\/ calculateWithLock.cpp<\/span>\r\n\r\n<span style=\"color: #0000ff;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;mutex&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;numeric&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> size= 100000000;\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; randValues;\r\n  randValues.reserve(size);\r\n\r\n  <span style=\"color: #008000;\">\/\/ random values<\/span>\r\n  std::random_device seed;\r\n  std::mt19937 engine(seed());\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(1,10);\r\n  <span style=\"color: #0000ff;\">for<\/span> ( <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> i=0 ; i&lt; size ; ++i) randValues.push_back(uniformDist(engine));\r\n\r\n  std::mutex myMutex;\r\n \r\n  <span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">int<\/span> add= 0;\r\n  <span style=\"color: #0000ff;\">auto<\/span> start = std::chrono::system_clock::now();\r\n  \r\n  <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #0000ff;\">auto<\/span> i: randValues){\r\n      std::lock_guard&lt;std::mutex&gt; myLockGuard(myMutex);\r\n      add+= i;\r\n  }\r\n \r\n  std::chrono::duration&lt;<span style=\"color: #2b91af;\">double<\/span>&gt; dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; add &lt;&lt; std::endl;\r\n    \r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>That&#8217;s the result I assumed. The access on the protected variable <span style=\"font-family: courier new,courier;\">add<\/span> is slower.<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<p>&nbsp;<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Without_optimization-3\"><\/span>Without optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4894\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLock.png\" alt=\"CalculateWithLock\" width=\"350\" height=\"174\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLock.png 399w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLock-300x149.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4895\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLock_win.PNG\" alt=\"CalculateWithLock win\" width=\"350\" height=\"145\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLock_win.PNG 652w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLock_win-300x124.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Maximum_optimization-3\"><\/span>Maximum optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4896\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLockOpt.png\" alt=\"CalculateWithLockOpt\" width=\"350\" height=\"174\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLockOpt.png 399w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLockOpt-300x149.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4897\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLockOpt_win.PNG\" alt=\"CalculateWithLockOpt win\" width=\"350\" height=\"145\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLockOpt_win.PNG 652w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithLockOpt_win-300x124.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/p>\n<p>The non-optimized version with the lock is 4 &#8211; 12 times slower than the maximum optimized. The maximum optimized about 10 &#8211; 40 times slower than <span style=\"font-family: courier new,courier;\">std::accumulate.<\/span><\/p>\n<p>Finally atomics.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Protection_by_an_atomic\"><\/span>Protection by an atomic<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The same question for locks holds also for atomics.<\/p>\n<ol>\n<li>How expensive is the synchronization of a lock?<\/li>\n<li>How fast can a lock be if no concurrent access to a variable takes place?<\/li>\n<\/ol>\n<p>But there is an additional question. What is the performance difference between an atomic and a lock?<\/p>\n<p>&nbsp;<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff; overflow: auto; width: auto; gray;border-width: .1em .1em .1em .8em;\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"> 1\r\n 2\r\n 3\r\n 4\r\n 5\r\n 6\r\n 7\r\n 8\r\n 9\r\n10\r\n11\r\n12\r\n13\r\n14\r\n15\r\n16\r\n17\r\n18\r\n19\r\n20\r\n21\r\n22\r\n23\r\n24\r\n25\r\n26\r\n27\r\n28\r\n29\r\n30\r\n31\r\n32\r\n33\r\n34\r\n35\r\n36\r\n37\r\n38\r\n39\r\n40\r\n41\r\n42\r\n43\r\n44\r\n45\r\n46\r\n47\r\n48\r\n49\r\n50<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008000;\">\/\/ calculateWithAtomic.cpp<\/span>\r\n\r\n<span style=\"color: #0000ff;\">#include &lt;atomic&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;chrono&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;iostream&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;numeric&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;random&gt;<\/span>\r\n<span style=\"color: #0000ff;\">#include &lt;vector&gt;<\/span>\r\n\r\nconstexpr <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> size= 100000000;\r\n\r\n<span style=\"color: #2b91af;\">int<\/span> main(){\r\n\r\n  std::cout &lt;&lt; std::endl;\r\n\r\n  std::vector&lt;<span style=\"color: #2b91af;\">int<\/span>&gt; randValues;\r\n  randValues.reserve(size);\r\n\r\n  <span style=\"color: #008000;\">\/\/ random values<\/span>\r\n  std::random_device seed;\r\n  std::mt19937 engine(seed());\r\n  std::uniform_int_distribution&lt;&gt; uniformDist(1,10);\r\n  <span style=\"color: #0000ff;\">for<\/span> ( <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span> i=0 ; i&lt; size ; ++i) randValues.push_back(uniformDist(engine));\r\n  \r\n  std::atomic&lt;<span style=\"color: #2b91af;\">unsigned<\/span> <span style=\"color: #2b91af;\">long<\/span> <span style=\"color: #2b91af;\">long<\/span>&gt; add={};\r\n  std::cout &lt;&lt; std::boolalpha &lt;&lt; <span style=\"color: #a31515;\">\"add.is_lock_free(): \"<\/span> &lt;&lt; add.is_lock_free() &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; std::endl;\r\n \r\n  <span style=\"color: #0000ff;\">auto<\/span> start = std::chrono::system_clock::now();\r\n  \r\n  <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #0000ff;\">auto<\/span> i: randValues) add+= i;\r\n \r\n  std::chrono::duration&lt;<span style=\"color: #2b91af;\">double<\/span>&gt; dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; add &lt;&lt; std::endl;\r\n    \r\n  std::cout &lt;&lt; std::endl;\r\n  \r\n  add=0;\r\n  start = std::chrono::system_clock::now();\r\n  \r\n  <span style=\"color: #0000ff;\">for<\/span> (<span style=\"color: #0000ff;\">auto<\/span> i: randValues) add.fetch_add(i,std::memory_order_relaxed);\r\n  \r\n  dur= std::chrono::system_clock::now() - start;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Time for addition \"<\/span> &lt;&lt; dur.count() &lt;&lt; <span style=\"color: #a31515;\">\" seconds\"<\/span> &lt;&lt; std::endl;\r\n  std::cout &lt;&lt; <span style=\"color: #a31515;\">\"Result: \"<\/span> &lt;&lt; add &lt;&lt; std::endl;\r\n    \r\n  std::cout &lt;&lt; std::endl;\r\n\r\n}\r\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The program is special. First, I ask in line 26 if the atomic has a lock. That is crucial because otherwise, there is no difference between using locks and atomics. On all mainstream platforms, I knew atomics use no lock. Second I calculate the sum in two ways. I use in line 31 the += operator; in line 42, the method <span style=\"font-family: courier new,courier;\">fetch_add<\/span>. Both variants have, in the single-threaded case, a comparable performance, but I can explicitly specify in the case of <span style=\"font-family: courier new,courier;\">fetch_add<\/span> the memory model. More about that point in the next post.<\/p>\n<p>&nbsp;<\/p>\n<p>But now the numbers.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Without_optimization-4\"><\/span>Without optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>&nbsp;<img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4898\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomic.png\" alt=\"CalculateWithAtomic\" width=\"350\" height=\"242\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomic.png 399w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomic-300x208.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4899\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomic_win.PNG\" alt=\"CalculateWithAtomic win\" width=\"350\" height=\"220\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomic_win.PNG 652w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomic_win-300x189.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Maximum_optimization-4\"><\/span>Maximum optimization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4900\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomicOpt.png\" alt=\"CalculateWithAtomicOpt\" width=\"350\" height=\"242\" style=\"margin: 15px;\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomicOpt.png 399w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomicOpt-300x208.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4901\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomicOpt_win.PNG\" alt=\"CalculateWithAtomicOpt win\" width=\"350\" height=\"220\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomicOpt_win.PNG 652w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/CalculateWithAtomicOpt_win-300x189.png 300w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/p>\n<p>In the case of Linux, the atomics are 1.5 times slower, and in the case of windows eight times slower than the <span style=\"font-family: courier new,courier;\">std::accumulate<\/span> algorithm. That changes even more in the case of optimization. Now Linux is 15 times; Windows is 50 times faster.<\/p>\n<p>I want to stress two points.<\/p>\n<ol>\n<li>Atomics are 2 &#8211; 3 times faster on Linux and Windows than locks.<\/li>\n<li>Linux is, in particular, for atomics two times faster than Windows.<\/li>\n<\/ol>\n<h2><span class=\"ez-toc-section\" id=\"All_numbers_compact\"><\/span>All numbers compact<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>How lost the orientation because of the number? Here is the overview in seconds.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-4902\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/SingleThreadedAdditionEng.png\" alt=\"SingleThreadedAdditionEng\" style=\"margin: 10px;\" width=\"607\" height=\"276\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/SingleThreadedAdditionEng.png 607w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2016\/09\/SingleThreadedAdditionEng-300x136.png 300w\" sizes=\"auto, (max-width: 607px) 100vw, 607px\" \/><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Whats_next\"><\/span>What&#8217;s next?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Single-threaded becomes multithreaded in the<a href=\"https:\/\/www.modernescpp.com\/index.php\/multithreaded-summation-of-a-vector\"> next post<\/a>. In the first step, the summation variable add becomes a shared variable used by four threads. In the second step, <span style=\"font-family: courier new,courier;\">add<\/span> will be atomic.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is the fastest way to add the elements of a std::vector? This a question that I will pursue in the following posts. I use the single-threaded addition as the reference number. In further posts, I discuss atomics, locks, and thread-local data.<\/p>\n","protected":false},"author":21,"featured_media":4886,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[369],"tags":[434,430,470],"class_list":["post-4903","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-multithreading-application","tag-atomics","tag-lock","tag-performance"],"_links":{"self":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/4903","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/comments?post=4903"}],"version-history":[{"count":1,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/4903\/revisions"}],"predecessor-version":[{"id":6951,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/4903\/revisions\/6951"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media\/4886"}],"wp:attachment":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media?parent=4903"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/categories?post=4903"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/tags?post=4903"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}