Multithreaded: Summation of a Vector

Contents[Show]

My goal is it to sum up all elements of a vector. I used in the last post a single thread. In this post I use multiple threads and therefore the full power of my PC. The addition will be done on a shared variable. What at first glance seems like a good idea is a very naive strategy. The synchronization overhead of the summation variable is higher than the performance benefit of my four or two cores.

 

The strategy

I sum up 100 000 000 million random numbers between 1 and 10 In accordance with my last post. To be sure that my calculation is right I reduce the randomness. So I use no seed and I get each time the same random numbers on my two architectures. Therefore it's easy to verify my total result. Both calculations will run on my 4 CPU Linux and my 2 CPU Windows PC. As ever with maximum and without optimization. On Windows I was very puzzled. 

What are the interesting questions?

  1. How differs locks and atomics?
  2. What's the difference between the single threaded and the multithreading execution of std::accumulate?

Protection of the shared variable with the std::lock_guard

The simplest way to protect a shared variable is to wrap a mutex in a lock.

 

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
// synchronizationWithLock.cpp

#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <utility>
#include <vector>

constexpr long long size= 100000000;   

constexpr long long firBound=  25000000;
constexpr long long secBound=  50000000;
constexpr long long thiBound=  75000000;
constexpr long long fouBound= 100000000;

std::mutex myMutex;

void sumUp(unsigned long long& sum, const std::vector<int>& val, unsigned long long beg, unsigned long long end){
    for (auto it= beg; it < end; ++it){
        std::lock_guard<std::mutex> myLock(myMutex);
        sum+= val[it];
    }
}

int main(){

  std::cout << std::endl;

  std::vector<int> randValues;
  randValues.reserve(size);

  std::mt19937 engine;
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));
 
  unsigned long long sum= 0;
  auto start = std::chrono::system_clock::now();
  
  std::thread t1(sumUp,std::ref(sum),std::ref(randValues),0,firBound);
  std::thread t2(sumUp,std::ref(sum),std::ref(randValues),firBound,secBound);
  std::thread t3(sumUp,std::ref(sum),std::ref(randValues),secBound,thiBound);
  std::thread t4(sumUp,std::ref(sum),std::ref(randValues),thiBound,fouBound);   
  
 
  t1.join();
  t2.join();
  t3.join();
  t4.join();
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << sum << std::endl;

  std::cout << std::endl;

}

 

The program is easy to explain. The function sumUp (line 20 - 25) is the work package, each thread has to perform. This work package consists of the summation variable sum and the std::vector val, both getting by reference. beg and end limit the range on which the summation takes place. As already mentioned I use a std::lock_guard (line 22) to protect the shared variable. Each thread line 41 - 44 does a quarter of the work. 

Here are the numbers of the program.

Without optimization

synchronizeWithLocksynchronizeWithLock win

Maximum optimization

synchronizeWithLockOptsynchronizeWithLockOpt win

The bottleneck of the program is the shared variable, expensive protected by a std::lock_guard. Therefore the obvious solution is to replace the heavyweight lock by a lightweight atomic.

Addition with an atomic

The variable sum is atomic. So I can skip the std::lock_guard in the function sumUp (line 18 - 22). That was all.

 

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
// synchronizationWithAtomic.cpp

#include <atomic>
#include <chrono>
#include <iostream>
#include <random>
#include <thread>
#include <utility>
#include <vector>

constexpr long long size= 100000000;   

constexpr long long firBound=  25000000;
constexpr long long secBound=  50000000;
constexpr long long thiBound=  75000000;
constexpr long long fouBound= 100000000;

void sumUp(std::atomic<unsigned long long>& sum, const std::vector<int>& val, unsigned long long beg, unsigned long long end){
    for (auto it= beg; it < end; ++it){
        sum+= val[it];
    }
}

int main(){

  std::cout << std::endl;

  std::vector<int> randValues;
  randValues.reserve(size);

  std::mt19937 engine;
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));
 
  std::atomic<unsigned long long> sum(0);
  auto start = std::chrono::system_clock::now();
  
  std::thread t1(sumUp,std::ref(sum),std::ref(randValues),0,firBound);
  std::thread t2(sumUp,std::ref(sum),std::ref(randValues),firBound,secBound);
  std::thread t3(sumUp,std::ref(sum),std::ref(randValues),secBound,thiBound);
  std::thread t4(sumUp,std::ref(sum),std::ref(randValues),thiBound,fouBound);   
  
 
  t1.join();
  t2.join();
  t3.join();
  t4.join();
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << sum << std::endl;

  std::cout << std::endl;

}

 

Without optimization

synchronizeWithAtomicsynchronizeWithAtomic win

Maximum optimization

synchronizeWithAtomicOptsynchronizeWithAtomicOpt win

A strange phenomenon

If you study the numbers carefully you will notice a strange phenomenon on Windows. The maximum optimized program is slower than the non-optimized. That observation will also hold for the next two variations. This puzzled me. I executed the program in addition on a virtualized Windows 8 PC with only one core. Here the optimized version was faster. Something strange is going on with my Windows 10 PC and atomics.

Beside += there is a further way to calculate the sum of an atomic with fetch_add. Let's try it out. The numbers should be similar.

 

Addition with fetch_add

The change in the source code is minimal. I have only to touch the line 20.

 

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
// synchronizationWithFetchAdd.cpp

#include <atomic>
#include <chrono>
#include <iostream>
#include <random>
#include <thread>
#include <utility>
#include <vector>

constexpr long long size= 100000000;   

constexpr long long firBound=  25000000;
constexpr long long secBound=  50000000;
constexpr long long thiBound=  75000000;
constexpr long long fouBound= 100000000;

void sumUp(std::atomic<unsigned long long>& sum, const std::vector<int>& val, unsigned long long beg, unsigned long long end){
    for (auto it= beg; it < end; ++it){
		sum.fetch_add(val[it]);
    }
}

int main(){

  std::cout << std::endl;

  std::vector<int> randValues;
  randValues.reserve(size);

  std::mt19937 engine;
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));
 
  std::atomic<unsigned long long> sum(0);
  auto start = std::chrono::system_clock::now();
  
  std::thread t1(sumUp,std::ref(sum),std::ref(randValues),0,firBound);
  std::thread t2(sumUp,std::ref(sum),std::ref(randValues),firBound,secBound);
  std::thread t3(sumUp,std::ref(sum),std::ref(randValues),secBound,thiBound);
  std::thread t4(sumUp,std::ref(sum),std::ref(randValues),thiBound,fouBound);   
  
 
  t1.join();
  t2.join();
  t3.join();
  t4.join();
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << sum << std::endl;

  std::cout << std::endl;

}

 

Without optimization

synchronizeWithFetchAddsynchronizeWithFetchAdd win

Maximum optimization

synchronizeWithFetchAddOptsynchronizeWithFetchAddOpt win

Strictly speaking is the fetch_add variation no improvement to the += variation but the contrary. The += variation is more intuitive. But wait there is a small difference.

Addition with fetch_add and relaxed semantic

The default behaviour for atomics is sequential consistency. This statement is true for the addition and assignment of an atomic and of course for the fetch_add variant. But we can do better. Let's adjust the memory model with the fetch variations. That's the final step in my optimization. You see it in line 20.

 

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
// synchronizationWithFetchAddRelaxed.cpp

#include <atomic>
#include <chrono>
#include <iostream>
#include <random>
#include <thread>
#include <utility>
#include <vector>

constexpr long long size= 100000000;   

constexpr long long firBound=  25000000;
constexpr long long secBound=  50000000;
constexpr long long thiBound=  75000000;
constexpr long long fouBound= 100000000;

void sumUp(std::atomic<unsigned long long>& sum, const std::vector<int>& val, unsigned long long beg, unsigned long long end){
    for (auto it= beg; it < end; ++it){
		sum.fetch_add(val[it],std::memory_order_relaxed);
    }
}

int main(){

  std::cout << std::endl;

  std::vector<int> randValues;
  randValues.reserve(size);

  std::mt19937 engine;
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));
 
  std::atomic<unsigned long long> sum(0);
  auto start = std::chrono::system_clock::now();
  
  std::thread t1(sumUp,std::ref(sum),std::ref(randValues),0,firBound);
  std::thread t2(sumUp,std::ref(sum),std::ref(randValues),firBound,secBound);
  std::thread t3(sumUp,std::ref(sum),std::ref(randValues),secBound,thiBound);
  std::thread t4(sumUp,std::ref(sum),std::ref(randValues),thiBound,fouBound);   
  
 
  t1.join();
  t2.join();
  t3.join();
  t4.join();
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << sum << std::endl;

  std::cout << std::endl;

}

 

The question is. Why is it ok to use the relaxed-semantic in line 20? relaxed-semantic will not guarantee that one threads sees the operation in another thread in the same order. But this is not necessary. Necessary is only that each addition is atomically performed. 

Does the optimization pay off?

Without optimization

synchronizeWithFetchAddRelaxedsynchronizeWithFetchAddRelaxed win

Maximum optimization

synchronizeWithFetchAddRelaxedOptsynchronizeWithFetchAddRelaxedOpt win

As expected, for Linux and GCC is the fetch_add variant with relaxed-semantic the fastest one. I'm still puzzled with Windows.

At the end all numbers together in a table.

The overview

Although I have successively optimized the access to the shared variable and improved accordingly the performance, the results are not very promising. The addition in the single threaded case with std::accumulate is far faster. To say it precisely 40 times.

MultithraedeSharedVariableEng

What's next?

I will combine in the next post the best out of the two worlds. I combine the non-synchronized summation in one thread with the power of many threads. Let's see, if I beat the performance of the single thread variant of std::accumulate.

 

 

 

 

 

title page smalltitle page small Go to Leanpub/cpplibrary "What every professional C++ programmer should know about the C++ standard library".   Get your e-book. Support my blog.

 

Comments   

0 #1 minirop 2016-09-06 09:44
would using a local variable in sumUp and only at the end do "sum += tempTotal" be faster?
Quote
0 #2 Rainer Grimm 2016-09-06 10:05
Quoting minirop:
would using a local variable in sumUp and only at the end do "sum += tempTotal" be faster?
You are totally right but that's the next post.
Quote
0 #3 monah_tuk 2016-09-09 02:36
Quoting Rainer Grimm:
You are totally right but that's the next post.


This idea is first thing that go to head :-) It totally quickly!

Also, this case when locks can be omitted at all: threads make sum for partial ranges and than partial values sums at the "sheduler" thread.

My results with/without optimization with running original tests for base line: http://pastebin.com/ZDCb4QXE

Note: "-o2" suffixes indicates that this test like test without "-o2" but built with "-O2".

Interesting thing: case without locks built without optimization slower then atomic/lock one.

Test without locks: http://pastebin.com/Hzkg5DZE

All tests was built with G++ 6.2.0
Quote
0 #4 Music Albums 2016-11-15 00:45
Wow! In the end I got a website from where I know how to genuinely take
helpful facts concerning my study and knowledge.
Quote
0 #5 Music Albums 2016-12-26 10:52
Wonderful web site. Lots of helpful information here. I am sending it to some
pals ans additionally sharing in delicious. And certainly, thanks for your sweat!
Quote

Add comment


My Newest E-Book

Latest comments

Subscribe to the newsletter (+ pdf bundle)

Blog archive

Source Code

Visitors

Today 424

All 252655

Currently are 213 guests and no members online