Parallel Algorithms of the STL with the GCC Compiler


GCC supports my favorite C++17 feature: the parallel algorithms of the Standard Template Library (STL). I recognized this a few days ago, and I'm happy to write a post about it and share my enthusiasm.



The Microsoft compiler supports the parallel algorithms since their beginning but sadly neither GCC nor Clang. I have to be precise, since GCC 9 you can use the parallel algorithms. Before I show you examples with performance numbers in my next post, I want to write about the parallel algorithms of the STL and give you the necessary information.

Parallel Algorithms of the Standard Template Library

The Standard Template Library has more than 100 algorithms for searching, counting, and manipulating ranges and their elements. With C++17, 69 of them get new overloads, and new ones are added. The overloaded, and new algorithms can be invoked with a so-called execution policy. Using an execution policy, you can specify whether the algorithm should run sequentially, in parallel, or parallel with vectorization. For using the execution policy, you have to include the header <execution>.

Execution Policy

The C++17 standard defines three execution policies:

  • std::execution::sequenced_policy
  • std::execution::parallel_policy
  • std::execution::parallel_unsequenced_policy

The corresponding policy tag specifies whether a program should run sequentially, in parallel, or parallel with vectorization.

  • std::execution::seq: runs the program sequentially

  • std::execution::par: runs the program in parallel on multiple threads

  • std::execution::par_unseq: runs the program in parallel on multiple threads and allows the interleaving of individual loops; permits a vectorized version with SIMD (Single Instruction Multiple Data).

The usage of the execution policy std::execution::par or std::execution::par_unseq allows the algorithm to run parallel or parallel and vectorized. This policy is a permission and not a requirement.

The following code snippet applies all execution policies.

std::vector<int> v = {1, 2, 3, 4, 5, 6, 7, 8, 9};

// standard sequential sort                              
std::sort(v.begin(), v.end());                             // (1)

// sequential execution
std::sort(std::execution::seq, v.begin(), v.end());        // (2)

// permitting parallel execution
std::sort(std::execution::par, v.begin(), v.end());        // (3)

// permitting parallel and vectorized execution
std::sort(std::execution::par_unseq, v.begin(), v.end());  // (4)

The example shows that you can still use the classic variant of std::sort (4). Besides, in C++17, you can specify explicitly whether the sequential (2), parallel (3), or the parallel and vectorized (4) version should be used.

Parallel and Vectorized Execution

Whether an algorithm runs in a parallel and vectorized way depends on many factors. For example, it depends on whether the CPU and the operating system support SIMD instructions. Additionally, it also depends on the compiler and the optimization level you used to translate your code.

The following example shows a simple loop for filling a vector.
const int SIZE = 8;
int vec[] = {1, 2, 3, 4, 5, 6, 7, 8};
int res[] = {0, 0, 0, 0, 0, 0, 0, 0};
int main() {
    for (int i = 0; i < SIZE; ++i) {
        res[i] = vec[i]+5;


The expression res[i] = vec[i] + 5 is the crucial line in this small example. Thanks to Compiler Explorer, we can have a closer look at the assembler instructions generated by clang 3.6.

Without Optimization

Here are the assembler instructions. Each addition is done sequentially.


With Maximum Optimization

By using the highest optimization level, -O3, special registers such as xmm0 are used that can hold 128 bits or 4 ints. This special register means that the addition takes place in parallel on four elements of the vector.


An overload of an algorithm without an execution policy and an overload of an algorithm with a sequential execution policy std::execution::seq differ in one aspect: exceptions.



If an exception occurs during the usage of an algorithm with an execution policy,std::terminate is called. std::terminate calls the installedstd::terminate_handler. The consequence is that per default std::abort is called, which causes abnormal program termination. The handling of exceptions is the difference between an algorithm’s invocation without an execution policy and an algorithm with a sequential std::execution::seq execution policy. The invocation of the algorithm without an execution policy propagates the exception, and, therefore, the exception can be handled.

With C++17, 69 of the STL algorithms got new overloads, and new algorithms were added.


Here are the 69 algorithms with parallelized versions.


The New Algorithms

The new algorithm in C++17, which are design for parallel execution, are in the std namespace and need the header <numeric>.

  • std::exclusive_scan: Applies from the left a binary callable up to the ith (exclusive) element of the range. The left argument of the callable is the previous result. Stores intermediate results.
  • std::inclusive_scan: Applies from the left a binary callable up to the ith (inclusive) element of the range. The left argument of the callable is the previous result. Stores intermediate results.
  • std::transform_exclusive_scan: First applies a unary callable to the range and then applies std::exclusive_scan.
  • std::transform_inclusive_scan: First applies a unary callable to the range and then applies std::inclusive_scan.
  • std::reduce: Applies a binary callable to the range.
  • std::transform_reduce: Applies first a unary callable to one or a binary callable to two ranges and then std::reduce to the resulting range.

Admittedly this description is not easy to digest, but if you already know std::accumulate and std::partial_sum, the reduce and scan variations should be quite familiar. std::reduce is the parallel pendant to std::accumulate and scan the parallel pendant to partial_sum. The parallel execution is the reason that std::reduce needs an associative and commutative callable. The corresponding statement hold for the scan variations in contrary to the partial_sum variations. To get the full details, visit

You may wonder, why we need std::reduce for parallel execution because we already have std::accumulate. The reason is that std::accumulate processes its elements in an order that cannot be parallelized.

std::accumulate versus std::reduce

While std::accumulate processes its elements from left to the right, std::reduce does it in an arbitrary order. Let me start with a small code snippet using std::accumulate and std::reduce. The callable is the lambda function [](int a, int b){ return a * b; }.


std::vector<int> v{1, 2, 3, 4};

std::accumulate(v.begin(), v.end(), 1, [](int a, int b){ return a * b; });
std::reduce(std::execution::par, v.begin(), v.end(), 1 , [](int a, int b){ return a * b; });


The two following graphs show the different processing strategies of std::accumulate and std::reduce.

  • std::accumulate starts at the left and successively applies the binary operator.


  • On the contrary, std::reduce applies the binary operator in a non-deterministic way.



The associativity of the callable allows the std::reduce algorithm to apply the reduction step on arbitrary adjacents pairs of elements. Thanks to commutativity, the intermediate results can be computed in an arbitrary order.

What's next?

As promised, my next post uses parallel algorithms of the STL and provides performance numbers for the Microsoft compiler and the GCC.

Five Vouchers for Stephan Roth's Book "Clean C++20" to Win

I give away five vouchers for Stephan Roth's book "Clean C++20", sponsored by the book's publisher Apress. Here is how you can get it:


Thanks a lot to my Patreon Supporters: Matt Braun, Roman Postanciuc, Tobias Zindl, Marko, G Prvulovic, Reinhold Dröge, Abernitzke, Frank Grimm, Sakib, Broeserl, António Pina, Sergey Agafyin, Андрей Бурмистров, Jake, GS, Lawton Shoemake, Animus24, Jozo Leko, John Breland, espkk, Louis St-Amour, Venkat Nandam, Jose Francisco, Douglas Tinkham, Kuchlong Kuchlong, Robert Blanch, Truels Wissneth, Kris Kafka, Mario Luoni, Neil Wang, Friedrich Huber, lennonli, Pramod Tikare Muralidhara, Peter Ware, Tobi Heideman, Daniel Hufschläger, Red Trip, Alexander Schwarz, Tornike Porchxidze, Alessandro Pezzato, Evangelos Denaxas, Bob Perry, Satish Vangipuram, Andi Ireland, Richard Ohnemus, Michael Dunsky, Dimitrov Tsvetomir, Leo Goodstadt, Eduardo Velasquez, John Wiederhirn, Yacob Cohen-Arazi, Florian Tischler, Robin Furness, and Michael Young.


Thanks in particular to Jon Hess, Lakshman, Christian Wittenhorst, Sherhy Pyton, Dendi Suhubdy, Sudhakar Belagurusamy, Richard Sargeant, and Rusty Fleming.



My special thanks to Embarcadero CBUIDER STUDIO FINAL ICONS 1024 Small



I'm happy to give online seminars or face-to-face seminars worldwide. Please call me if you have any questions.

Bookable (Online)


Standard Seminars (English/German)

Here is a compilation of my standard seminars. These seminars are only meant to give you a first orientation.


Contact Me

Modernes C++,




0 #1 Marlon 2021-07-19 08:19
Wow, thanks for sharing, this is probably my favorite feature in C++17!

I wonder why the committee chose to add the ExecutionPolicy&& as the first parameter instead of the last one. Any design motivation behind this?

My Newest E-Books

Course: Modern C++ Concurrency in Practice

Course: C++ Standard Library including C++14 & C++17

Course: Embedded Programming with Modern C++

Course: Generic Programming (Templates)

Course: C++ Fundamentals for Professionals

Interactive Course: The All-in-One Guide to C++20

Subscribe to the newsletter (+ pdf bundle)

Blog archive

Source Code


Today 2021

Yesterday 8162

Week 18891

Month 137093

All 7404933

Currently are 209 guests and no members online

Kubik-Rubik Joomla! Extensions

Latest comments