{"id":10786,"date":"2025-06-16T10:24:36","date_gmt":"2025-06-16T10:24:36","guid":{"rendered":"https:\/\/www.modernescpp.com\/?p=10786"},"modified":"2025-07-03T14:44:53","modified_gmt":"2025-07-03T14:44:53","slug":"data-parallel-types-simd","status":"publish","type":"post","link":"https:\/\/www.modernescpp.com\/index.php\/data-parallel-types-simd\/","title":{"rendered":"Data-Parallel Types (SIMD)"},"content":{"rendered":"\n<p>The data-parallel types (SIMD) library provides data-parallel types and operations on them. Today, I want to take a closer look at this.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"755\" height=\"509\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2025\/06\/Time26ConcurrencySimd-1.png\" alt=\"\" class=\"wp-image-10788\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2025\/06\/Time26ConcurrencySimd-1.png 755w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2025\/06\/Time26ConcurrencySimd-1-300x202.png 300w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2025\/06\/Time26ConcurrencySimd-1-705x475.png 705w\" sizes=\"auto, (max-width: 755px) 100vw, 755px\" \/><\/figure>\n\n\n\n<p>Before diving into the new library, I would like to take a moment to discuss SIMD.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">SIMD<\/h2>\n\n\n\n<p>Vectorization refers to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/SIMD\">SIMD<\/a> (<strong>S<\/strong>ingle <strong>I<\/strong>nstruction, <strong>M<\/strong>ultiple <strong>D<\/strong>ata) extensions of a modern processor&#8217;s instruction set. SIMD enables your processor to execute one operation in parallel on several data.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A Simple Example<\/h3>\n\n\n\n<p>Whether an algorithm runs in a parallel and vectorized way depends on many factors. It depends on whether the CPU and the operating system support SIMD instructions. 
Additionally, it\u2019s a question of the compiler and the optimization level you used to compile your code.<\/p>\n\n\n\n<!-- HTML generated using hilite.me --><div style=\"background: #f0f3f3; overflow:auto;width:auto;gray;border-width:.1em .1em .1em .8em\"><pre style=\"margin: 0; line-height: 125%\"><span style=\"color: #0099FF; font-style: italic\">\/\/ SIMD.cpp<\/span>\n\n<span style=\"color: #006699; font-weight: bold\">const<\/span> <span style=\"color: #007788; font-weight: bold\">int<\/span> SIZE<span style=\"color: #555555\">=<\/span> <span style=\"color: #FF6600\">8<\/span>;\n\n<span style=\"color: #007788; font-weight: bold\">int<\/span> vec[]<span style=\"color: #555555\">=<\/span>{<span style=\"color: #FF6600\">1<\/span>,<span style=\"color: #FF6600\">2<\/span>,<span style=\"color: #FF6600\">3<\/span>,<span style=\"color: #FF6600\">4<\/span>,<span style=\"color: #FF6600\">5<\/span>,<span style=\"color: #FF6600\">6<\/span>,<span style=\"color: #FF6600\">7<\/span>,<span style=\"color: #FF6600\">8<\/span>};\n<span style=\"color: #007788; font-weight: bold\">int<\/span> res[SIZE]<span style=\"color: #555555\">=<\/span>{<span style=\"color: #FF6600\">0<\/span>,};\n\n<span style=\"color: #007788; font-weight: bold\">int<\/span> <span style=\"color: #CC00FF\">main<\/span>(){\n  <span style=\"color: #006699; font-weight: bold\">for<\/span> (<span style=\"color: #007788; font-weight: bold\">int<\/span> i<span style=\"color: #555555\">=<\/span> <span style=\"color: #FF6600\">0<\/span>; i <span style=\"color: #555555\">&lt;<\/span> SIZE; <span style=\"color: #555555\">++<\/span>i) {\n    res[i]<span style=\"color: #555555\">=<\/span> vec[i]<span style=\"color: #555555\">+<\/span><span style=\"color: #FF6600\">5<\/span>; <span style=\"color: #0099FF; font-style: italic\">\/\/ (1)<\/span>\n  }\n}\n<\/pre><\/div>\n\n\n\n<p><br>Line 1 is the key line in the small program. 
Thanks to the Compiler Explorer, it is quite easy to generate the assembler instructions for Clang 3.6 with and without maximum optimization (-O3).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Without Optimization<\/h4>\n\n\n\n<p>Although my time fiddling with assembler instructions is long gone, it\u2019s evident that everything is done sequentially.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"282\" height=\"93\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2025\/06\/seq.png\" alt=\"\" class=\"wp-image-10789\"\/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">With Maximum Optimization<\/h4>\n\n\n\n<p>With maximum optimization, I get instructions that run in parallel on multiple data elements.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"468\" height=\"138\" src=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2025\/06\/vec.png\" alt=\"\" class=\"wp-image-10790\" srcset=\"https:\/\/www.modernescpp.com\/wp-content\/uploads\/2025\/06\/vec.png 468w, https:\/\/www.modernescpp.com\/wp-content\/uploads\/2025\/06\/vec-300x88.png 300w\" sizes=\"auto, (max-width: 468px) 100vw, 468px\" \/><\/figure>\n\n\n\n<p>The move operation (<code>movdqa<\/code>) and the add operation (<code>paddd<\/code>) use the special registers xmm0 and xmm1. Both registers are so-called SSE registers and have a width of 128 bits. This allows you to process 4 ints in one go. <a href=\"https:\/\/en.wikibooks.org\/wiki\/X86_Assembly\/SSE\">SSE<\/a> stands for <strong>S<\/strong>treaming <strong>S<\/strong>IMD <strong>E<\/strong>xtensions.\u00a0<\/p>\n\n\n\n<p>Unfortunately, vector instructions are highly dependent on the architecture. Neither the instructions nor the register widths are uniform. Modern Intel architectures mostly support AVX2 or even AVX-512, which enable 256-bit or 512-bit operations. This means that 8 or 16 ints can be processed in parallel. 
AVX stands for <strong>A<\/strong>dvanced <strong>V<\/strong>ector E<strong>x<\/strong>tensions. <\/p>\n\n\n\n<p>This is exactly where the new data-parallel types library comes in, offering a unified interface to vector instructions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data-parallel types (SIMD)<\/h2>\n\n\n\n<p>Before diving into the new library, a few definitions are necessary. These definitions refer to proposal <a href=\"https:\/\/www.open-std.org\/jtc1\/sc22\/wg21\/docs\/papers\/2024\/p1928r15.pdf\">P1928R15<\/a>. In total, the new library comprises six proposals.<\/p>\n\n\n\n<p><br>The set of <strong>vectorizable types<\/strong> comprises all standard integer types, character types, and the types <code>float<\/code> and <code>double<\/code>. In addition, <code>std::float16_t<\/code>, <code>std::float32_t<\/code>, and <code>std::float64_t<\/code> are vectorizable types if defined.<\/p>\n\n\n\n<p>The term <strong>data-parallel type<\/strong> refers to all enabled specializations of the <code>basic_simd<\/code> and <code>basic_simd_mask<\/code> class templates. A data-parallel object is an object of data-parallel type.<br><\/p>\n\n\n\n<p class=\"has-small-font-size\">A data-parallel type consists of one or more elements of an underlying vectorizable type, called the <strong>element type<\/strong>. The number of elements is a constant for each data-parallel type and is called the width of that type. The elements in a data-parallel type are indexed from 0 to width \u2212 1.<\/p>\n\n\n\n<p class=\"has-small-font-size\">An <strong>element-wise operation<\/strong> applies a specified operation to the elements of one or more data-parallel objects. Each such application is unsequenced with respect to the others. A <strong>unary element-wise operation<\/strong> is an element-wise operation that applies a unary operation to each element of a data-parallel object. 
A <strong>binary element-wise operation<\/strong> is an element-wise operation that applies a binary operation to corresponding elements of two data-parallel objects.<\/p>\n\n\n\n<p class=\"has-small-font-size\">After so much theory, I would now like to show a small example. This is from Matthias Kretz, author of proposal P1928R15. The example from his presentation at CppCon 2023 shows a function <code>f <\/code>that takes a vector and maps its elements to their sine values.<\/p>\n\n\n\n<p class=\"has-small-font-size\"><\/p>\n\n\n\n<!-- HTML generated using hilite.me --><div style=\"background: #f0f3f3; overflow:auto;width:auto;gray;border-width:.1em .1em .1em .8em\"><pre style=\"margin: 0; line-height: 125%\"><span style=\"color: #007788; font-weight: bold\">void<\/span> <span style=\"color: #CC00FF\">f<\/span>(std<span style=\"color: #555555\">::<\/span>vector<span style=\"color: #555555\">&lt;<\/span><span style=\"color: #007788; font-weight: bold\">float<\/span><span style=\"color: #555555\">&gt;&amp;<\/span> data) {\n    <span style=\"color: #006699; font-weight: bold\">using<\/span> floatv <span style=\"color: #555555\">=<\/span> std<span style=\"color: #555555\">::<\/span>simd<span style=\"color: #555555\">&lt;<\/span><span style=\"color: #007788; font-weight: bold\">float<\/span><span style=\"color: #555555\">&gt;<\/span>;\n    <span style=\"color: #006699; font-weight: bold\">for<\/span> (<span style=\"color: #006699; font-weight: bold\">auto<\/span> it <span style=\"color: #555555\">=<\/span> data.begin(); it <span style=\"color: #555555\">&lt;<\/span> data.end(); it <span style=\"color: #555555\">+=<\/span> floatv<span style=\"color: #555555\">::<\/span>size()) {\n        floatv v(it);\n        v <span style=\"color: #555555\">=<\/span> std<span style=\"color: #555555\">::<\/span>sin(v);\n        v.copy_to(it);\n    }\n}\n<\/pre><\/div>\n\n\n\n<p class=\"has-small-font-size\"><br>The function <code>f <\/code>takes a vector of floats (<code>data<\/code>) 
as a reference. It defines <code>floatv <\/code>as a SIMD vector of floats using<code> std::simd&lt;float&gt;. f <\/code>iterates through the vector in blocks, each block having the size of the SIMD vector.<\/p>\n\n\n\n<p class=\"has-small-font-size\">For each block: <\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Loads the block into a SIMD vector (<code>floatv v(it);<\/code>).<\/li>\n\n\n\n<li>Applies the sine function to all elements in the SIMD vector simultaneously (<code>v = std::sin(v);<\/code>).<\/li>\n\n\n\n<li>Writes the results back to the original vector (<code>v.copy_to(it);<\/code>).<\/li>\n<\/ul>\n\n\n<p>The handling of SIMD instructions will become particularly elegant when proposal <a href=\"https:\/\/www.open-std.org\/jtc1\/sc22\/wg21\/docs\/papers\/2020\/p0350r4.pdf\">P0350R4 <\/a>is implemented in C++26. For example, SIMD can then be used as a new execution policy in algorithms:<\/p>\n\n\n<!-- HTML generated using hilite.me --><div style=\"background: #f0f3f3; overflow:auto;width:auto;gray;border-width:.1em .1em .1em .8em\"><pre style=\"margin: 0; line-height: 125%\"><span style=\"color: #007788; font-weight: bold\">void<\/span> <span style=\"color: #CC00FF\">f<\/span>(std<span style=\"color: #555555\">::<\/span>vector<span style=\"color: #555555\">&lt;<\/span><span style=\"color: #007788; font-weight: bold\">float<\/span><span style=\"color: #555555\">&gt;&amp;<\/span> data) {\n    std<span style=\"color: #555555\">::<\/span>for_each(std<span style=\"color: #555555\">::<\/span>execution<span style=\"color: #555555\">::<\/span>simd, data.begin(), data.end(), [](<span style=\"color: #006699; font-weight: bold\">auto<\/span><span style=\"color: #555555\">&amp;<\/span> v) {\n        v <span style=\"color: #555555\">=<\/span> std<span style=\"color: #555555\">::<\/span>sin(v);\n    });\n}\n<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">What&#8217;s next?<\/h2>\n\n\n\n<p>In my next article, I&#8217;ll dive deeper into the new SIMD 
library.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The data-parallel types (SIMD) library provides data-parallel types and operations on them. Today, I want to take a closer look at this. Before diving into the new library, I would like to take a moment to discuss SIMD. SIMD Vectorization refers to the SIMD (Single Instruction, Multiple Data) extensions of a modern processor&#8217;s instruction set. [&hellip;]<\/p>\n","protected":false},"author":21,"featured_media":10787,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[559],"tags":[566],"class_list":["post-10786","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-c26-blog","tag-data-parallel-types"],"_links":{"self":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/10786","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/comments?post=10786"}],"version-history":[{"count":6,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/10786\/revisions"}],"predecessor-version":[{"id":10802,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/posts\/10786\/revisions\/10802"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media\/10787"}],"wp:attachment":[{"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/media?parent=10786"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/categories?post=10786"},{"taxonom
y":"post_tag","embeddable":true,"href":"https:\/\/www.modernescpp.com\/index.php\/wp-json\/wp\/v2\/tags?post=10786"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}